Pipeline Management: Maintain high-throughput streaming pipelines that ingest logs from various sources (Firewalls, Cloud, Endpoints) into a central destination.
Log Normalization: Write parsers to convert raw, messy logs into standard schemas (e.g., OCSF or ECS) for consistent querying (see the parse-and-route sketch after this list).
Cost Optimization: Implement routing logic to send high-value data to the SIEM and bulk data to low-cost Object Storage (Data Lake).
Data Preparation: Clean and structure data to enable AI/ML detection models and advanced analytics.
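
The sketch below is a minimal, illustrative take on the normalization and routing responsibilities above, not a reference implementation: the ECS-style field names are a loose approximation, the regex matches only a simplified syslog line, and HIGH_VALUE_SOURCES and the destination labels are hypothetical stand-ins for real routing criteria and delivery code.

```python
import json
import re
from datetime import datetime, timezone

# Illustrative only: a simplified pattern for a syslog-like firewall line.
SYSLOG_PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]+)\s(?P<host>\S+)\s(?P<proc>[\w\-/]+):\s(?P<msg>.*)$"
)

HIGH_VALUE_SOURCES = {"firewall", "edr"}  # hypothetical routing criteria


def normalize(raw_line: str, source_type: str) -> dict | None:
    """Convert a raw syslog-style line into a flat, ECS-like record."""
    match = SYSLOG_PATTERN.match(raw_line)
    if not match:
        return None  # in practice, unparsed lines would go to a dead-letter queue
    return {
        # A real parser would use the syslog timestamp; ingest time keeps this short.
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "event.original": raw_line,
        "event.module": source_type,
        "host.name": match.group("host"),
        "process.name": match.group("proc"),
        "message": match.group("msg"),
    }


def route(event: dict) -> str:
    """Pick a destination: SIEM for high-value data, object storage for bulk."""
    if event["event.module"] in HIGH_VALUE_SOURCES:
        return "siem"       # e.g., forward to the SIEM ingest endpoint
    return "data_lake"      # e.g., batch into Parquet objects in object storage


if __name__ == "__main__":
    raw = "Oct  3 14:02:11 fw01 kernel: DROP IN=eth0 SRC=10.0.0.5 DST=8.8.8.8"
    event = normalize(raw, source_type="firewall")
    if event:
        print(route(event), json.dumps(event, indent=2))
```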
Must-Have Skills
Data Engineering: Proficiency in Python (for ETL) and SQL (for complex querying).
Streaming Tech: Experience with Message Queues (e.g., Kafka, Pub/Sub) and stream processing concepts.
Log Handling: Mastery of Regex and log parsing strategies for standard formats (Syslog, CEF, JSON).
Storage Architecture: Understanding of Data Lake principles (Parquet/Avro formats) vs. Data Warehouses (see the Parquet sketch after this list).
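
To make the data-lake item concrete, here is a small sketch of landing normalized events as Parquet, assuming pyarrow is installed; the sample events, file name, and column selection are hypothetical.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical batch of already-normalized events (see the pipeline sketch above).
events = [
    {"@timestamp": "2024-01-01T00:00:00Z", "host.name": "fw01", "message": "DROP ..."},
    {"@timestamp": "2024-01-01T00:00:05Z", "host.name": "fw02", "message": "ACCEPT ..."},
]

# Columnar layout is what makes Parquet cheap to store and fast to scan in a data lake.
table = pa.Table.from_pylist(events)
pq.write_table(table, "logs-2024-01-01.parquet", compression="zstd")

# Reading back only the columns a query needs avoids scanning the full payload.
subset = pq.read_table("logs-2024-01-01.parquet", columns=["@timestamp", "host.name"])
print(subset.to_pylist())
```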
Preferred / Nice To Have
Experience with Vector Databases for storing embeddings.
Knowledge of Log Observability/Routing tools (middleware that filters, transforms, and routes logs between sources and destinations).
Familiarity with Big Data frameworks (e.g., Spark, Flink); see the sketch after this list.
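
As a rough illustration of the Spark item above, the following assumes a local PySpark installation; the Parquet path and the event_module column are hypothetical placeholders for wherever normalized events land in the lake.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-lake-rollup").getOrCreate()

# Hypothetical path: a day of normalized events already landed as Parquet in the lake.
events = spark.read.parquet("lake/events/date=2024-01-01/")

# Simple rollup: event volume per source module ("event_module" is a hypothetical
# flattened column name), a common sanity check before running heavier detection jobs.
(events
    .groupBy("event_module")
    .agg(F.count("*").alias("event_count"))
    .orderBy(F.desc("event_count"))
    .show(truncate=False))

spark.stop()
```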