About The Role
We are seeking a Lead Data DevOps Engineer to design, build, and operate large-scale, distributed data infrastructure on AWS and Kubernetes (EKS).
You will lead the automation, observability, and scalability strategy for our data ecosystem powered by:
- Apache Airflow, Apache NiFi, Kafka (Strimzi), RabbitMQ,
- AWS EMR (on EC2 and EKS), AWS Glue, and Athena,
- all running on EKS with Karpenter and fully managed through Infrastructure as Code (IaC).
This role blends deep DevOps engineering expertise with data platform operations, enabling efficient, secure, and highly available data pipelines at scale.
Key Responsibilities
- Platform Automation & Infrastructure
- Design and automate data platform infrastructure across AWS and Kubernetes using Terraform, Ansible, and Helm.
- Deploy, manage, and optimize Kafka (via Strimzi Operator), RabbitMQ, and NiFi clusters on EKS.
- Automate EMR on EC2 and EMR on EKS deployments for both batch and streaming workloads.
- Implement Karpenter-based autoscaling and EKS node group optimization for compute efficiency.
- Build and manage CI/CD pipelines (Jenkins & GitLab CI) for data applications and platform components.
- Data Platform Enablement
- Collaborate with Data Engineering teams to operationalize Airflow DAGs, EMR, NiFi dataflows, and Kafka streaming pipelines.
- Support and automate AWS Glue jobs, ETL workflows, and Athena query infrastructure for data analytics and lake queries.
- Enable seamless integration across Kafka, Glue, EMR, and Athena pipelines for both real-time and batch workloads.
- Manage data lake and data warehouse automation workflows using IaC and GitOps patterns.
- Observability, Reliability & Governance
- Implement end-to-end monitoring, alerting, and logging using Prometheus, Grafana, OpenSearch, and CloudWatch.
- Define SLOs, SLIs, and SLAs for all critical data services and pipelines.
- Conduct capacity planning and cost optimization, and define autoscaling strategies for EMR, Glue, and EKS clusters.
- Enforce security controls, IAM policies, and data encryption standards across services.
- Collaborate with compliance teams to implement audit-ready data platform governance.
- Leadership & Collaboration
- Lead and mentor a small team of DevOps and Platform Engineers.
- Drive best practices for CI/CD, Infrastructure as Code, and environment automation.
- Collaborate with Data Engineering, Cloud, and Security teams to ensure seamless data platform operations.
- Evaluate emerging tools and propose continuous improvements for performance, reliability, and cost-efficiency.
Nice to Have
- Exposure to AI tools or agents, or MCP for data.
- Experience with Istio or NetworkPolicy security models.
- Prior experience in FinTech or Telecom-scale data environments.
- Familiarity with data quality / data observability tools such as DataHub, Metabase, and Schema Registry.