

Key Responsibilities
• Ensure platform uptime and application health as per SLOs/KPIs
• Monitor infrastructure and applications using ELK, Prometheus, Zabbix, etc.
• Debug and resolve complex production issues, performing root cause analysis
• Automate routine tasks and implement self-healing systems
• Design and maintain dashboards, alerts, and operational playbooks
• Participate in incident management, problem resolution, and RCA documentation
• Own and update SOPs for repeatable processes
• Collaborate with L3 and Product teams for deeper issue resolution
• Support and guide L1 operations team
• Conduct periodic system maintenance and performance tuning
• Respond to user data requests and ensure timely resolution
• Address and mitigate security vulnerabilities and compliance issues

Technical Skillset
• Hands-on experience with Spark, Hive, Cloudera Hadoop, Kafka, Ranger
• Strong Linux fundamentals and scripting (Python, Shell)
• Experience with Apache NiFi, Airflow, YARN, and ZooKeeper
• Proficient in monitoring and observability tools: ELK Stack, Prometheus, Loki
• Working knowledge of Kubernetes, Docker, Jenkins CI/CD pipelines
• Strong SQL skills (Oracle/Exadata preferred)
• Familiarity with DataHub, Data Mesh, and security best practices is a plus
• Strong problem-solving and debugging mindset
• Ability to work under pressure in a fast-paced environment
• Excellent communication and collaboration skills
• Ownership, customer orientation, and a bias for action
Job ID: 114642855
Skills:
Jenkins, Docker, Spark, Kubernetes, Airflow, Cloudera Hadoop