We are looking for a skilled engineer for an L2 TechOps role, with 4+ years of experience managing large-scale production systems. The ideal candidate has a strong background in Linux servers, SQL, and big data tools (Hive, Spark), along with hands-on experience in monitoring, troubleshooting, and automation.
Key Responsibilities
- Manage and maintain production environments ensuring high availability and reliability.
- Perform system monitoring, performance tuning, and capacity planning.
- Analyze and debug production issues by leveraging Airflow logs, Spark UI, and Hive query performance metrics.
- Build and maintain dashboards and alerts in Grafana and Kibana for proactive monitoring and issue detection.
- Monitor and troubleshoot OCP (OpenShift Container Platform) clusters and associated components.
- Write and optimize SQL queries to analyze and troubleshoot data issues.
- Collaborate with development, data engineering, and operations teams to ensure system reliability and scalability.
- Participate in on-call rotations and incident management processes.
- Automate routine operational tasks using scripting (Shell, Python, etc.).
- Ensure adherence to best practices in observability, monitoring, and incident response.
Required Skills & Experience
- 4–6 years of experience as an SRE or DevOps Engineer, or in a similar role.
- Strong expertise in Linux systems.
- Solid understanding of SQL with the ability to write and optimize queries.
- Good working knowledge of Hive and Spark; ability to use Spark UI for debugging performance issues.
- Hands-on experience in monitoring and analyzing logs using Kibana and Grafana.
- Experience in Airflow log analysis and DAG issue resolution.
- Familiarity with OCP (OpenShift) or other Kubernetes-based platforms for cluster monitoring.
- Strong analytical, debugging, and problem-solving skills.
- Scripting skills in Shell or Python for automation.
- Understanding of CI/CD and deployment best practices is a plus.
- Good working knowledge of querying tools such as JupyterHub and Metabase.
Preferred Qualifications
- Experience with cloud platforms (AWS, GCP, or Azure).
- Knowledge of Prometheus, Elastic Stack, or similar observability tools.
- Exposure to incident management and postmortem analysis.
- Familiarity with big data pipelines and distributed systems.