We are seeking a highly skilled and motivated Senior DevOps Engineer to join our team. The ideal candidate will have 4-6 years of DevOps experience with a strong focus on on-premises or self-managed Kubernetes environments. This role involves deploying, operating, monitoring, and troubleshooting Kubernetes clusters and Linux infrastructure to ensure high availability, reliability, and performance of production systems.
Responsibilities
- Administer and support Linux-based systems in production environments.
- Deploy, manage, and troubleshoot applications running on Linux.
- Perform root cause analysis for OS-level issues to maintain high availability and performance.
- Ensure system stability, security hardening, and performance tuning.
- Deploy, configure, and maintain on-premises or self-managed Kubernetes clusters (bare metal or VM-based).
- Troubleshoot Kubernetes issues related to pod scheduling, networking, and storage; cluster components and service failures; and application deployment and scaling.
- Debug containerized workloads and ensure reliable rollouts.
- Manage Kubernetes resources such as Deployments, Services, ConfigMaps, Secrets, and CronJobs.
- Work with Argo CD / Argo Workflows for Kubernetes-native application delivery and workflow automation (mandatory).
- Implement and maintain Prometheus and Grafana for infrastructure and application monitoring.
- Create and manage real-time Grafana dashboards for cluster health, application metrics, and alerts.
- Analyze monitoring data to proactively identify and resolve performance and reliability issues.
- Support incident response using observability tools.
- Configure and manage Kubernetes CronJobs and Linux-based scheduled tasks.
- Troubleshoot failed or delayed automation jobs.
- Improve operational efficiency through scripting and automation.
- Develop automation using Shell, Python, or Ansible.
- Gain working knowledge of Horizon/platform portals used for infrastructure or operational visibility.
- Use internal portals to monitor infrastructure health, track and manage incidents, and gain operational insights.
- Understand basic cloud computing concepts and architectures.
- Provide support or troubleshooting for cloud-related dependencies when required.
- Note: This role is primarily focused on on-prem Kubernetes, not public cloud operations.
Requirements
- Bachelor's degree in computer science, IT, or a related field (or equivalent hands-on experience).
- 4-6 years of experience in a DevOps / SRE / Production Support role.
- Strong expertise in Linux system administration.
- Hands-on experience with on-prem, self-managed, or unmanaged Kubernetes clusters.
- Proven ability to deploy, debug, and troubleshoot Kubernetes environments.
- Strong experience with Prometheus and Grafana.
- Mandatory exposure to Argo CD / Argo Workflows.
- Experience with automation and scripting (Shell, Python, Ansible).
- Ability to handle production incidents independently.
- Excellent troubleshooting, analytical, and communication skills.
This job was posted by Sajal Saxena from CorEdge.