We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. The ideal candidate will have 3-5 years of DevOps/SRE experience with a strong focus on on-premises or self-managed Kubernetes environments. This role involves deploying, operating, monitoring, and troubleshooting Kubernetes clusters and Linux infrastructure to ensure high availability, reliability, and performance of production systems.
Responsibilities
- Administer and support Linux-based systems in production environments.
- Deploy, manage, and troubleshoot applications running on Linux.
- Perform root cause analysis for OS-level issues to maintain high availability and performance.
- Ensure system stability, security hardening, and performance tuning.
- Deploy, configure, and maintain on-premises or self-managed Kubernetes clusters (bare metal or VM-based).
- Troubleshoot Kubernetes issues related to pod scheduling, networking, and storage; cluster components and service failures; and application deployment and scaling.
- Debug containerised workloads and ensure reliable rollouts.
- Manage Kubernetes resources such as Deployments, Services, ConfigMaps, Secrets, and CronJobs.
- Work with Argo CD / Argo Workflows for Kubernetes-native application delivery and workflows (mandatory).
- Implement and maintain Prometheus and Grafana for infrastructure and application monitoring.
- Create and manage real-time Grafana dashboards for cluster health, application metrics, and alerts.
- Analyse monitoring data to proactively identify and resolve performance and reliability issues.
- Support incident response using observability tools.
- Configure and manage Kubernetes CronJobs and Linux-based scheduled tasks.
- Troubleshoot failed or delayed automation jobs.
- Improve operational efficiency through scripting and automation.
- Develop automation using Shell, Python, or Ansible.
- Gain working knowledge of Horizon/platform portals used for infrastructure or operational visibility.
- Monitor and track infrastructure health and incidents using internal portals.
- Utilise portals for operational insights and incident management.
- Understand basic cloud computing concepts and architectures.
- Provide support or troubleshooting for cloud-related dependencies when required.
- Note: This role is primarily focused on on-prem Kubernetes, not public cloud operations.
Requirements
- Bachelor's degree in computer science, IT, or a related field (or equivalent hands-on experience).
- 3-5 years of experience in a DevOps / SRE / Production Support role.
- Strong expertise in Linux system administration.
- Hands-on experience with on-prem, self-managed, or unmanaged Kubernetes clusters.
- Proven ability to deploy, debug, and troubleshoot Kubernetes environments.
- Strong experience with Prometheus and Grafana.
- Mandatory exposure to Argo CD / Argo Workflows.
- Experience with automation and scripting (Shell, Python, Ansible).
- Ability to handle production incidents independently.
- Excellent troubleshooting, analytical, and communication skills.
Preferred Qualifications
- Kubernetes certifications, such as CKA or CKAD.
- Experience with CI/CD pipelines integrated with Kubernetes.
- Exposure to container security, RBAC, and cluster hardening.
- Experience supporting high-availability on-prem infrastructure.
This job was posted by Sajal Saxena from CorEdge.