Site Reliability Engineer - 2

Coredge.io

Bengaluru, India

3-5 Years

Posted 4 days ago

Job Description

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. The ideal candidate will have 3-5 years of DevOps/SRE experience with a strong focus on on-premises or self-managed Kubernetes environments. This role involves deploying, operating, monitoring, and troubleshooting Kubernetes clusters and Linux infrastructure to ensure high availability, reliability, and performance of production systems.

Responsibilities

Administer and support Linux-based systems in production environments.
Deploy, manage, and troubleshoot applications running on Linux.
Perform root cause analysis for OS-level issues to maintain high availability and performance.
Ensure system stability, security hardening, and performance tuning.
Deploy, configure, and maintain on-premises or self-managed Kubernetes clusters (bare metal or VM-based).
Troubleshoot Kubernetes issues related to pod scheduling, networking, and storage; cluster components and service failures; and application deployment and scaling.
Debug containerised workloads and ensure reliable rollouts.
Manage Kubernetes resources such as Deployments, Services, ConfigMaps, Secrets, and CronJobs.
Work with Argo CD / Argo Workflows for Kubernetes-native application delivery and workflow orchestration (mandatory).
Implement and maintain Prometheus and Grafana for infrastructure and application monitoring.
Create and manage real-time Grafana dashboards for cluster health, application metrics, and alerts.
Analyse monitoring data to proactively identify and resolve performance and reliability issues.
Support incident response using observability tools.
Configure and manage Kubernetes CronJobs and Linux-based scheduled tasks.
Troubleshoot failed or delayed automation jobs.
Improve operational efficiency through scripting and automation.
Develop automation using Shell, Python, or Ansible.
Gain working knowledge of Horizon/platform portals used for infrastructure or operational visibility.
Monitor and track infrastructure health and incidents using internal portals.
Utilise portals for operational insights and incident management.
Understand basic cloud computing concepts and architectures.
Provide support or troubleshooting for cloud-related dependencies when required.
Note: This role is primarily focused on on-prem Kubernetes, not public cloud operations.

Requirements

Bachelor's degree in computer science, IT, or a related field (or equivalent hands-on experience).
3-5 years of experience in a DevOps / SRE / Production Support role.
Strong expertise in Linux system administration.
Hands-on experience with on-prem, self-managed, or unmanaged Kubernetes clusters.
Proven ability to deploy, debug, and troubleshoot Kubernetes environments.
Strong experience with Prometheus and Grafana.
Mandatory exposure to Argo CD / Argo Workflows.
Experience with automation and scripting (Shell, Python, Ansible).
Ability to handle production incidents independently.
Excellent troubleshooting, analytical, and communication skills.

Preferred Qualifications