Site Reliability Engineer - 2

coredge.io

Bengaluru, India

3-5 Years

Posted 2 months ago

Job Description

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. The ideal candidate will have 3-5 years of DevOps/SRE experience with a strong focus on on-premises or self-managed Kubernetes environments. This role involves deploying, operating, monitoring, and troubleshooting Kubernetes clusters and Linux infrastructure to ensure high availability, reliability, and performance of production systems.

Responsibilities

Administer and support Linux-based systems in production environments.
Deploy, manage, and troubleshoot applications running on Linux.
Perform root cause analysis for OS-level issues to maintain high availability and performance.
Ensure system stability, security hardening, and performance tuning.
Deploy, configure, and maintain on-premises or self-managed Kubernetes clusters (bare metal or VM-based).
Troubleshoot Kubernetes issues related to pod scheduling, networking, and storage; cluster components and service failures; and application deployment and scaling.
Debug containerised workloads and ensure reliable rollouts.
Manage Kubernetes resources such as Deployments, Services, ConfigMaps, Secrets, and CronJobs.
Work with Argo CD / Argo Workflows for Kubernetes-native application delivery and workflows (mandatory).
Implement and maintain Prometheus and Grafana for infrastructure and application monitoring.
Create and manage real-time Grafana dashboards for cluster health, application metrics, and alerts.
Analyse monitoring data to proactively identify and resolve performance and reliability issues.
Support incident response using observability tools.
Configure and manage Kubernetes CronJobs and Linux-based scheduled tasks.
Troubleshoot failed or delayed automation jobs.
Improve operational efficiency through scripting and automation.
Develop automation using Shell, Python, or Ansible.
Gain working knowledge of Horizon/platform portals used for infrastructure or operational visibility.
Monitor and track infrastructure health and incidents using internal portals.
Utilise portals for operational insights and incident management.
Understand basic cloud computing concepts and architectures.
Provide support or troubleshooting for cloud-related dependencies when required.
Note: This role is primarily focused on on-prem Kubernetes, not public cloud operations.

Requirements

Bachelor's degree in computer science, IT, or a related field (or equivalent hands-on experience).
3-5 years of experience in a DevOps / SRE / Production Support role.
Strong expertise in Linux system administration.
Hands-on experience with on-prem, self-managed, or unmanaged Kubernetes clusters.
Proven ability to deploy, debug, and troubleshoot Kubernetes environments.
Strong experience with Prometheus and Grafana.
Mandatory exposure to Argo CD / Argo Workflows.
Experience with automation and scripting (Shell, Python, Ansible).
Ability to handle production incidents independently.
Excellent troubleshooting, analytical, and communication skills.

Preferred Qualifications

Kubernetes certifications, such as CKA or CKAD.
Experience with CI/CD pipelines integrated with Kubernetes.
Exposure to container security, RBAC, and cluster hardening.
Experience supporting high-availability on-prem infrastructure.

This job was posted by Sajal Saxena from CorEdge.

More Info

Job Type: Permanent Job
Industry: Other
Function: Site Reliability Engineering
Employment Type: Full time

About Company

coredge.io

Job Source: www.linkedin.com

Job ID: 144077039

Last Updated: 19-05-2026 07:46:01 PM