Coredge.io

Site Reliability Engineer - 2

  • Posted 4 days ago

Job Description

We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our team. The ideal candidate will have 3-5 years of DevOps/SRE experience with a strong focus on on-premises or self-managed Kubernetes environments. This role involves deploying, operating, monitoring, and troubleshooting Kubernetes clusters and Linux infrastructure to ensure high availability, reliability, and performance of production systems.

Responsibilities

  • Administer and support Linux-based systems in production environments.
  • Deploy, manage, and troubleshoot applications running on Linux.
  • Perform root cause analysis for OS-level issues to maintain high availability and performance.
  • Ensure system stability, security hardening, and performance tuning.
  • Deploy, configure, and maintain on-premises or self-managed Kubernetes clusters (bare metal or VM-based).
  • Troubleshoot Kubernetes issues related to pod scheduling, networking, and storage; cluster components and service failures; and application deployment and scaling.
  • Debug containerised workloads and ensure reliable rollouts.
  • Manage Kubernetes resources such as Deployments, Services, ConfigMaps, Secrets, and CronJobs.
  • Work with Argo CD / Argo Workflows for Kubernetes-native application delivery and workflows (mandatory).
  • Implement and maintain Prometheus and Grafana for infrastructure and application monitoring.
  • Create and manage real-time Grafana dashboards for cluster health, application metrics, and alerts.
  • Analyse monitoring data to proactively identify and resolve performance and reliability issues.
  • Support incident response using observability tools.
  • Configure and manage Kubernetes CronJobs and Linux-based scheduled tasks.
  • Troubleshoot failed or delayed automation jobs.
  • Improve operational efficiency through scripting and automation.
  • Develop automation using Shell, Python, or Ansible.
  • Gain working knowledge of Horizon/platform portals used for infrastructure or operational visibility.
  • Monitor and track infrastructure health and incidents using internal portals.
  • Utilise portals for operational insights and incident management.
  • Understand basic cloud computing concepts and architectures.
  • Provide support or troubleshooting for cloud-related dependencies when required.
  • Note: This role is primarily focused on on-prem Kubernetes, not public cloud operations.

Requirements

  • Bachelor's degree in computer science, IT, or a related field (or equivalent hands-on experience).
  • 3-5 years of experience in a DevOps / SRE / Production Support role.
  • Strong expertise in Linux system administration.
  • Hands-on experience with on-prem, self-managed, or unmanaged Kubernetes clusters.
  • Proven ability to deploy, debug, and troubleshoot Kubernetes environments.
  • Strong experience with Prometheus and Grafana.
  • Mandatory exposure to Argo CD / Argo Workflows.
  • Experience with automation and scripting (Shell, Python, Ansible).
  • Ability to handle production incidents independently.
  • Excellent troubleshooting, analytical, and communication skills.

Preferred Qualifications

  • Kubernetes certifications, such as CKA or CKAD.
  • Experience with CI/CD pipelines integrated with Kubernetes.
  • Exposure to container security, RBAC, and cluster hardening.
  • Experience supporting high-availability on-prem infrastructure.

This job was posted by Sajal Saxena from CorEdge.

Job ID: 144077039