Job Summary
We are seeking a highly skilled Site Reliability Engineer (SRE) to architect, operate, and scale Kubernetes-based infrastructure across on-premises and cloud environments. This role emphasizes manual application deployments, observability, security, and uptime accountability, while promoting resilience, automation, and operational excellence.
You will work closely with cross-functional teams to ensure performance, availability, and reliability of mission-critical services, while evolving platform capabilities using modern tools like Terraform, Helm, and Argo CD.
Key Responsibilities
- Design, build, and manage highly available Kubernetes clusters across hybrid environments (on-premises and cloud platforms such as AWS EKS, Azure AKS).
- Deploy and manage applications manually using tools such as kubectl and Helm, with growing integration of GitOps practices (e.g., Argo CD).
- Implement and manage observability stacks using Prometheus, Grafana, Loki, and Mimir to monitor infrastructure, applications, and system performance.
- Define, monitor, and improve SLA/SLO/SLI metrics and alerting systems to ensure platform reliability.
- Automate provisioning and configuration of infrastructure using Terraform, Helm, and scripting languages (e.g., Bash, Python).
- Plan, implement, and test backup and disaster recovery (DR) strategies using tools such as Velero and Commvault.
- Manage Kubernetes-native networking, storage, and security configurations (Ceph, NFS, Ingress, Pod Security Standards, etc.).
- Configure and enforce Kubernetes security best practices using RBAC, OPA/Gatekeeper, NetworkPolicies, and secrets management tools.
- Integrate and operate Kubernetes ecosystem tools such as Karpenter, MicroK8s, Service Meshes, and kubectl plugins.
- Conduct root cause analysis (RCA) and lead resolution efforts for incidents.
- Participate in the on-call rotation for platform availability and incident management.
- Maintain up-to-date documentation, architecture diagrams, runbooks, and SOPs.
- Mentor engineers and advocate for Kubernetes, security, observability, and deployment best practices across teams.
- Stay current with industry trends in container orchestration, GitOps, security, and cloud-native tooling.
Required Qualifications
- 7-9 years of IT/Infrastructure/DevOps experience, with 5+ years in Kubernetes operations in production environments.
- Strong hands-on experience in Kubernetes architecture, cluster operations, and manual application deployment practices.
- Intermediate-level experience in Kubernetes Security, including:
  - Cluster hardening and secrets management
  - Pod Security Standards (PSS) and OPA/Gatekeeper
  - Network policies, image scanning, and runtime protections
- Intermediate experience with Argo CD for GitOps-style Kubernetes deployments.
- Solid proficiency in Linux system administration (Ubuntu, CentOS, RHEL) and troubleshooting.
- Hands-on experience with Kubernetes-native storage (e.g., Ceph, NFS) and persistent volume provisioning.
- Strong familiarity with observability tools: Grafana, Prometheus, Loki, Mimir, etc.
- Proficiency in Infrastructure as Code using Terraform, Helm, and scripting.
- Experience with Velero, Commvault, or similar for backup and DR.
- Experience operating and optimizing managed cloud Kubernetes platforms such as EKS and AKS.
- Exposure to tools like Karpenter, MicroK8s, Service Mesh, and Ingress Controllers.
- Familiarity with AI/ML workloads running on Kubernetes is a plus.
- Excellent collaboration, communication, documentation, and incident resolution skills.
Preferred Qualifications
- Kubernetes certifications: CKA, CKAD, or CKS.
- Strong understanding of container security, networking, and distributed system architecture.
- Experience using Portainer for container and Kubernetes management.
- Advanced knowledge of Grafana and other enterprise-grade observability tools.
- Experience managing large-scale Kubernetes clusters (200+ nodes) is highly preferred.
- Prior experience supporting production-grade, high-availability platforms and environments.
Why Join Us
- Help shape and operate mission-critical, modern Kubernetes infrastructure.
- Be part of a team focused on platform reliability, observability, and secure operations.
- Contribute to and influence the evolution of deployment and automation practices (GitOps, IaC).
- Access cutting-edge tools, industry best practices, and continuous learning.
Enjoy competitive compensation, flexible working options, and a growth-focused engineering culture.