About The Opportunity
We are a technology services organization in the IT Services / HR Technology sector, delivering cloud-hosted platforms and managed infrastructure for enterprise customers. We build and run production-grade SaaS solutions with a focus on reliability, performance, and secure operations across public cloud environments. This is an on-site Site Reliability Engineer role based in India, supporting critical production systems.
Role & Responsibilities
- Maintain service reliability and uptime for production systems through proactive monitoring, incident response, and root-cause analysis.
- Implement and operate infrastructure as code to provision, scale, and secure cloud resources across AWS environments.
- Design, build, and maintain container orchestration platforms, CI/CD pipelines, and automated deployment workflows.
- Develop and operate observability tooling (metrics, logs, traces) and dashboards to surface SLIs/SLOs and reduce MTTR.
- Automate repetitive operational tasks with scripts or small services and own runbooks for on-call rotations.
- Collaborate with development teams to improve application resiliency, capacity planning, and release practices.
Skills & Qualifications
Must-Have
- Kubernetes
- Docker
- Linux
- AWS
- Terraform
- Prometheus
- Grafana
- Jenkins
Preferred
- Python
- Golang
- HashiCorp Vault
Additional Qualifications
- Proven experience operating production services with a strong focus on reliability, automation, and observability.
- Familiarity with on-call practices, incident management workflows, and post-incident remediation.
- Ability to work on-site in India and collaborate across engineering, product, and support teams.
Benefits & Culture Highlights
- Hands-on, outcome-driven engineering culture with ownership of end-to-end production systems.
- Opportunity to influence architecture, tooling, and SRE practices for mission-critical platforms.
- Structured on-call support, knowledge-sharing forums, and career growth into platform engineering roles.
Skills: Kubernetes, Docker, AWS, Jenkins, Prometheus, Grafana, Site Reliability Engineering, Linux, Python, Terraform