Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.
Role Summary:
We are seeking a Senior Site Reliability Engineer - SMTS at the Senior Associate level to join our team in Bangalore - Whitefield, working in a hybrid setting. This role offers the opportunity to lead efforts in enhancing the reliability, scalability, and observability of our cloud infrastructure and services. You will work closely with development teams to design, implement, and maintain monitoring solutions that ensure optimal system performance. This position reports to a Senior Member of Technical Staff - Software Development.
Team Summary:
The Logging, Metrics, and Monitoring (LMM) team is a key driver in building and delivering observability services and tools for engineering teams within Cloud Engineering & Operations and Research & Development. Our solutions are critical and are used daily by hundreds of developers to develop, monitor, troubleshoot, and optimize web services. We manage large-scale, distributed, fault-tolerant systems that collect and host vast volumes of log and metric data, enabling data-driven decision-making across the organization. Our work directly impacts the productivity of engineering teams across athenaNation, empowering them to innovate faster and operate more reliably. In this role, you will address a wide range of challenges, from fine-tuning system performance and scaling services to debugging complex issues. You will collaborate closely with development teams to deliver new monitoring features, enhance existing tools, and resolve critical engineering problems, all within an agile environment that leverages both private and public cloud platforms.
Essential Job Responsibilities:
- Develop and maintain scalable observability solutions using Kubernetes, Prometheus, Grafana, and the ELK stack.
- Manage cloud infrastructure and automation using AWS and Terraform to support monitoring and logging systems.
- Analyze system metrics and logs to identify performance bottlenecks and reliability issues.
- Collaborate with software development teams to integrate monitoring and alerting into applications and services.
- Automate deployment, configuration, and management of monitoring tools and infrastructure.
- Lead incident response efforts and conduct root cause analysis to improve system resilience.
- Contribute to the design and implementation of infrastructure as code and continuous delivery pipelines.
Additional Job Responsibilities:
- Assist in capacity planning and infrastructure scaling to meet evolving business needs.
- Support documentation of system architecture, operational procedures, and troubleshooting guides.
- Participate in knowledge sharing and mentoring within the team and broader engineering community.
- Evaluate and recommend new tools and technologies to enhance observability and reliability.
- Engage in cross-functional projects to improve platform stability and performance.
- Support compliance and security initiatives related to infrastructure and monitoring systems.
- Contribute to continuous improvement initiatives for operational processes and tooling.
Expected Education & Experience:
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 5 to 8 years of experience in site reliability engineering, systems engineering, or a related discipline.
- Hands-on expertise in Kubernetes and Amazon EKS, including cluster management, workload deployment, troubleshooting, and implementing CI/CD-driven Helm chart releases for scalable and reliable application delivery.
- Strong scripting and automation skills using languages such as Python, Bash, or similar.
- Experience managing large-scale, distributed, fault-tolerant systems.
- Familiarity with infrastructure as code and CI/CD pipelines.
- Excellent analytical and problem-solving skills with the ability to collaborate effectively across teams.