Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.
Position Summary:
Join athenahealth as a Senior Site Reliability Engineer - MTS at the Associate level, based in Bangalore - Whitefield, working in a hybrid environment. This role offers an exciting opportunity to contribute to the reliability and scalability of critical cloud infrastructure and services. You will collaborate closely with engineering teams to ensure high availability and performance of our systems. This position reports directly to a Senior Manager.
About the Team:
The Logging, Metrics, and Monitoring (LMM) team plays a pivotal role in delivering observability services and tools that empower engineering teams across Cloud Engineering & Operations and Research & Development. Our team builds and maintains large-scale, distributed, fault-tolerant systems that collect, store, and analyze vast volumes of log and metric data. Hundreds of developers rely on these solutions daily to monitor, troubleshoot, and optimize web services. By providing robust observability infrastructure, the LMM team supports data-driven decision-making and continuous improvement across the organization.
Essential Job Responsibilities:
- Develop and maintain scalable, reliable logging, metrics, and monitoring systems using modern cloud-native technologies.
- Manage containerized environments leveraging Docker and Kubernetes to support application deployment and orchestration.
- Analyze system performance and reliability metrics to identify and resolve issues proactively.
- Collaborate with development teams to integrate observability best practices into the software development lifecycle.
- Automate operational processes to improve efficiency and reduce manual intervention.
- Participate in incident response and root cause analysis to enhance system resilience.
- Contribute to the design and implementation of infrastructure as code and configuration management solutions.
Additional Job Responsibilities:
- Assist in capacity planning and infrastructure scaling strategies.
- Support continuous integration and continuous deployment (CI/CD) pipelines to streamline releases.
- Document system architecture, operational procedures, and troubleshooting guides.
- Engage in knowledge sharing and mentoring within the team and broader engineering community.
- Evaluate and recommend new tools and technologies to enhance observability capabilities.
- Participate in cross-functional projects to improve overall platform reliability.
- Support compliance and security initiatives related to infrastructure and monitoring systems.
Expected Education & Experience:
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
- 2 to 5 years of experience in site reliability engineering, systems engineering, or a related role.
- Proficiency with Linux, Docker, and Kubernetes in production environments.
- Experience with logging, metrics, and monitoring tools and frameworks.
- Strong scripting and automation skills using languages such as Python, Bash, or similar.
- Familiarity with cloud platforms and infrastructure-as-code tools is preferred.
- Excellent problem-solving skills and ability to work collaboratively in a team environment.