Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.
Join Our Cloud Infrastructure Engineering & Operations Division as a Senior Site Reliability Engineer
We are seeking a highly skilled Senior Site Reliability Engineer to elevate our Cloud Infrastructure Engineering & Operations team. Your primary mission will be to enhance the performance, reliability, and scalability of our platforms by spearheading the development of a world-class observability ecosystem that drives business success.
The Team:
The Logging, Metrics, and Monitoring (LMM) team is at the forefront of building and delivering observability services and tools for our engineering communities within the Cloud Engineering & Operations and Research & Development zones. Our solutions are critical: hundreds of developers use them daily to develop, monitor, troubleshoot, and optimize our web services. We manage large-scale, distributed, fault-tolerant systems that collect and host vast volumes of log and metric data, enabling data-driven decision-making across the organization.
Our work has a direct, measurable impact on the productivity of our engineering teams across athenaNation, empowering them to innovate faster and operate more reliably.
In this role, you will tackle a diverse set of challenges, from fine-tuning system performance and scaling services to debugging complex issues. You will partner closely with development teams to deliver new monitoring features, improve existing tools, and solve pressing engineering problems, all within an agile environment that leverages both private and public cloud platforms.
Job Responsibilities
- Automate the deployment, configuration, and management of logging, metrics, and monitoring services, using Puppet and Infrastructure as Code best practices to ensure reliable and scalable operations.
- Proactively troubleshoot and resolve complex production incidents, leveraging deep Linux system administration and engineering expertise to minimize downtime.
- Lead cross-functional projects from conception through delivery, including designing scalable technical solutions, managing timelines, and ensuring successful implementation.
- Architect and implement comprehensive monitoring strategies by developing metrics, dashboards, and alerting criteria to enable proactive service performance management and dynamic scaling.
- Collaborate closely with engineering teams during weekly on-call rotations to swiftly diagnose and resolve high-impact issues, fostering a culture of reliability.
- Partner with development teams to enhance their logging and telemetry capabilities, improving observability and operational efficiency.
- Mentor and guide team members on best practices for incident response, system tuning, and service reliability.
Required Qualifications
- 5-8 years of hands-on experience managing mission-critical production environments with a focus on Linux system administration and DevOps practices.
- Expertise in Amazon Web Services (AWS) and cloud-native approaches.
- Experience working with microservices and production-grade infrastructure.
- Proven expertise in managing and optimizing large-scale logging and data platforms such as Kafka, OpenSearch/Elasticsearch, and log forwarding agents like Vector or Fluentd.
- Extensive experience with configuration management tools such as Puppet or Ansible, automating deployment and operations at scale.
- Demonstrated success troubleshooting and resolving issues in Linux-based production services, including active participation in on-call rotations.
- Proficiency in scripting and programming languages including Bash, Python, and Golang for automation, tooling, and integrations.
- Strong expertise in Infrastructure as Code using Terraform and AWS CloudFormation to build resilient, repeatable deployment workflows.
- Ability to rapidly adapt to evolving technology environments and business priorities with a bias toward reliability and automation.
Additional Qualifications
- Experience managing large-scale production server fleets (thousands of nodes) with high availability and fault tolerance.
- Deep subject matter expertise in technologies such as Graphite, ClickHouse, Prometheus, Grafana, Docker, Jenkins, and Git.
- Familiarity with AWS cloud architecture, deployment, and operational best practices, with hands-on experience deploying scalable cloud-native applications.
- Proficiency with protocol analyzers like tcpdump and Wireshark for network troubleshooting and performance diagnostics.