

Senior Member of Technical Staff - SMTS

  • Posted 3 hours ago

Job Description

Join us as we work to create a thriving ecosystem that delivers accessible, high-quality, and sustainable healthcare for all.

Join Our Cloud Infrastructure Engineering & Operations Division as a Senior Site Reliability Engineer

We are seeking a highly skilled Senior Site Reliability Engineer to elevate our Cloud Infrastructure Engineering & Operations team. Your primary mission will be to enhance the performance, reliability, and scalability of our platforms by spearheading the development of a world-class observability ecosystem that drives business success.

The Team:

The Logging, Metrics, and Monitoring (LMM) team is at the forefront of building and delivering observability services and tools for our engineering communities within the Cloud Engineering & Operations and Research & Development zones. Our solutions are critical: hundreds of developers use them daily to develop, monitor, troubleshoot, and optimize our web services. We manage large-scale, distributed, fault-tolerant systems that collect and host vast volumes of log and metric data, enabling data-driven decision-making across the organization.

Our work has a direct, measurable impact on the productivity of our engineering teams across athenaNation, empowering them to innovate faster and operate more reliably.

In this role, you will tackle a diverse set of challenges, from fine-tuning system performance and scaling services to debugging complex issues. You will partner closely with development teams to deliver new monitoring features, improve existing tools, and solve pressing engineering problems, all within an agile environment that leverages both private and public cloud platforms.

Job Responsibilities

  • Automate the deployment, configuration, and management of logging, metrics, and monitoring services leveraging Puppet and Infrastructure as Code best practices to ensure reliable and scalable operations.
  • Proactively troubleshoot and resolve complex production incidents, leveraging deep Linux system administration and engineering expertise to minimize downtime.
  • Lead cross-functional projects from conception through delivery, including designing scalable technical solutions, managing timelines, and ensuring successful implementation.
  • Architect and implement comprehensive monitoring strategies by developing metrics, dashboards, and alerting criteria to enable proactive service performance management and dynamic scaling.
  • Collaborate closely with engineering teams during weekly on-call rotations to swiftly diagnose and resolve high-impact issues, fostering a culture of reliability.
  • Partner with development teams to enhance their logging and telemetry capabilities, improving observability and operational efficiency.
  • Mentor and guide team members on best practices for incident response, system tuning, and service reliability.

Required Qualifications

  • 5-8 years of hands-on experience managing mission-critical production environments with a focus on Linux system administration and DevOps practices.
  • Expertise in Amazon Web Services and cloud-native approaches.
  • Experience working with microservices and production-grade infrastructure.
  • Proven expertise in managing and optimizing large-scale logging and data platforms such as Kafka, OpenSearch/Elasticsearch, and log forwarding agents like Vector or Fluentd.
  • Extensive experience with configuration management tools such as Puppet or Ansible, automating deployment and operations at scale.
  • Scripting experience with Python or Bash.
  • Demonstrated success troubleshooting and resolving issues in Linux-based production services, including participating actively in on-call rotations.
  • Proficiency in scripting and programming languages including Bash, Python, and Golang for automation, tooling, and integrations.
  • Strong expertise in Infrastructure as Code using Terraform and AWS CloudFormation to build resilient, repeatable deployment workflows.
  • Ability to rapidly adapt to evolving technology environments and business priorities with a bias toward reliability and automation.

Additional Qualifications

  • Experience managing large-scale production server fleets (thousands of nodes) with high availability and fault tolerance.
  • Deep subject matter expertise in technologies such as Graphite, ClickHouse, Prometheus, Grafana, Docker, Jenkins, and Git.
  • Familiarity with AWS cloud architecture, deployment, and operational best practices, with hands-on experience deploying scalable cloud-native applications.
  • Proficiency with protocol analyzers like tcpdump and Wireshark for network troubleshooting and performance diagnostics.


Job ID: 135131441