CA0294 - Site Reliability Engineer I

Cloudangles

Noida, India

1-3 Years

Posted 5 days ago

Job Description

Job Summary

Site Reliability Engineers (SREs) work at the intersection of software engineering and systems administration. In other words, they can both write code and manage the infrastructure on which that code runs. This is a very wide skill set, but the end goal of an SRE is always the same: to ensure that all SLAs are met, but not exceeded, so as to balance performance and reliability with operational costs.

As a Site Reliability Engineer I, you will be learning our systems, improving your craft as an engineer, and taking on tasks that improve the overall reliability of the Personify Health platform.

Essential Functions/Responsibilities/Duties

Gather and analyze metrics from systems and applications to support performance optimization and fault diagnosis.
Monitor systems and define observability strategies using New Relic and other tools to identify and fix issues before they impact users.
Learn the fundamentals of sustainable service and system operation, focusing on reliability, efficiency, automation, debugging, understanding technologies, and working against plans and schedules.
Execute tasks and solve problems with clear solutions, working in a team under guidance from the manager and SREs. Collaborate to prioritize high-value tasks that deliver quality results for customers and stakeholders.
Understand and actively participate in the team's core processes, including planning, on-call rotations, incident triage, and metrics review.
Collaborate with development teams to improve services and assist in platform management and capacity planning.
Perform other duties as assigned.

Education And Experience

Bachelor's degree in computer science or engineering, or equivalent related experience.
1+ years in an SRE, DevOps, or infrastructure engineering role.

Required Knowledge, Skills, And Abilities

Experience in cloud platforms such as AWS with container orchestration (Kubernetes, EKS), infrastructure, and monitoring patterns.
Experience with key observability principles such as SLIs, SLOs, and error budgets.
Programming skills in Python, Go, or Java.
Experience with New Relic, Datadog, Prometheus, or similar monitoring tools.
Problem solver with a passion for root cause analysis and continuous improvement.
A clear, concise, and collaborative communicator who excels in cross-functional environments.
Demonstrable knowledge of continuous integration and/or continuous deployment tools and scripting. Skills in GitLab or ArgoCD would be a bonus.
Familiarity with ITIL or similar incident management frameworks.
AWS cloud certifications are a bonus.

Prior experience in healthcare or other highly regulated industries is preferred.