
Job Title: Site Reliability Engineer 3
Location: Bengaluru, Karnataka
We are seeking an experienced Site Reliability Engineer (SRE) with deep expertise in both AWS and on-premises environments to support, scale, and optimize our hybrid cloud infrastructure. In this role, you will partner closely with engineering, data, and platform teams to ensure the reliability, performance, and operational excellence of our containerized services, data pipelines, observability platforms, and security systems.
What you will do:
Drive reliability strategy to ensure service uptime, availability, latency targets, and overall system health across critical platforms.
Lead the definition and governance of SLIs, SLOs, and SLAs across multiple systems and services.
Lead major incident management and drive cross-team coordination for critical production issues.
Drive organization-wide RCA processes and implement systemic reliability improvements.
Drive strategic operational excellence programs to improve platform reliability, performance, and scalability.
Enhance observability frameworks and implement monitoring strategies using Grafana, Prometheus, CloudWatch, and log aggregation tools across services.
Participate in on-call rotation; troubleshoot incidents across the stack (network, compute, storage, data pipelines, applications).
Drive large-scale automation initiatives to eliminate manual processes and improve platform reliability.
Design platform-level tools and frameworks to improve reliability and developer productivity across teams.
Participate in a 24×7 shift rotation, including nights, weekends, and holidays.
Drive adoption of AI-driven operations, intelligent monitoring, and auto-healing capabilities.
Mentor engineers and drive SRE best practices across teams.
Enforce best practices for access management, network security, secrets management, patching, and vulnerability remediation.
Collaborate with security teams to ensure compliance with organizational and regulatory standards.
Who you are:
3–8 years of experience in infrastructure and systems environments, delivering operational excellence for highly complex distributed systems.
Bachelor's degree in Computer Science or a related field, or equivalent work experience.
Deep expertise in Linux, observability platforms, AWS hybrid environments, container orchestration, and automation frameworks in large-scale production environments.
Strong expertise in Incident Management & Problem Management, leading major incident resolution and driving long-term reliability improvements.
Extensive experience working in a 24/7 operations support environment and managing critical production systems.
Strong experience with enterprise monitoring and observability tools such as Grafana, Prometheus, New Relic, and Dynatrace.
Extensive experience working with hybrid environments (AWS and on-premises infrastructure).
AWS and CKA (Certified Kubernetes Administrator) certifications, along with advanced cloud architecture knowledge, are highly desirable.
Strong experience working with containerization and orchestration platforms.
Experience driving automation, infrastructure-as-code practices, and platform reliability improvements across engineering teams.
[Confidential Information] /whatsapp 9886683329
At Angel One, our thriving culture is rooted in Diversity, Equity, and Inclusion (DEI).
As an equal opportunity employer, we wholeheartedly welcome people from all backgrounds, irrespective of caste, religion, gender, marital status, sexuality, disability, class, or age, to be part of our team. We believe that everyone's unique experiences and viewpoints make us stronger together. Come and be a part of #OneSpace, where your individuality is celebrated and embraced.
Job ID: 146061681