
We are seeking an experienced DevOps / Site Reliability Engineer (L5) to own and scale the production operations of a large-scale, AI-first platform. In this role, you will be responsible for reliability, performance, observability, and cost efficiency across cloud-native workloads running on GCP and Kubernetes. You will work closely with platform, data, and AI teams to ensure resilient, secure, and highly available systems in production.
Key Responsibilities
Own day-2 production operations for a large-scale AI-driven platform running on Google Cloud Platform (GCP).
Run, scale, and harden GKE-based Kubernetes workloads integrated with GCP managed services (data, messaging, AI, networking, and security).
Define, implement, and operate SLIs, SLOs, and error budgets across platform and AI services.
Build and manage end-to-end observability using New Relic (APM, infrastructure monitoring, logging, alerts, and dashboards).
Design, improve, and maintain CI/CD pipelines and Terraform-driven infrastructure automation.
Operate and integrate Azure AI Foundry for LLM deployments and model lifecycle management.
Lead incident response, conduct postmortems, and drive long-term reliability and resilience improvements.
Optimize cost, performance, and autoscaling for AI- and data-intensive workloads.
Collaborate with engineering and leadership teams to drive best practices in reliability, security, and operations.
Key Skills
6+ years of hands-on experience in DevOps, SRE, or Platform Engineering roles.
Strong, production-grade expertise in Google Cloud Platform (GCP), especially GKE and core managed services.
Proven experience running Kubernetes at scale in live, mission-critical environments.
Deep hands-on expertise with New Relic in complex, distributed systems.
Solid experience operating AI/ML or LLM-powered platforms in production.
Strong background in Terraform, infrastructure as code, and CI/CD pipelines.
Good understanding of cloud networking, security, and reliability engineering principles.
Ability to own and operate production systems end-to-end with minimal supervision.
Good-to-Have Skills
Experience with multi-cloud environments (GCP + Azure).
Familiarity with FinOps practices for cloud cost optimization.
Exposure to service mesh, advanced autoscaling strategies, and capacity planning.
Experience with data-intensive or real-time systems.
Knowledge of security best practices, compliance, and IAM in cloud environments.
Prior experience mentoring junior engineers or leading operational initiatives.
Educational Qualifications
Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
Master's degree in a relevant discipline is a plus, but not mandatory.
Job ID: 143228891