GeekyAnts

DevOps Engineer

  • Posted 2 days ago

Job Description


We are seeking an experienced DevOps / Site Reliability Engineer (L5) to own and scale the production operations of a large-scale, AI-first platform. In this role, you will be responsible for reliability, performance, observability, and cost efficiency across cloud-native workloads running on GCP and Kubernetes. You will work closely with platform, data, and AI teams to ensure resilient, secure, and highly available systems in production.

Key Responsibilities

Own day-2 production operations for a large-scale AI-driven platform running on Google Cloud Platform (GCP).

Run, scale, and harden GKE-based Kubernetes workloads integrated with GCP managed services (data, messaging, AI, networking, and security).

Define, implement, and operate SLIs, SLOs, and error budgets across platform and AI services.

Build and manage end-to-end observability using New Relic (APM, infrastructure monitoring, logging, alerts, and dashboards).

Design, improve, and maintain CI/CD pipelines and Terraform-driven infrastructure automation.

Operate and integrate Azure AI Foundry for LLM deployments and model lifecycle management.

Lead incident response, conduct postmortems, and drive long-term reliability and resilience improvements.

Optimize cost, performance, and autoscaling for AI- and data-intensive workloads.

Collaborate with engineering and leadership teams to drive best practices in reliability, security, and operations.



Key Skills

6+ years of hands-on experience in DevOps, SRE, or Platform Engineering roles.

Strong, production-grade expertise in Google Cloud Platform (GCP), especially GKE and core managed services.

Proven experience running Kubernetes at scale in live, mission-critical environments.

Deep hands-on expertise with New Relic in complex, distributed systems.

Solid experience operating AI/ML or LLM-powered platforms in production.

Strong background in Terraform, infrastructure as code, and CI/CD pipelines.

Good understanding of cloud networking, security, and reliability engineering principles.

Ability to own and operate production systems end-to-end with minimal supervision.

Good-to-Have Skills

Experience with multi-cloud environments (GCP + Azure).

Familiarity with FinOps practices for cloud cost optimization.

Exposure to service mesh, advanced autoscaling strategies, and capacity planning.

Experience with data-intensive or real-time systems.

Knowledge of security best practices, compliance, and IAM in cloud environments.

Prior experience mentoring junior engineers or leading operational initiatives.

Educational Qualifications

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.

Master's degree in a relevant discipline is a plus, but not mandatory.

Job ID: 143228891