Lead DevOps Platform Engineer
Level: 8–12+ years (CI/CD pipelines, MLOps/LLMOps automation, observability, security, and cost controls)
About the Organization
The organization is a global consulting firm with over 10,000 entrepreneurial, action- and results-oriented professionals across more than 40 countries. It takes a hands-on approach to solving complex client problems and helping organizations reach their full potential. The culture celebrates independent thinkers and doers who create meaningful impact and shape the industry. A collaborative environment, guided by strong core values, defines how teams work and succeed together.
Role Overview
- The Lead DevOps Platform Engineer is a foundational role responsible for enabling reliable, secure, scalable, and cost-governed delivery of AI, Machine Learning, and Generative AI solutions across the enterprise.
- This role owns the platform layer that sits beneath AI applications, covering cloud infrastructure, CI/CD pipelines, MLOps/LLMOps automation, observability, security, and cost controls. The role ensures that AI solutions do not remain experimental but are production-ready, repeatable, auditable, and operable at scale.
- This role exists to reduce delivery and operational risk and to provide a stable platform backbone that lets AI and data teams innovate safely and efficiently.
Key Responsibilities
AI Platform & Cloud Architecture
- Own and evolve cloud platform architecture supporting AI, ML, and GenAI workloads across all environments
- Design platforms for model training, fine-tuning, high-availability inference, batch and event-driven pipelines, and long-running or agent-based workflows
- Ensure platforms are cloud-native, modular, extensible, and aligned with enterprise architecture standards
- Enable multi-cloud portability (Azure, AWS, GCP) through abstraction of cloud dependencies
- Partner with GenAI & Data Architects to align platform capabilities with RAG pipelines, agent orchestration, and data platform architectures
CI/CD & Automation
- Design and implement end-to-end CI/CD pipelines for applications, data pipelines, ML models, and GenAI prompts
- Standardize environment promotion with automated testing, approvals, rollback, and release controls
- Integrate pipelines with source control, artifact repositories, model registries, and prompt repositories
- Implement progressive delivery patterns such as blue-green deployments, canary releases, and feature flags
- Embed security scans, quality gates, and compliance checks directly into CI/CD workflows
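By way of illustration only (function name, metric inputs, and thresholds below are hypothetical, not a specific toolchain's API), a minimal canary promotion gate of the kind this role would embed in a progressive-delivery pipeline might look like:

```python
# Hypothetical sketch of a canary promotion gate: promote the canary
# fleet only if its error rate does not exceed the baseline's error
# rate by more than an allowed factor. All thresholds are illustrative.

def should_promote_canary(baseline_errors: int, baseline_requests: int,
                          canary_errors: int, canary_requests: int,
                          max_relative_increase: float = 1.5) -> bool:
    """Return True if the canary's error rate is within the allowed
    multiple of the baseline error rate."""
    if canary_requests == 0:
        return False  # no canary traffic observed yet; keep holding
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # A small absolute floor so a zero-error baseline does not block
    # any canary that sees a single transient error.
    allowed = max(baseline_rate * max_relative_increase, 0.001)
    return canary_rate <= allowed
```

In a real pipeline this check would run as an automated gate between the canary and full-rollout stages, with rollback triggered on failure.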
Infrastructure as Code & Environment Standardization
- Define and enforce Infrastructure-as-Code standards using Terraform, ARM/Bicep, and cloud SDKs
- Automate provisioning of compute, storage, networking, Kubernetes clusters, and AI platform services
- Ensure environments are reproducible, version-controlled, auditable, and free from configuration drift
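As a simplified sketch of the configuration-drift checks mentioned above (the function and the attribute keys are hypothetical; real drift detection would compare Terraform state against live cloud resources), detecting drift reduces to diffing desired versus actual attributes:

```python
# Hypothetical sketch: report every attribute where the live resource
# diverges from the declared Infrastructure-as-Code definition.
# Keys present on only one side also count as drift.

def detect_drift(desired: dict, actual: dict) -> dict:
    """Return a mapping of drifted keys to (desired, actual) pairs."""
    drift = {}
    for key in set(desired) | set(actual):
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift
```

A scheduled job running such a comparison (or `terraform plan` in CI) is a common way to keep environments reproducible and auditable.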
Observability, Reliability & SRE Practices
- Design and implement end-to-end observability including metrics, logs, and distributed tracing
- Define and monitor SLIs and SLOs for AI, data, and platform services
- Design for high availability, fault tolerance, and disaster recovery
- Lead incident response, root-cause analysis, and post-incident reviews
- Drive continuous reliability improvements using operational metrics
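To make the SLI/SLO work above concrete, here is a minimal sketch of error-budget burn-rate alerting (function names are illustrative; the 14.4 threshold is the commonly used fast-burn value for a one-hour window against a 30-day budget, but any real policy would be tuned per service):

```python
# Hypothetical sketch of SLO error-budget burn-rate math.

def error_budget_burn_rate(slo_target: float, error_rate: float) -> float:
    """Burn rate = observed error rate / allowed error rate.
    A burn rate of 1.0 exhausts the budget exactly at window end."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("SLO target must be below 100%")
    return error_rate / allowed

def should_page(slo_target: float, error_rate: float,
                threshold: float = 14.4) -> bool:
    """Page on-call when the budget is burning far faster than planned."""
    return error_budget_burn_rate(slo_target, error_rate) >= threshold
```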
Cost Management & FinOps
- Implement FinOps practices for AI and data platforms
- Track and optimize infrastructure usage, cost per inference, and GenAI token consumption
- Establish cost guardrails including budgets, alerts, auto-scaling, and shutdown policies
- Partner with architects and business stakeholders to balance accuracy, latency, scale, and cost
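As a rough sketch of the cost guardrails described above (function names, pacing model, and budget figures are hypothetical assumptions, not a FinOps tool's API), per-inference cost and token-budget pacing checks can be as simple as:

```python
# Hypothetical FinOps guardrail sketch: unit cost per inference and a
# linear pacing check for monthly GenAI token budgets.

def cost_per_inference(total_cost: float, inference_count: int) -> float:
    """Unit economics metric: platform spend divided by inferences served."""
    if inference_count <= 0:
        raise ValueError("inference_count must be positive")
    return total_cost / inference_count

def token_budget_exceeded(tokens_used: int, monthly_budget_tokens: int,
                          day_of_month: int, days_in_month: int = 30) -> bool:
    """Flag spend that is running ahead of a straight-line monthly pace.
    Real guardrails would also trigger alerts, quotas, or auto-shutdown."""
    expected_by_now = monthly_budget_tokens * day_of_month / days_in_month
    return tokens_used > expected_by_now
```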
Security, Governance & Compliance
- Embed security-by-design into platform architecture and delivery pipelines
- Implement IAM, secrets management, encryption, network segmentation, and secure connectivity
- Enable audit logging, traceability, and governance for model execution, prompt usage, and data access
- Support internal and external audits, penetration testing, and compliance reviews
MLOps / LLMOps Enablement
- Enable and operate MLOps and LLMOps platforms covering training, serving, monitoring, versioning, and rollback
- Support automated evaluation, retraining, drift detection, and performance degradation alerts
- Ensure platforms support experimentation without compromising production stability
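One common drift-detection statistic the monitoring described above might use is the population stability index (the implementation below is a generic sketch, not a specific MLOps platform's API; the 0.2 alert cutoff is a widely cited heuristic, not a universal rule):

```python
import math

def population_stability_index(expected: list, actual: list) -> float:
    """PSI between two histograms given as bucket proportions that each
    sum to 1.0. A common heuristic treats PSI > 0.2 as meaningful drift.
    A small epsilon avoids log(0) on empty buckets."""
    eps = 1e-6
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

In production, `expected` would come from the training-time feature distribution and `actual` from a recent serving window, with alerts feeding the retraining pipeline.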
Collaboration & Leadership
- Collaborate with GenAI & Data Architects, AI Engineers, Backend and Frontend Engineers, Security, QA, and Delivery teams
- Participate in Agile ceremonies, release planning, and roadmap discussions
- Provide technical leadership and mentoring to DevOps and platform engineers
- Define platform standards, documentation, and best practices
- Act as a trusted advisor to leadership on scalability, risk, and cost
Professional Experience
- 8–12+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE roles
- Proven experience designing and operating enterprise-scale, production platforms
- Hands-on experience supporting AI/ML and GenAI workloads in regulated or security-conscious environments
- Deep expertise in at least one major cloud platform (Azure, AWS, or GCP)
- Strong experience with CI/CD, Infrastructure as Code, Kubernetes, and containerized workloads
- Proven experience implementing observability, reliability engineering, and incident management practices
- Strong understanding of cloud security, governance, and compliance requirements
- Hands-on experience with cloud cost optimization and FinOps practices
- Proven ability to lead and mentor platform teams and communicate effectively with executive stakeholders
- Experience working in Agile and DevOps operating models
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related discipline
- Master's degree preferred
- Relevant certifications strongly desired (Azure/AWS/GCP Architect or DevOps, Kubernetes, Terraform)