Lead DevOps Platform Engineer
Level: 8–12+ years (CI/CD pipelines, MLOps/LLMOps automation, observability, security, and cost controls)
About the Organization
The organization is a global consulting firm with over 10,000 entrepreneurial, action- and results-oriented professionals across more than 40 countries. It takes a hands-on approach to solving complex client problems and helping organizations reach their full potential. The culture celebrates independent thinkers and doers who create meaningful impact and shape the industry. A collaborative environment, guided by strong core values, defines how teams work and succeed together.
Role Overview
- The Lead DevOps Platform Engineer is a foundational role responsible for enabling reliable, secure, scalable, and cost-governed delivery of AI, Machine Learning, and Generative AI solutions across the enterprise.
- This role owns the platform layer that sits beneath AI applications, covering cloud infrastructure, CI/CD pipelines, MLOps/LLMOps automation, observability, security, and cost controls. The role ensures that AI solutions do not remain experimental but are production-ready, repeatable, auditable, and operable at scale.
- This role exists to reduce delivery and operational risk and to provide a stable platform backbone that lets AI and data teams innovate safely and efficiently.
Key Responsibilities
AI Platform & Cloud Architecture
- Own and evolve cloud platform architecture supporting AI, ML, and GenAI workloads across all environments
- Design platforms for model training, fine-tuning, high-availability inference, batch and event-driven pipelines, and long-running or agent-based workflows
- Ensure platforms are cloud-native, modular, extensible, and aligned with enterprise architecture standards
- Enable multi-cloud portability (Azure, AWS, GCP) through abstraction of cloud dependencies
- Partner with GenAI & Data Architects to align platform capabilities with RAG pipelines, agent orchestration, and data platform architectures
CI/CD & Automation
- Design and implement end-to-end CI/CD pipelines for applications, data pipelines, ML models, and GenAI prompts
- Standardize environment promotion with automated testing, approvals, rollback, and release controls
- Integrate pipelines with source control, artifact repositories, model registries, and prompt repositories
- Implement progressive delivery patterns such as blue-green deployments, canary releases, and feature flags
- Embed security scans, quality gates, and compliance checks directly into CI/CD workflows
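By way of illustration only (function name, metric inputs, and thresholds below are hypothetical, not a specific toolchain's API), a minimal canary promotion gate of the kind this role would embed in a progressive-delivery pipeline might look like:

```python
# Hypothetical sketch of a canary promotion gate: promote the canary
# fleet only if its error rate does not exceed the baseline's error
# rate by more than an allowed factor. All thresholds are illustrative.

def should_promote_canary(baseline_errors: int, baseline_requests: int,
                          canary_errors: int, canary_requests: int,
                          max_relative_increase: float = 1.5) -> bool:
    """Return True if the canary's error rate is within the allowed
    multiple of the baseline error rate."""
    if canary_requests == 0:
        return False  # no canary traffic observed yet; keep holding
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / canary_requests
    # A small absolute floor so a zero-error baseline does not block
    # any canary that sees a single transient error.
    allowed = max(baseline_rate * max_relative_increase, 0.001)
    return canary_rate <= allowed
```

In a real pipeline this check would run as an automated gate between the canary and full-rollout stages, with rollback triggered on failure.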
Infrastructure as Code & Environment Standardization
- Define and enforce Infrastructure-as-Code standards using Terraform, ARM/Bicep, and cloud SDKs
- Automate provisioning of compute, storage, networking, Kubernetes clusters, and AI platform services
- Ensure environments are reproducible, version-controlled, auditable, and free from configuration drift
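As a simplified sketch of the configuration-drift checks mentioned above (the function and the attribute keys are hypothetical; real drift detection would compare Terraform state against live cloud resources), detecting drift reduces to diffing desired versus actual attributes:

```python
# Hypothetical sketch: report every attribute where the live resource
# diverges from the declared Infrastructure-as-Code definition.
# Keys present on only one side also count as drift.

def detect_drift(desired: dict, actual: dict) -> dict:
    """Return a mapping of drifted keys to (desired, actual) pairs."""
    drift = {}
    for key in set(desired) | set(actual):
        if desired.get(key) != actual.get(key):
            drift[key] = (desired.get(key), actual.get(key))
    return drift
```

A scheduled job running such a comparison (or `terraform plan` in CI) is a common way to keep environments reproducible and auditable.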
Observability, Reliability & SRE Practices
- Design and implement end-to-end observability including metrics, logs, and distributed tracing
- Define and monitor SLIs and SLOs for AI, data, and platform services
- Design for high availability, fault tolerance, and disaster recovery
- Lead incident response, root-cause analysis, and post-incident reviews
- Drive continuous reliability improvements using operational metrics
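To make the SLI/SLO work above concrete, here is a minimal sketch of error-budget burn-rate alerting (function names are illustrative; the 14.4 threshold is the commonly used fast-burn value for a one-hour window against a 30-day budget, but any real policy would be tuned per service):

```python
# Hypothetical sketch of SLO error-budget burn-rate math.

def error_budget_burn_rate(slo_target: float, error_rate: float) -> float:
    """Burn rate = observed error rate / allowed error rate.
    A burn rate of 1.0 exhausts the budget exactly at window end."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("SLO target must be below 100%")
    return error_rate / allowed

def should_page(slo_target: float, error_rate: float,
                threshold: float = 14.4) -> bool:
    """Page on-call when the budget is burning far faster than planned."""
    return error_budget_burn_rate(slo_target, error_rate) >= threshold
```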
Cost Management & FinOps
- Implement FinOps practices for AI and data platforms
- Track and optimize infrastructure usage, cost per inference, and GenAI token consumption
- Establish cost guardrails including budgets, alerts, auto-scaling, and shutdown policies
- Partner with architects and business stakeholders to balance accuracy, latency, scale, and cost
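As a rough sketch of the cost guardrails described above (function names, pacing model, and budget figures are hypothetical assumptions, not a FinOps tool's API), per-inference cost and token-budget pacing checks can be as simple as:

```python
# Hypothetical FinOps guardrail sketch: unit cost per inference and a
# linear pacing check for monthly GenAI token budgets.

def cost_per_inference(total_cost: float, inference_count: int) -> float:
    """Unit economics metric: platform spend divided by inferences served."""
    if inference_count <= 0:
        raise ValueError("inference_count must be positive")
    return total_cost / inference_count

def token_budget_exceeded(tokens_used: int, monthly_budget_tokens: int,
                          day_of_month: int, days_in_month: int = 30) -> bool:
    """Flag spend that is running ahead of a straight-line monthly pace.
    Real guardrails would also trigger alerts, quotas, or auto-shutdown."""
    expected_by_now = monthly_budget_tokens * day_of_month / days_in_month
    return tokens_used > expected_by_now
```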
Security, Governance & Compliance
- Embed security-by-design into platform architecture and delivery pipelines
- Implement IAM, secrets management, encryption, network segmentation, and secure connectivity
- Enable audit logging, traceability, and governance for model execution, prompt usage, and data access
- Support internal and external audits, penetration testing, and compliance reviews
MLOps / LLMOps Enablement
- Enable and operate MLOps and LLMOps platforms covering training, serving, monitoring, versioning, and rollback
- Support automated evaluation, retraining, drift detection, and performance degradation alerts
- Ensure platforms support experimentation without compromising production stability
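One common drift-detection statistic the monitoring described above might use is the population stability index (the implementation below is a generic sketch, not a specific MLOps platform's API; the 0.2 alert cutoff is a widely cited heuristic, not a universal rule):

```python
import math

def population_stability_index(expected: list, actual: list) -> float:
    """PSI between two histograms given as bucket proportions that each
    sum to 1.0. A common heuristic treats PSI > 0.2 as meaningful drift.
    A small epsilon avoids log(0) on empty buckets."""
    eps = 1e-6
    psi = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

In production, `expected` would come from the training-time feature distribution and `actual` from a recent serving window, with alerts feeding the retraining pipeline.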
Collaboration & Leadership
- Collaborate with GenAI & Data Architects, AI Engineers, Backend and Frontend Engineers, Security, QA, and Delivery teams
- Participate in Agile ceremonies, release planning, and roadmap discussions
- Provide technical leadership and mentoring to DevOps and platform engineers
- Define platform standards, documentation, and best practices
- Act as a trusted advisor to leadership on scalability, risk, and cost
Professional Experience
- 8–12+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE roles
- Proven experience designing and operating enterprise-scale, production platforms
- Hands-on experience supporting AI/ML and GenAI workloads in regulated or security-conscious environments
- Deep expertise in at least one major cloud platform (Azure, AWS, or GCP)
- Strong experience with CI/CD, Infrastructure as Code, Kubernetes, and containerized workloads
- Proven experience implementing observability, reliability engineering, and incident management practices
- Strong understanding of cloud security, governance, and compliance requirements
- Hands-on experience with cloud cost optimization and FinOps practices
- Proven ability to lead and mentor platform teams and communicate effectively with executive stakeholders
- Experience working in Agile and DevOps operating models
Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related discipline
- Master's degree preferred
- Relevant certifications strongly desired (Azure/AWS/GCP Architect or DevOps, Kubernetes, Terraform)