Job Description

Lead DevOps Platform Engineer

Level: 8-12+ years (CI/CD pipelines, MLOps/LLMOps automation, observability, security, and cost controls)

About the Organization

The organization is a global consulting firm with over 10,000 entrepreneurial, action- and results-oriented professionals across more than 40 countries. It takes a hands-on approach to solving complex client problems and helping organizations reach their full potential. The culture celebrates independent thinkers and doers who create meaningful impact and shape the industry. A collaborative environment, guided by strong core values, defines how teams work and succeed together.

Role Overview

  • The Lead DevOps Platform Engineer is a foundational role responsible for enabling reliable, secure, scalable, and cost-governed delivery of AI, Machine Learning, and Generative AI solutions across the enterprise.
  • This role owns the platform layer that sits beneath AI applications, covering cloud infrastructure, CI/CD pipelines, MLOps/LLMOps automation, observability, security, and cost controls. The role ensures that AI solutions do not remain experimental but are production-ready, repeatable, auditable, and operable at scale.
  • This role exists to eliminate risks and provide a stable platform backbone for AI and data teams to innovate safely and efficiently.

Key Responsibilities

AI Platform & Cloud Architecture

  • Own and evolve cloud platform architecture supporting AI, ML, and GenAI workloads across all environments
  • Design platforms for model training, fine-tuning, high-availability inference, batch and event-driven pipelines, and long-running or agent-based workflows
  • Ensure platforms are cloud-native, modular, extensible, and aligned with enterprise architecture standards
  • Enable multi-cloud portability (Azure, AWS, GCP) through abstraction of cloud dependencies
  • Partner with GenAI & Data Architects to align platform capabilities with RAG pipelines, agent orchestration, and data platform architectures

CI/CD & Automation

  • Design and implement end-to-end CI/CD pipelines for applications, data pipelines, ML models, and GenAI prompts
  • Standardize environment promotion with automated testing, approvals, rollback, and release controls
  • Integrate pipelines with source control, artifact repositories, model registries, and prompt repositories
  • Implement progressive delivery patterns such as blue-green deployments, canary releases, and feature flags
  • Embed security scans, quality gates, and compliance checks directly into CI/CD workflows

Infrastructure as Code & Environment Standardization

  • Define and enforce Infrastructure-as-Code standards using Terraform, ARM/Bicep, and cloud SDKs
  • Automate provisioning of compute, storage, networking, Kubernetes clusters, and AI platform services
  • Ensure environments are reproducible, version-controlled, auditable, and free from configuration drift

Observability, Reliability & SRE Practices

  • Design and implement end-to-end observability including metrics, logs, and distributed tracing
  • Define and monitor SLIs and SLOs for AI, data, and platform services
  • Design for high availability, fault tolerance, and disaster recovery
  • Lead incident response, root-cause analysis, and post-incident reviews
  • Drive continuous reliability improvements using operational metrics

Cost Management & FinOps

  • Implement FinOps practices for AI and data platforms
  • Track and optimize infrastructure usage, cost per inference, and GenAI token consumption
  • Establish cost guardrails including budgets, alerts, auto-scaling, and shutdown policies
  • Partner with architects and business stakeholders to balance accuracy, latency, scale, and cost

Security, Governance & Compliance

  • Embed security-by-design into platform architecture and delivery pipelines
  • Implement IAM, secrets management, encryption, network segmentation, and secure connectivity
  • Enable audit logging, traceability, and governance for model execution, prompt usage, and data access
  • Support internal and external audits, penetration testing, and compliance reviews

MLOps / LLMOps Enablement

  • Enable and operate MLOps and LLMOps platforms covering training, serving, monitoring, versioning, and rollback
  • Support automated evaluation, retraining, drift detection, and performance degradation alerts
  • Ensure platforms support experimentation without compromising production stability

Collaboration & Leadership

  • Collaborate with GenAI & Data Architects, AI Engineers, Backend and Frontend Engineers, Security, QA, and Delivery teams
  • Participate in Agile ceremonies, release planning, and roadmap discussions
  • Provide technical leadership and mentoring to DevOps and platform engineers
  • Define platform standards, documentation, and best practices
  • Act as a trusted advisor to leadership on scalability, risk, and cost

Professional Experience

  • 8-12+ years of experience in DevOps, Platform Engineering, Cloud Infrastructure, or SRE roles
  • Proven experience designing and operating enterprise-scale, production platforms
  • Hands-on experience supporting AI/ML and GenAI workloads in regulated or security-conscious environments
  • Deep expertise in at least one major cloud platform (Azure, AWS, or GCP)
  • Strong experience with CI/CD, Infrastructure as Code, Kubernetes, and containerized workloads
  • Proven experience implementing observability, reliability engineering, and incident management practices
  • Strong understanding of cloud security, governance, and compliance requirements
  • Hands-on experience with cloud cost optimization and FinOps practices
  • Proven ability to lead and mentor platform teams and communicate effectively with executive stakeholders
  • Experience working in Agile and DevOps operating models

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or a related discipline
  • Master's degree preferred
  • Relevant certifications strongly desired (Azure/AWS/GCP Architect or DevOps, Kubernetes, Terraform)

Job ID: 139132619
