Overview
The Lead AIOps Engineer architects, provisions, and operationalizes AI platforms on Google Cloud across three environments (Sandbox, Dev, and Prod). The role covers cloud environment setup, IAM governance, CI/CD pipeline development, AIOps automation, drift detection, lifecycle process design, documentation, and alignment with broader enterprise platforms. This is a hands-on technical leadership position.
Responsibilities
Environment Provisioning
- Conduct workshops to gather GCP environment requirements.
- Design cloud architecture including VPC, IAM, subnetting, quotas, endpoints, and security controls.
- Lead the provisioning of Sandbox, Dev, and Prod GCP projects using Terraform.
- Oversee API enablement, configuration, and validation testing.
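As an illustration of the API enablement validation above, the following is a minimal sketch and not a prescribed implementation. It assumes the list of enabled services for each project has already been exported (for example from `gcloud services list --enabled`). The required set, the environment names used as keys, and the function name `missing_apis` are hypothetical choices made for this example.

```python
from typing import Dict, Iterable, List

# Hypothetical baseline of services an AI platform project might need.
REQUIRED_APIS = {
    "aiplatform.googleapis.com",
    "cloudbuild.googleapis.com",
    "run.googleapis.com",
    "pubsub.googleapis.com",
}


def missing_apis(
    enabled_by_env: Dict[str, Iterable[str]],
    required: Iterable[str] = REQUIRED_APIS,
) -> Dict[str, List[str]]:
    """Return, per environment, the required APIs that are not enabled.

    Environments with nothing missing are omitted from the result.
    """
    required_set = set(required)
    gaps: Dict[str, List[str]] = {}
    for env, enabled in enabled_by_env.items():
        missing = sorted(required_set - set(enabled))
        if missing:
            gaps[env] = missing
    return gaps


if __name__ == "__main__":
    # Example input: Dev is missing Pub/Sub, the other environments are complete.
    enabled = {
        "sandbox": REQUIRED_APIS,
        "dev": REQUIRED_APIS - {"pubsub.googleapis.com"},
        "prod": REQUIRED_APIS,
    }
    print(missing_apis(enabled))
```

A check of this kind could run as a post-provisioning step so that an incomplete environment is caught before workloads are deployed to it.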
Role Definitions & IAM Governance
- Define IAM roles for AI platform users (Owner, Support, ML Engineer, Viewer).
- Create IAM matrices, RACI charts, and detailed access control documentation.
- Enforce least-privilege access policies across Vertex AI and other GCP services.
- Coordinate reviews and approvals with security and architecture teams.
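To make the IAM matrix and least-privilege review above more concrete, here is a minimal sketch. The mapping of personas to roles is purely illustrative and is not a recommended assignment; the actual matrix would come out of the security and architecture reviews. The check simply flags any persona that has been granted one of the broad basic roles (`roles/owner`, `roles/editor`, `roles/viewer`) instead of a narrower predefined role. The names `ROLE_MATRIX` and `find_basic_role_grants` are hypothetical.

```python
from typing import Dict, List

# GCP basic roles are project-wide and usually too broad for least privilege.
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}

# Illustrative persona-to-role matrix (an assumption for this example only).
ROLE_MATRIX: Dict[str, List[str]] = {
    "Owner": ["roles/aiplatform.admin"],
    "Support": ["roles/aiplatform.viewer", "roles/logging.viewer"],
    "ML Engineer": ["roles/aiplatform.user"],
    "Viewer": ["roles/aiplatform.viewer"],
}


def find_basic_role_grants(matrix: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the personas that hold basic roles, with the offending roles."""
    violations: Dict[str, List[str]] = {}
    for persona, roles in matrix.items():
        broad = sorted(role for role in roles if role in BASIC_ROLES)
        if broad:
            violations[persona] = broad
    return violations


if __name__ == "__main__":
    print(find_basic_role_grants(ROLE_MATRIX))
```

Keeping the matrix as data makes it possible to review it alongside the access control documentation and to run the same check in CI before role bindings are applied.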
AIOps Framework Development
- Design and implement drift detection, anomaly monitoring, canary releases, automated rollback, and observability components.
- Build reusable CI/CD pipelines using Vertex Pipelines and Cloud Build.
- Develop SOPs, diagrams, runbooks, and the full AIOps operations playbook.
- Execute and validate synthetic drift, monitoring, and pipeline test scenarios.
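The drift detection and synthetic drift testing described above could take many forms. One common option is the Population Stability Index (PSI), sketched below in plain Python as an assumption of this example, not as the method this role prescribes. The bin count, the smoothing constant `eps`, and the 0.2 alert threshold (a frequently cited rule of thumb) are all illustrative, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def psi(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a baseline and a current sample.

    Bin edges are derived from the baseline range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def should_trigger_retraining(score: float, threshold: float = 0.2) -> bool:
    """Flag drift when the PSI exceeds the (assumed) alert threshold."""
    return score > threshold


if __name__ == "__main__":
    baseline = [i / 100 for i in range(100)]
    # Synthetic drift: the same feature shifted into the upper half of the range.
    shifted = [0.5 + i / 200 for i in range(100)]
    print(should_trigger_retraining(psi(baseline, baseline)))
    print(should_trigger_retraining(psi(baseline, shifted)))
```

A synthetic test scenario can feed a deliberately shifted sample through the same function to confirm that the alerting and retraining path fires as expected.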
Lifecycle Processes
- Define the complete ML lifecycle from environment setup through deployment, monitoring, retraining triggers, and retirement.
- Integrate lifecycle processes within CI/CD and AIOps automation.
- Document all lifecycle flows in Confluence and conduct validation sessions.
Resource Planning & Cost Modelling
- Define the team structure, roles, and support plans.
- Build cost and usage models using GCP calculators and automation scripts.
- Prepare development and production usage forecasts and long-term TCO estimates.
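As a rough sketch of the kind of automation script the cost and usage modelling above might involve, the example below projects monthly spend under a constant growth rate and sums it into a simple TCO figure. All unit prices and volumes are placeholders, not real GCP pricing; actual rates would come from the GCP pricing calculator or billing exports. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class UsageLine:
    """One line of the usage model, e.g. training hours or prediction requests."""
    name: str
    monthly_units: float
    unit_price_usd: float  # placeholder value, not a published price


def monthly_cost(lines: Sequence[UsageLine]) -> float:
    """Total cost for one month at the current usage level."""
    return sum(line.monthly_units * line.unit_price_usd for line in lines)


def forecast(lines: Sequence[UsageLine], months: int, monthly_growth: float) -> List[float]:
    """Projected cost per month, compounding usage growth from month 0."""
    base = monthly_cost(lines)
    return [round(base * (1 + monthly_growth) ** m, 2) for m in range(months)]


def tco(lines: Sequence[UsageLine], months: int, monthly_growth: float) -> float:
    """Sum of the monthly forecast over the planning horizon."""
    return round(sum(forecast(lines, months, monthly_growth)), 2)


if __name__ == "__main__":
    # Placeholder usage lines for illustration only.
    dev = [
        UsageLine("training_node_hours", 100, 2.0),
        UsageLine("online_prediction_node_hours", 50, 1.0),
    ]
    print(forecast(dev, months=3, monthly_growth=0.10))
    print(tco(dev, months=3, monthly_growth=0.10))
```

Separate usage lines per environment (development and production) can be passed through the same functions to produce side-by-side forecasts.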
Alignment Analysis
- Assess overlap and integration points with existing enterprise initiatives (Data Lake, Billing, Cloud Migration, Security).
- Document dependencies, risks, and overlapping components.
- Produce final recommendations and alignment reports.
Requirements
Core Technical Skills
- Strong expertise in Google Cloud Platform: Vertex AI, IAM, VPC, Cloud Build, Cloud Run, Cloud Functions, Pub/Sub.
- Deep experience with Terraform and Infrastructure as Code workflows.
- Practical experience with AIOps and MLOps frameworks.
- Proficiency in Python for automation and monitoring jobs.
- Experience designing and operating CI/CD pipelines for ML workloads.
- Knowledge of observability tools such as Cloud Monitoring, Logging, and OpenTelemetry.
Soft Skills
- Strong client-facing and stakeholder engagement abilities.
- Experience leading engineering teams and driving architectural decisions.
- Excellent documentation and presentation skills.
- Ability to guide cross-functional teams through complex technical implementations.
Preferred Qualifications
- GCP Professional ML Engineer or Cloud Architect certification.
- Experience with Looker or other operational dashboards.
- Background in ML engineering or SRE.