Description
We are seeking a senior Ops Engineer to support GenAI, LLM, and ML workloads, with a strong focus on deployment automation, observability, scalability, and platform reliability across Azure and Kubernetes environments.
Key Responsibilities
- Build and maintain CI/CD/CT pipelines for ML models, LLMs, and GenAI workloads
- Deploy and operationalize:
  - ML models and custom LLMs
  - AI agents and GenAI services using Databricks, MLflow, and AKS/ARO
- Integrate and scale GenAI ecosystems, including:
  - Azure OpenAI / OpenAI
  - HuggingFace models
  - RAG pipelines and vector databases
- Support development and deployment of custom models and out-of-the-box AI agents
- Manage Databricks:
  - Workspaces
  - Clusters
  - Model registry
  - Job orchestration
- Own the AKS/ARO lifecycle, including networking, scaling, Helm-based deployments, and GitOps workflows
- Implement robust observability for AI/ML/LLM systems (latency, drift, reliability, performance)
- Ensure cloud security, governance, access controls, and cost efficiency
Required Skills
- Strong hands-on experience with Azure, AKS/Kubernetes, Databricks, and MLflow
- Experience with LLMOps, RAG pipelines, and vector stores (FAISS, Pinecone, Chroma, etc.)
- Proficiency in Python and automation scripting
- Strong understanding of AI/ML system operations and platform reliability
(ref:hirist.tech)