
AI Platform Engineer
We're looking for a hands‑on AI Platform Engineer to own the infrastructure, operations, and SDLC for the Enterprise AI team. This role is focused on running, hardening, and evolving GenAI platforms, while also helping build internal platform tooling that enables GenAI development at SEI.
You will support teams building GenAI solutions by providing secure, reliable, well‑operated infrastructure, strong deployment standards, and clear operational guardrails. Success in this role is measured by system stability, security, scalability, and developer velocity across GenAI workloads.
This is a role for an engineer with an operator mindset, strong cloud fundamentals, and deep ownership of production systems.
What you'll do
GenAI Platform Operations
• Contribute to the runtime and operational posture for the Enterprise AI team, including RAG systems, inference services, evaluation jobs, and supporting pipelines.
• Operate Azure AI Foundry / Azure AI services, Azure AI Search (vector and hybrid), and related data access patterns.
• Help define and maintain standardized platform patterns for GenAI infrastructure on the Enterprise AI team.
SDLC Enablement
• Enable secure and repeatable SDLC workflows for GenAI systems, including CI/CD, environment promotion, rollback strategies, and operational standards.
• Work within the team to establish and evolve shared standards for building and operating GenAI systems.
Infrastructure, Networking, and Security
• Design, provision, and operate Azure infrastructure using Terraform.
• Manage modular Terraform configurations across environments (dev, test, prod) with strong state and lifecycle discipline.
• Interact with Azure networking, including VNets, subnet segmentation, private endpoints, firewall integration, routing, DNS, and egress control.
• Ensure secure access to Azure resources using managed identities and RBAC, avoiding key‑based access where possible.
• Partner with security teams to meet enterprise and regulatory requirements.
Reliability, Observability, and Operations
• Implement observability by default using App Insights, Log Analytics, and OpenTelemetry.
• Define and monitor SLIs and SLOs such as latency, availability, error rates, and cost signals.
• Own incident response, root cause analysis, and operational runbooks for GenAI infrastructure.
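Purely as an illustration of the kind of SLI/SLO checks described above (the request model, names, and targets here are hypothetical examples, not SEI's actual standards), a minimal stdlib-only Python sketch:

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool

def sli_report(requests, p95_target_ms=2000.0, availability_target=0.995):
    """Compute two basic SLIs (p95 latency, availability) and compare them
    against example SLO targets. Thresholds are illustrative only."""
    latencies = sorted(r.latency_ms for r in requests)
    # p95 via nearest-rank on the sorted latencies
    p95 = latencies[max(0, int(len(latencies) * 0.95) - 1)]
    availability = sum(r.ok for r in requests) / len(requests)
    return {
        "p95_latency_ms": p95,
        "availability": availability,
        "latency_slo_met": p95 <= p95_target_ms,
        "availability_slo_met": availability >= availability_target,
    }

# Example window: 100 requests, 2 failures, latencies 10..1000 ms
reqs = [Request(latency_ms=10.0 * i, ok=(i > 2)) for i in range(1, 101)]
report = sli_report(reqs)
```

In practice these signals would come from App Insights / Log Analytics queries rather than in-process computation; the sketch only shows the shape of the SLI-to-SLO comparison.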
GenAI Controls and Governance
• Operationalize quality and safety controls including evaluation automation, grounding checks, drift detection, and version management.
• Support human‑in‑the‑loop workflows and controlled rollout of model, prompt, and configuration changes.
• Maintain operational documentation and audit‑ready system visibility.
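As a rough sketch of one way embedding drift detection can work (comparing the centroid of a live traffic window against a reference window; the vectors and the cosine-similarity threshold below are hypothetical, stdlib-only):

```python
import math

def mean_vector(embeddings):
    """Element-wise mean (centroid) of equal-length embedding vectors."""
    n = len(embeddings)
    return [sum(dims) / n for dims in zip(*embeddings)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def drift_detected(reference, live, threshold=0.95):
    """Flag drift when the centroids of the reference and live windows
    fall below a (hypothetical) cosine-similarity threshold."""
    return cosine_similarity(mean_vector(reference), mean_vector(live)) < threshold

# Example: live traffic embeddings shifted away from the reference cluster
reference = [[1.0, 0.0], [0.9, 0.1]]
live = [[0.1, 0.9], [0.0, 1.0]]
```

A production setup would use real embedding batches and a tuned threshold, and would feed the signal into alerting rather than a boolean return.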
What makes you a great fit
Aptitude and attitude
• You think like a platform engineer and anticipate failure modes before they reach production.
• You prefer automation, repeatability, and clear abstractions over one‑off solutions.
• You take ownership of systems end to end and communicate clearly with both engineers and stakeholders.
GenAI platform mindset
• You understand how LLMs, embeddings, and RAG systems work, with a focus on operating them reliably.
• You are comfortable enabling fast‑moving GenAI development while enforcing necessary guardrails.
• You want to help automate day-to-day operational tasks using AI.
Qualifications
Required
• 5+ years of production software, platform, or cloud engineering experience using Python.
• Terraform experience is required, including real‑world ownership of infrastructure provisioning and change management.
• Hands‑on experience operating Kubernetes, API gateways, VMSS (or a similar autoscaling compute service), and App Services (or a similar PaaS).
• Knowledge of Azure networking, including VNets, private endpoints, firewall integration, routing, and DNS.
• Strong DevOps foundations including Docker, Kubernetes or AKS, and CI/CD pipelines.
• Experience implementing observability and monitoring for production systems.
Preferred
• Experience operating systems in regulated or enterprise environments.
• Familiarity with feature flags, canary deployments, and progressive delivery.
• Solid understanding of LLM operational concerns such as quotas, rate limits, latency, cost management, and failure modes.
• Experience securing cloud systems using managed identities, RBAC, SSO integration, and network isolation.
Job ID: 147165881