Senior DevOps Engineer (Multi-Stack & LLMOps) - India (Remote/Hybrid)
We are hiring a versatile Senior DevOps Engineer to own automation, deployment, and infrastructure operations across a diverse application ecosystem. This is not a single-stack role: you will support legacy PHP environments, modern Node.js/React applications, high-performance .NET services, and an expanding set of GenAI/LLM-powered features.
The ideal candidate is a polyglot infrastructure engineer who is comfortable operating in both AWS and Azure, and who treats reliability, security, and cost controls as first-class production requirements across traditional web workloads and AI workloads.
Location: India (Remote/Hybrid, depending on city/team needs)
Work Hours Requirement: Must be able to work overlapping hours through 2:30 PM Eastern Time (ET).
What You'll Do (Key Responsibilities)
Multi-Stack CI/CD
- Design, build, and maintain robust CI/CD pipelines for .NET Core, Node.js (React/Express), and PHP (Laravel/Symfony) using GitHub Actions, Azure DevOps, and/or GitLab CI.
- Standardize build, test, security scanning, and release workflows across multiple product lines.
Infrastructure as Code (Hybrid Cloud)
- Manage a hybrid footprint across AWS and Azure using Terraform or Pulumi, ensuring consistent, repeatable environments (dev/stage/prod).
- Improve provisioning speed, environment parity, and drift detection.
Container Orchestration
- Operate production-grade Kubernetes environments (EKS/AKS) including scaling, upgrades, networking, and cluster security.
- Optimize compute for both standard web traffic and AI workloads, including scheduling and capacity planning for resource-intensive services.
LLMOps / GenAI Platform Operations (Highly Desired)
- Build and operate the plumbing for GenAI initiatives, including model-serving stacks and integrations.
- Deploy and manage model-serving containers (e.g., vLLM, Ollama) and support vector database infrastructure (e.g., Pinecone, Milvus).
- Implement operational controls such as:
  - Prompt versioning and lifecycle management (repo-driven workflows, approvals, rollback)
  - Model switching/routing (by cost, latency, quality, and availability across providers like OpenAI/Anthropic and/or self-hosted)
  - Token/usage monitoring, rate-limit governance, and spend controls with cost attribution (by environment, feature, and tenant)
  - Evals/regression testing to catch prompt/model degradation before production impact
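For illustration only, here is a minimal Python sketch of two of the controls listed above: routing by cost, latency, and availability, and spend attribution by environment, feature, and tenant. Every name, price, and latency figure in it (ModelOption, pick_model, SpendLedger, "hosted-small", and so on) is a hypothetical placeholder, not a description of our actual gateway or of any provider's API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str                  # hypothetical label, not a real model ID
    usd_per_1k_tokens: float   # made-up price for illustration
    p95_latency_ms: int        # made-up latency figure
    available: bool = True


def pick_model(options, max_latency_ms):
    """Return the cheapest available option that meets the latency budget."""
    eligible = [
        o for o in options
        if o.available and o.p95_latency_ms <= max_latency_ms
    ]
    if not eligible:
        raise LookupError("no available model meets the latency budget")
    return min(eligible, key=lambda o: o.usd_per_1k_tokens)


class SpendLedger:
    """Accumulates estimated spend keyed by (environment, feature, tenant)."""

    def __init__(self):
        self._usd = defaultdict(float)

    def record(self, environment, feature, tenant, model, tokens):
        # Estimated cost only; a real system would use billed usage data.
        cost = tokens / 1000 * model.usd_per_1k_tokens
        self._usd[(environment, feature, tenant)] += cost
        return cost

    def total(self, environment=None, feature=None, tenant=None):
        """Sum spend over every key matching the filters that were given."""
        wanted = (environment, feature, tenant)
        return sum(
            usd for key, usd in self._usd.items()
            if all(w is None or w == k for w, k in zip(wanted, key))
        )


OPTIONS = [
    ModelOption("hosted-large", usd_per_1k_tokens=0.030, p95_latency_ms=1800),
    ModelOption("hosted-small", usd_per_1k_tokens=0.002, p95_latency_ms=600),
    ModelOption("self-hosted", usd_per_1k_tokens=0.001, p95_latency_ms=900,
                available=False),
]

ledger = SpendLedger()
model = pick_model(OPTIONS, max_latency_ms=1000)
ledger.record("prod", "chat", "tenant-a", model, tokens=5000)
print(model.name, ledger.total(environment="prod"))
```

In practice these controls would likely sit behind a gateway/proxy with real billing data, quality signals, and budget alerts; the sketch only shows the shape of the decision and the attribution keys.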
Observability & Reliability Engineering
- Implement end-to-end observability for services and pipelines (metrics, logs, traces) using tools such as OpenTelemetry, Grafana, Datadog, or New Relic.
- Build alerting and runbooks; participate in incident response, root-cause analysis, and reliability improvements.
Database & Data Platform Support
- Support a range of data needs across relational systems (SQL Server, MySQL, PostgreSQL) and modern stores including NoSQL and vector databases.
- Assist with backup/restore strategies, performance tuning basics, and production readiness.
Security & Compliance
- Implement consistent security controls across CI/CD (SAST/DAST, dependency scanning, container scanning).
- Manage secrets and key material with AWS Secrets Manager and/or Azure Key Vault.
- Enforce least-privilege IAM/RBAC patterns across cloud and Kubernetes.
Twilio / Real-Time Communications (Nice-to-Have)
- Support production usage of Twilio Voice/Video/SMS including secure webhook configuration, operational monitoring, and reliability concerns for real-time workflows.
Required Qualifications (Must Have)
Cloud & Platform Engineering
- Hands-on experience in both AWS and Azure, including compute, networking, identity, managed services, and deployment patterns.
Application Delivery Across Multiple Stacks
- Proven experience deploying and scaling:
  - .NET/C# (IIS, Kestrel, Azure App Service)
  - Node.js/React (Nginx, PM2, S3/CloudFront or equivalent)
  - PHP (FPM, Apache/Nginx, Composer)
Automation & Orchestration
- Strong production experience with Docker and Kubernetes (required).
Infrastructure as Code
- Strong experience with Terraform or Pulumi in real-world production environments.
CI/CD & Developer Enablement
- Deep experience with CI/CD tooling (GitHub Actions/Azure DevOps/GitLab CI) and the ability to improve developer velocity safely.
Operational Excellence
- Strong troubleshooting skills across application, infrastructure, networking, and Kubernetes layers.
- Experience supporting production systems, on-call rotation, and incident response.
Highly Desired (LLMOps / GenAI Operations)
- Prompt lifecycle management: prompt repositories, versioning, templating, approvals, and rollback.
- Model operations: model switching/routing across OpenAI/Anthropic/self-hosted options, with gateway/proxy patterns and policy enforcement.
- Usage & cost governance: token monitoring, per-tenant attribution, budget alerts, and rate-limit controls.
- Quality workflows: eval harnesses, regression testing for prompts/models, A/B testing, safe rollout strategies.
- Vector + retrieval operations: operating Pinecone/Milvus and supporting retrieval pipelines.
Nice-to-Have
- Twilio Voice/Video/SMS in production (webhooks, auth, monitoring, incident response).
- GPU scheduling/optimization experience in Kubernetes.
- Experience with service mesh, policy-as-code, or advanced cluster security (OPA/Gatekeeper, Kyverno).
What Success Looks Like
- Stable, repeatable releases across multiple stacks with minimal manual work.
- Clean infrastructure workflows with low drift and fast environment provisioning.
- Reliable Kubernetes operations with strong security posture and observability.
- Production-grade GenAI ops: prompt versioning, model switching, token/cost monitoring, and quality guardrails.
Work Authorization / Schedule
- Role is based in India.
- Candidate must be able to work overlapping hours through 2:30 PM Eastern Time (ET).