Senior DevOps Engineer (Multi-Stack & LLMOps)

Pumex Computing, LLC

India

Job Description

Senior DevOps Engineer (Multi-Stack & LLMOps) - India (Remote/Hybrid)

We are hiring a versatile Senior DevOps Engineer to own automation, deployment, and infrastructure operations across a diverse application ecosystem. This is not a single-stack role: you will support legacy PHP environments, modern Node.js/React applications, high-performance .NET services, and an expanding set of GenAI/LLM-powered features.

The ideal candidate is a polyglot infrastructure engineer who is comfortable operating in both AWS and Azure, and who treats reliability, security, and cost controls as first-class production requirements across traditional web workloads and AI workloads.

Location: India (Remote/Hybrid, depending on city/team needs)

Work Hours Requirement: Must be able to work hours that overlap with the U.S. Eastern Time business day through 2:30 PM ET.

What You'll Do (Key Responsibilities)

Multi-Stack CI/CD

Design, build, and maintain robust CI/CD pipelines for .NET Core, Node.js (React/Express), and PHP (Laravel/Symfony) using GitHub Actions, Azure DevOps, and/or GitLab CI.
Standardize build, test, security scanning, and release workflows across multiple product lines.

Infrastructure as Code (Hybrid Cloud)

Manage a hybrid footprint across AWS and Azure using Terraform or Pulumi, ensuring consistent, repeatable environments (dev/stage/prod).
Improve provisioning speed, environment parity, and drift detection.

Container Orchestration

Operate production-grade Kubernetes environments (EKS/AKS) including scaling, upgrades, networking, and cluster security.
Optimize compute for both standard web traffic and AI workloads, including scheduling and capacity planning for resource-intensive services.

LLMOps / GenAI Platform Operations (Highly Desired)

Build and operate the plumbing for GenAI initiatives, including model-serving stacks and integrations.
Deploy and manage model-serving containers (e.g., vLLM, Ollama) and support vector database infrastructure (e.g., Pinecone, Milvus).
Implement operational controls such as:
Prompt versioning and lifecycle management (repo-driven workflows, approvals, rollback)
Model switching/routing (by cost, latency, quality, and availability across providers like OpenAI/Anthropic and/or self-hosted)
Token/usage monitoring, rate-limit governance, and spend controls with cost attribution (by environment, feature, and tenant)
Evals/regression testing to catch prompt/model degradation before production impact

Observability & Reliability Engineering

Implement end-to-end observability for services and pipelines (metrics, logs, traces) using tools such as OpenTelemetry, Grafana, Datadog, or New Relic.
Build alerting and runbooks; participate in incident response, root-cause analysis, and reliability improvements.

Database & Data Platform Support

Support a range of data needs across relational systems (SQL Server, MySQL, PostgreSQL) and modern stores including NoSQL and vector databases.
Assist with backup/restore strategies, basic performance tuning, and production readiness.

Security & Compliance

Implement consistent security controls across CI/CD (SAST/DAST, dependency scanning, container scanning).
Manage secrets and key material with AWS Secrets Manager and/or Azure Key Vault.
Enforce least-privilege IAM/RBAC patterns across cloud and Kubernetes.

Twilio / Real-Time Communications (Nice-to-Have)

Support production usage of Twilio Voice/Video/SMS including secure webhook configuration, operational monitoring, and reliability concerns for real-time workflows.

Required Qualifications (Must Have)

Cloud & Platform Engineering

Hands-on experience in both AWS and Azure, including compute, networking, identity, managed services, and deployment patterns.

Application Delivery Across Multiple Stacks

Proven experience deploying and scaling:
.NET/C# (IIS, Kestrel, Azure App Service)
Node.js/React (Nginx, PM2, S3/CloudFront or equivalent)
PHP (FPM, Apache/Nginx, Composer)

Automation & Orchestration

Strong production experience with Docker and Kubernetes (required).

Infrastructure as Code

Strong experience with Terraform or Pulumi in real-world production environments.

CI/CD & Developer Enablement

Deep experience with CI/CD tooling (GitHub Actions/Azure DevOps/GitLab CI) and the ability to improve developer velocity safely.

Operational Excellence

Strong troubleshooting skills across application, infrastructure, networking, and Kubernetes layers.
Experience supporting production systems, including on-call rotations and incident response.

Highly Desired (LLMOps / GenAI Operations)

Prompt lifecycle management: prompt repositories, versioning, templating, approvals, and rollback.
Model operations: model switching/routing across OpenAI/Anthropic/self-hosted options, with gateway/proxy patterns and policy enforcement.
Usage & cost governance: token monitoring, per-tenant attribution, budget alerts, and rate-limit controls.
Quality workflows: eval harnesses, regression testing for prompts/models, A/B testing, safe rollout strategies.
Vector + retrieval operations: operating Pinecone/Milvus and supporting retrieval pipelines.

Nice-to-Have

Twilio Voice/Video/SMS in production (webhooks, auth, monitoring, incident response).
GPU scheduling/optimization experience in Kubernetes.
Experience with service mesh, policy-as-code, or advanced cluster security (OPA/Gatekeeper, Kyverno).

What Success Looks Like

Stable, repeatable releases across multiple stacks with minimal manual work.
Clean infrastructure workflows with low drift and fast environment provisioning.
Reliable Kubernetes operations with strong security posture and observability.
Production-grade GenAI ops: prompt versioning, model switching, token/cost monitoring, and quality guardrails.

Work Authorization / Schedule