neurodiscovery ai

Lead DevOps Engineer

  • Posted 7 hours ago

Job Description

About The Role

We are building an Agentic AI platform for Neurology that processes sensitive clinical data across a multi-tenant, multi-cloud environment serving a large number of clients. Each tenant operates in a hybrid setup — a mix of cloud-hosted services and on-premise appliances deployed at client sites. We're looking for a Lead DevOps Engineer to own the infrastructure, establish production-grade orchestration, harden security across tenant boundaries, and build scalable, repeatable delivery pipelines. This is a hands-on leadership role. You'll shape the DevOps roadmap, enforce engineering discipline across deployments, and mentor a small team — all while keeping a complex hybrid, multi-tenant environment running reliably.

What You'll Do

Multi-Cloud & Multi-Tenant Infrastructure

  • Design and manage infrastructure across AWS and GCP, ensuring consistent networking, security, and deployment patterns across both clouds.
  • Architect tenant-isolated environments with secure VPC networking: private subnets with no public-facing IPs, plus VPC peering, endpoints, and VPN connectivity.
  • Build and operate production Kubernetes clusters to host containerized microservices at scale.
  • Define the strategy for which workloads run where — cloud vs. on-premise — based on data sensitivity, latency, and compliance requirements.

CI/CD & Deployment Governance

  • Own and evolve a centralized, modular CI/CD pipeline built on GitHub Actions as the single path to production.
  • Eliminate direct developer access to production environments; implement controlled deployment workflows using session-based access tools (e.g., AWS SSM Session Manager).
  • Establish branch protection, image signing, environment promotion gates, and tenant-aware deployment strategies.

On-Premise Appliance Management

  • Oversee configuration management for client-site appliances using a tool such as Chef in a client-server architecture.
  • Drive the strategy to progressively centralize microservices into cloud-hosted infrastructure, minimizing the on-premise footprint.
  • Define remote access procedures, failure runbooks, and contingency workflows for on-premise hardware.

Security & Compliance

  • Enforce infrastructure security best practices for a healthcare environment handling PHI and de-identified clinical data across tenant boundaries.
  • Manage VPN-based access to private cloud networks and implement least-privilege IAM, secrets management, and policy-as-code across all environments.
  • Ensure tenant data isolation at the network, storage, and compute layers.

Monitoring, Reliability & Backups

  • Build and maintain unified observability using Prometheus and Grafana across cloud and on-premise environments.
  • Own the backup and disaster recovery strategy — container registries, automated snapshots, and cross-cloud resilience.
  • Define and track SLOs for critical data pipelines and tenant-facing services.

Team & Process

  • Mentor junior DevOps/infrastructure engineers and collaborate closely with data engineering, AI, and IT teams.
  • Recommend and help hire for supporting roles (e.g., IT support for on-premise hardware operations).
  • Establish DevOps standards, documentation, and runbooks for the team.

What We're Looking For

Must Have

  • 6+ years of DevOps/Infrastructure/SRE experience, with at least 2 years in a lead or senior capacity.
  • Production experience across AWS and GCP — VPCs, IAM, compute, storage, and managed services on both platforms.
  • Hands-on experience running Kubernetes in production — cluster lifecycle, Helm charts, service mesh, autoscaling, and troubleshooting.
  • Deep expertise in CI/CD design using GitHub Actions (or comparable platforms) with a focus on security and governance.
  • Strong understanding of multi-tenant architecture patterns — network isolation, tenant-aware deployments, and data segregation.
  • Solid Docker and container lifecycle management experience.
  • Infrastructure-as-Code proficiency with Terraform (multi-provider) or equivalent.
  • Networking fundamentals — VPNs, VPCs, DNS, firewalls, load balancers, and zero-trust architectures.
  • Comfort with Python and shell scripting for automation.
  • A production-first, outcome-oriented mindset — you measure success by what's running reliably in production, not by what's in a slide deck. Customer value over story-point velocity.
  • Excellent communication skills — you can translate complex technical concepts for both engineering peers and business stakeholders.

Good to Have

  • 1+ years of DevOps team management experience: you've directly managed DevOps engineers, run standups, handled performance conversations, and built team culture.
  • Experience with the AI-native stack: vector databases (Pinecone, Weaviate, pgvector), RAG pipelines, feature stores, LLM orchestration frameworks (LangChain, LlamaIndex), and ML pipeline tooling (MLflow, Kubeflow, SageMaker).
  • Experience in healthcare, life sciences, or any environment with strict data privacy requirements (HIPAA, PHI handling).
  • Experience with configuration management tools such as Chef, Ansible, or Puppet.
  • Familiarity with Elasticsearch operations and management.
  • Experience managing hybrid environments with on-premise hardware alongside cloud infrastructure.
  • Exposure to Prometheus, Grafana, and alerting pipeline design.
  • Background working with data engineering teams running ETL/ELT pipelines.


Job ID: 147259305
