Senior Cloud / DevOps / AI/ML Engineer (AWS Platform & MLOps)

  • Posted 13 days ago

Job Description

Responsibilities

Role Summary

Own our secure, multi-account AWS foundation and the MLOps/GenAI platform that powers clinician matching, document processing, and safety tooling. You blend SRE discipline with ML platform pragmatism to deliver compliant, observable, and cost-efficient infrastructure.

Key Responsibilities

  • Build and operate a secure AWS landing zone (Organizations, Control Tower), VPC architecture, private networking, and multi-account guardrails.
  • Design CI/CD and IaC at scale (GitHub Actions/CodeBuild/CodePipeline, Terraform and/or AWS CDK); policy-as-code (Open Policy Agent, AWS SCPs).
  • Run compute fabrics for services and data: Amazon EKS (preferred) and ECS Fargate; autoscaling (HPA/Karpenter), cluster security (IRSA, PodSecurity).
  • Observability platform: AWS Distro for OpenTelemetry, CloudWatch, Prometheus/Grafana, X-Ray; golden signals, SLOs, incident response, and on-call.
  • Security-by-default: IAM least-privilege, KMS envelope encryption, Secrets Manager/Parameter Store, AWS WAF/Shield, artifact signing, SBOM/SLSA.
  • Resiliency engineering: multi-AZ baselines, chaos testing, backup/DR (AWS Backup), game days; cost management with CUR/Budgets/rightsizing.
  • MLOps: SageMaker projects/pipelines, model registry, feature store, inference endpoints; safe deployment patterns (shadow/canary/A/B) and data drift monitoring.
  • GenAI: Amazon Bedrock integration (guardrails, content filters, PII redaction), retrieval with vector indexes (pgvector on Aurora or OpenSearch k-NN).
  • Data platform enablement with S3/Lake Formation/Glue/Athena/EMR; secure data paths for training/serving; governance and auditability.
  • Champion DevSecOps: threat modeling, SBOM scanning, container/image hardening, and secure software supply chain.

Desired Candidate Profile

Required Qualifications

  • 7+ years building/operating cloud platforms; deep hands-on experience with AWS (networking, IAM, compute, storage, security).
  • Strong Terraform and/or AWS CDK skills; GitOps and CI/CD at scale; Linux, containers, and Kubernetes (EKS) in production.
  • Operational excellence: SRE practices, SLOs/error budgets, incident management, on-call, and postmortem culture.
  • MLOps experience with SageMaker or equivalent; data pipelines for feature engineering; real-time/batch inference and monitoring.
  • Experience with Bedrock/OpenSearch/pgvector for RAG and vector search; understanding of prompt/response safety and audit trails.
  • Security/compliance literacy (GDPR, logging/retention, key management, network isolation).

Nice to Have

  • AWS certifications (Solutions Architect Pro, Security, Data/ML).
  • Experience with FHIR/HL7 integrations and healthcare-grade identity (OIDC, SMART on FHIR).
  • Background in cost optimization, FinOps, and incident response leadership.

How We Work & Benefits

  • Influence the platform architecture end-to-end; work with a small, senior team.
  • Remote-friendly; pairing and design reviews; continuous improvement culture.
  • Mission with impact: your reliability and ML tooling improve access to care daily.

Compliance & Notes

  • All workloads run in EU regions (e.g., eu-central-1); strict data residency and encryption baselines.
  • GenAI usage must be privacy-preserving with opt-in consent and redaction for PHI/PII; comprehensive audit logs maintained.

Job ID: 142727341