Responsibilities
Role Summary
Own our secure, multi-account AWS foundation and the MLOps/GenAI platform that powers
clinician matching, document processing, and safety tooling. You blend SRE discipline with ML
platform pragmatism to deliver compliant, observable, and cost-efficient infrastructure.
Key Responsibilities
- Build and operate a secure AWS landing zone (Organizations, Control Tower), VPC
architecture, private networking, and multi-account guardrails.
- Design CI/CD and IaC at scale (GitHub Actions/CodeBuild/CodePipeline, Terraform and/or
AWS CDK); policy-as-code (Open Policy Agent, AWS SCPs).
- Run compute fabrics for services and data: Amazon EKS (preferred) and ECS Fargate;
autoscaling (HPA, Karpenter) and cluster security (IRSA, Pod Security Admission).
- Observability platform: AWS Distro for OpenTelemetry, CloudWatch, Prometheus/Grafana,
X-Ray; golden signals, SLOs, incident response and on-call.
- Security-by-default: IAM least-privilege, KMS envelope encryption, Secrets
Manager/Parameter Store, AWS WAF/Shield, artifact signing, SBOM/SLSA.
- Resiliency engineering: multi-AZ baselines, chaos testing, backup/DR (AWS Backup), game
days; cost management with CUR/Budgets/rightsizing.
- MLOps: SageMaker projects/pipelines, model registry, feature store, inference endpoints;
safe deployment patterns (shadow, canary, A/B) and data drift monitoring.
- GenAI: Amazon Bedrock integration (guardrails, content filters, PII redaction), retrieval with
vector indexes (pgvector on Aurora or OpenSearch k-NN).
- Data platform enablement with S3/Lake Formation/Glue/Athena/EMR; secure data paths for
training/serving; governance and auditability.
- Champion DevSecOps: threat modeling, SBOM scanning, container/image hardening, and
secure software supply chain.
Desired Candidate Profile
Required Qualifications
- 7+ years building/operating cloud platforms; deep hands-on with AWS (networking, IAM,
compute, storage, security).
- Strong Terraform and/or AWS CDK skills; GitOps and CI/CD at scale; Linux, containers,
Kubernetes (EKS) in production.
- Operational excellence: SRE practices, SLO/error budgets, incident management, on-call, and
postmortem culture.
- MLOps experience with SageMaker or equivalent; data pipelines for feature engineering;
real-time/batch inference and monitoring.
- Experience with Bedrock/OpenSearch/pgvector for RAG and vector search; understanding of
prompt/response safety and audit trails.
- Security/compliance literacy (GDPR, logging/retention, key management, network isolation).
Nice to Have
- AWS certifications (Solutions Architect Professional, Security, Data/ML).
- Experience with FHIR/HL7 integrations and healthcare-grade identity (OIDC, SMART on
FHIR).
- Background in cost optimization, FinOps, and incident response leadership.
How We Work & Benefits
- Influence the platform architecture end-to-end; work with a small, senior team.
- Remote-friendly; pairing and design reviews; continuous improvement culture.
- Mission with impact: your reliability and ML tooling improve access to care daily.
Compliance & Notes
- All workloads run in EU regions (e.g., eu-central-1); strict data residency and encryption
baselines.
- GenAI usage must be privacy-preserving, with opt-in consent and redaction of PHI/PII;
comprehensive audit logs are maintained.