
Role: Principal Site Reliability Engineer (SRE) – Data Platforms
Role Summary
Own reliability, support, and operations of enterprise data platforms (Trust3 AI, Snowflake, Databricks)
with a primary focus on Google Cloud Platform (GCP). This is a deeply hands-on Principal SRE role
combining managed services ownership, advanced production engineering, and reliability at scale.
What You'll Do
● Own end-to-end platform lifecycle and managed services delivery: installation, operations,
upgrades, optimization, and continuous platform health
● Take full ownership of critical production incidents, including deep debugging, root cause analysis (RCA), and permanent fixes
● Troubleshoot complex, cross-system issues across GCP (GKE, IAM, networking), data platforms, and connectors
● Lead performance tuning, scalability optimization, and system hardening for high-throughput systems
● Design and implement automation across deployments, monitoring, and operations
● Manage secrets and secure integrations using Vault (or similar) within platform and CI/CD workflows
● Install, upgrade, and operate Trust3 AI on GCP (GKE) across multi-region environments
● Ensure accurate and reliable enforcement of data access policies
● Build and enhance observability (metrics, logs, alerts) for proactive issue detection
● Eliminate operational toil through continuous reliability improvements
● Own issues end-to-end with strong stakeholder communication and SLA adherence
● Collaborate with Engineering and Product to resolve issues and influence platform improvements
● Lead managed services operations including monitoring, incident prevention, capacity planning,
DR readiness, and service-level outcomes (SLA, uptime, upgrade timelines)
Skills Required
● Cloud: Strong expertise in GCP (GKE, IAM, BigQuery, GCS, VPC, Cloud Monitoring/Logging); AWS/Azure exposure is a plus
● Data Platforms: Snowflake, Databricks, BigQuery
● Infra & CI/CD: Kubernetes, Helm, CI/CD (GitHub Actions, GitLab CI, or similar), Terraform (preferred)
● Scripting: Python / Bash
● Observability: Prometheus, Grafana, ELK
● Security: IAM, RBAC/ABAC, data governance (Trust3 AI/Ranger preferred), secrets management (Vault or similar)
Experience
● 10+ years in SRE / DevOps / Production Engineering
● Strong expertise in debugging distributed systems and complex production environments
● Proven ownership of high-severity incidents and large-scale production systems
● Demonstrated ability to independently solve ambiguous, high-impact technical problems
● Track record of driving reliability, automation, and operational excellence at scale
● Experience running high-throughput, always-on (24x7) systems with large data volumes and strict uptime SLAs
Why This Role
● Principal-level, deeply hands-on IC role (no people management)
● End-to-end ownership of mission-critical data platforms
● Work on complex production challenges across cloud, data, and security layers
● High impact on enterprise data access, governance, and reliability
Important Note
This is a production-first role involving end-to-end incident ownership, deep technical problem solving,
and managed services operations. It is not a pure DevOps/build-only role or a people management role.
Job ID: 147133855