Privacera

Principal Site Reliability Engineer (SRE)

10-12 Years
  • Posted 3 hours ago

Job Description

Role: Principal Site Reliability Engineer (SRE) – Data Platforms

Role Summary

Own reliability, support, and operations of enterprise data platforms (Trust3 AI, Snowflake, Databricks) with a primary focus on Google Cloud Platform (GCP). This is a deeply hands-on Principal SRE role combining managed services ownership, advanced production engineering, and reliability at scale.

What You'll Do

● Own end-to-end platform lifecycle and managed services delivery: installation, operations, upgrades, optimization, and continuous platform health

● Take full ownership of critical production incidents with deep debugging, RCA, and permanent fixes

● Troubleshoot complex, cross-system issues across GCP (GKE, IAM, networking), data platforms, and connectors

● Lead performance tuning, scalability optimization, and system hardening for high-throughput systems

● Design and implement automation across deployments, monitoring, and operations

● Manage secrets and secure integrations using Vault (or similar) within platform and CI/CD workflows

● Install, upgrade, and operate Trust3 AI on GCP (GKE) across multi-region environments

● Ensure accurate and reliable enforcement of data access policies

● Build and enhance observability (metrics, logs, alerts) for proactive issue detection

● Eliminate operational toil through continuous reliability improvements

● Own issues end-to-end with strong stakeholder communication and SLA adherence

● Collaborate with Engineering and Product to resolve issues and influence platform improvements

● Lead managed services operations including monitoring, incident prevention, capacity planning, DR readiness, and service-level outcomes (SLA, uptime, upgrade timelines)

Skills Required

● Cloud: Strong expertise in GCP (GKE, IAM, BigQuery, GCS, VPC, Cloud Monitoring/Logging); AWS/Azure exposure is a plus

● Data Platforms: Snowflake, Databricks, BigQuery

● Infra & CI/CD: Kubernetes, Helm, CI/CD (GitHub Actions, GitLab CI, or similar), Terraform (preferred)

● Scripting: Python / Bash

● Observability: Prometheus, Grafana, ELK

● Security: IAM, RBAC/ABAC, data governance (Trust3 AI/Ranger preferred), secrets management (Vault or similar)

Experience

● 10+ years in SRE / DevOps / Production Engineering

● Strong expertise in debugging distributed systems and complex production environments

● Proven ownership of high-severity incidents and large-scale production systems

● Demonstrated ability to independently solve ambiguous, high-impact technical problems

● Track record of driving reliability, automation, and operational excellence at scale

● Experience running high-throughput, always-on (24x7) systems with large data volumes and strict uptime SLAs

Why This Role

● Principal-level, deeply hands-on IC role (no people management)

● End-to-end ownership of mission-critical data platforms

● Work on complex production challenges across cloud, data, and security layers

● High impact on enterprise data access, governance, and reliability

Important Note

This is a production-first role involving end-to-end incident ownership, deep technical problem solving, and managed services operations, not a pure DevOps/build-only or people management role.

Job ID: 147133855