The GCP / SRE / DevOps Engineer is responsible for maintaining the stability, performance, and automation of our cloud infrastructure. In this role, you will focus on the technical execution of Data in Flight health, ensuring that streaming and batch data pipelines operate with zero message loss. You will work closely with development teams to build and maintain CI/CD pipelines, manage containerized applications on GKE, and implement robust monitoring solutions. This is an engineering-heavy role where you will proactively identify bottlenecks and replace manual operational tasks with automated, scalable solutions within the Google Cloud ecosystem.
Responsibilities
- Define and monitor SLOs, SLIs, and error budgets for streaming and batch data platforms.
- Set up and manage observability dashboards in Grafana and Cloud Monitoring to track the Four Golden Signals (latency, traffic, errors, and saturation).
- Participate in on-call rotations via PagerDuty, providing rapid response to system alerts and executing technical fixes during incidents.
- Conduct post-incident reviews and drive reliability improvements.
- Implement distributed tracing using OpenTelemetry to diagnose latency and message loss across services.
- Monitor and optimize cloud costs using GCP billing insights and FinOps best practices.
- Automate operational runbooks to reduce manual intervention.
- Manage and monitor GKE clusters, focusing on pod reliability, resource scaling, and network configuration.
- Monitor the availability and latency of Kafka brokers, Pub/Sub topics, and AlloyDB instances to ensure persistent data integrity.
- Support the configuration and health of Apigee gateways to ensure secure and efficient API traffic.
- Manage the operational health of BigQuery, Dataflow jobs, and Cloud Composer DAGs to support reliable data processing.
- Develop and maintain cloud infrastructure using Terraform, ensuring all environments are version-controlled and reproducible.
- Build and support automated CI/CD workflows using GitHub and GitHub Actions to streamline deployments for Cloud Functions, Dataflow, and Spring Boot microservices.
Qualifications
- 3 to 5 years of experience in DevOps or Site Reliability Engineering, with a strong focus on Google Cloud Platform (GCP).
- Hands-on experience with Grafana, Cloud Monitoring, and PagerDuty.
- Understanding of streaming technologies like Kafka or Pub/Sub and experience supporting relational databases like AlloyDB.
- Familiarity with Spring Boot microservices, BigQuery, and Cloud Composer.
- Proficient in Docker and Kubernetes (GKE), including experience with container orchestration and lifecycle management.
- Solid experience with Terraform and GitHub Actions. Familiarity with shell scripting or Python for automation is essential.
- Good understanding of SQL and experience in at least one programming language.
- Strong troubleshooting skills with the ability to analyze logs and performance metrics to identify root causes.