We are looking for a skilled Site Reliability Engineer (SRE) with strong expertise in Google Cloud Platform (GCP) and Java-based applications. The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of production systems, while driving automation and operational excellence.
Key Responsibilities
- Ensure high availability, performance, and scalability of applications hosted on GCP.
- Design, build, and maintain reliable and scalable infrastructure using SRE principles.
- Monitor system health using tools such as Cloud Monitoring (formerly Stackdriver), Prometheus, and Grafana.
- Troubleshoot production issues across services, application layers, and infrastructure.
- Collaborate with development teams to improve the reliability and performance of Java-based applications.
- Implement CI/CD pipelines and automate deployments.
- Develop and maintain runbooks, playbooks, and incident response processes.
- Drive incident management, root cause analysis (RCA), and postmortems.
- Optimize cost, performance, and resource utilization on GCP.
- Implement observability frameworks covering logging, tracing, and alerting.
Required Skills
- Strong experience with Google Cloud Platform (GCP) services (Compute Engine, GKE, Cloud Run, BigQuery, Cloud Storage).
- Proficiency with Java/J2EE applications, including debugging production issues.
- Experience with containerization (Docker) and orchestration tools like Kubernetes (GKE).
- Hands-on experience with monitoring and logging tools (Prometheus, Grafana, ELK, Cloud Monitoring).
Skills: Java, GCP, DevOps