Unified Dashboards, Elastic Stack (ELK), Loki, Splunk, Dynatrace, Datadog, Grafana, New Relic, Azure, Python, GitLab, Jenkins, Ansible, Terraform, DevOps, SLO/SLA Monitoring, Incident Response, Root Cause Analysis (RCA), E2E Implementation
Description
GSPANN is hiring a Senior Site Reliability Engineer (SRE) to join our team in Pune or Hyderabad. This full-time role focuses on enhancing the reliability, scalability, and observability of global cloud-based systems through automation, performance tuning, and modern DevOps practices.
Location: Pune / Hyderabad
Role Type: Full Time
Published On: 30 May 2025
Experience: 6 - 10 Years
Role and Responsibilities
- Manage and support production environments on cloud platforms, with a strong preference for Microsoft Azure.
- Apply expertise in observability tools such as Dynatrace, Splunk, Datadog, Grafana, and New Relic to monitor system health.
- Implement modern observability practices including end-to-end (E2E) instrumentation, telemetry, and unified dashboard creation.
- Drive organizational change by influencing senior leadership and improving SRE practices company-wide.
- Write automation scripts using Python (strongly preferred) to streamline operations and eliminate manual effort.
- Deploy cloud infrastructure using tools like Ansible, Terraform, and Azure DevOps.
- Work confidently with Continuous Integration/Continuous Deployment (CI/CD) tools such as GitLab, Jenkins, Bamboo, Travis CI, and CircleCI.
- Operate and orchestrate containerized environments using Kubernetes and Docker.
- Troubleshoot complex issues and provide reliable, scalable solutions.
- Embrace continuous learning and demonstrate a strong passion for automation and process improvement.
- Use logging stacks like ELK (Elasticsearch, Logstash, and Kibana), Loki, and Splunk to maintain visibility and traceability.
- Influence organizational adoption of Infrastructure as Code (IaC) and CI/CD methodologies.
- Define and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs).
- Lead incident response efforts and perform Root Cause Analysis (RCA) to minimize recurrence.
Skills and Experience
- Bachelor's degree in Computer Science, Information Science, Engineering, or a related discipline.
- 6+ years of experience in Site Reliability Engineering (SRE) or DevOps roles, with a focus on cloud-based production systems.
- Proven ability to ensure the availability, low latency, performance, and cost efficiency of global e-commerce platforms.
- Experience designing and maintaining full-stack observability solutions, including dashboards and standardized instrumentation.
- Experience implementing advanced monitoring and alerting systems tailored for both internal engineering teams and external stakeholders.
- Track record of advocating for SRE best practices and promoting operational excellence across teams and departments.
- Ability to collaborate with engineering, product, and operations teams to increase reliability and accelerate delivery timelines.
- Experience building automation tools that support incident response, system recovery, and software delivery pipelines.
- Experience tracking and maintaining error budgets, meeting defined SLOs, and maintaining high uptime for mission-critical services.
- Ability to identify system bottlenecks and anomalies proactively and to maintain performance under peak loads.
- Experience automating infrastructure management to reduce costs and scale efficiently during traffic surges.
- Ability to lead strategic, cross-functional initiatives that enhance overall system architecture and reliability.