Site Reliability Engineering (SRE), Datadog, Splunk, Grafana, Continuous Integration / Continuous Delivery (CI/CD), Amazon Web Services, Microsoft Azure, Google Cloud Platform
Description
GSPANN is hiring a Site Reliability Engineer with expertise in Datadog to design and manage enterprise observability and monitoring solutions. The role focuses on improving system reliability, implementing SLO-driven practices, and driving automation across cloud and distributed environments.
Location: Hyderabad
Role Type: Full Time
Published On: 31 March 2026
Experience: 5+ Years
Role and Responsibilities
- Design, implement, and maintain monitoring, logging, and distributed tracing solutions using Datadog.
- Build Service Level Objective (SLO), Service Level Agreement (SLA), and status dashboards to provide real-time visibility into system health and performance.
- Collaborate with engineering, infrastructure, and business teams to integrate observability practices into applications and platforms.
- Identify gaps in monitoring coverage and recommend improvements to enhance visibility and reliability.
- Drive automation for efficient collection, storage, and analysis of observability data.
- Support incident response activities, perform root cause analysis (RCA), and contribute to problem management processes.
- Establish and enforce best practices for system reliability and monitoring standards.
- Balance operational support responsibilities with strategic reliability and performance improvement initiatives.
- Analyze system trends to proactively prevent incidents and performance degradation.
- Recommend and implement solutions to improve system reliability, scalability, and resilience.
- Stay updated with industry trends in Site Reliability Engineering (SRE) and observability practices.
- Mentor junior engineers and promote a culture of continuous learning and improvement.
Skills and Experience
- Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
- 5–8+ years of experience in Software Engineering, Site Reliability Engineering (SRE), or operations roles with a strong focus on observability.
- Demonstrate strong hands-on experience with Datadog, including Application Performance Monitoring (APM), Infrastructure Monitoring, log management, Real User Monitoring (RUM), and Synthetic Monitoring.
- Work with logging platforms, metrics collection systems, and distributed tracing frameworks.
- Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets effectively.
- Apply strong analytical and troubleshooting skills to diagnose and resolve complex system issues.
- Communicate effectively and collaborate with cross-functional teams.
- Drive automation initiatives to improve system reliability and operational efficiency.
- Utilize additional monitoring and observability tools such as Splunk, Grafana, AppDynamics, and Prometheus.
- Work with cloud platforms including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
- Demonstrate hands-on experience with Kubernetes and container-based observability.
- Implement Infrastructure as Code (IaC) practices using tools such as Terraform and AWS CloudFormation.
- Integrate observability practices into Continuous Integration / Continuous Delivery (CI/CD) pipelines.
- Hold relevant certifications such as Certified Kubernetes Administrator (CKA) or Terraform Associate (preferred).