Site Reliability Engineer - Datadog

GSPANN Technologies, Inc.

Hyderabad, India

5-7 Years

Job Description

Site Reliability Engineering (SRE), Datadog, Splunk, Grafana, Continuous Integration / Continuous Delivery (CI/CD), Amazon Web Services, Microsoft Azure, Google Cloud Platform

Description

GSPANN is hiring a Site Reliability Engineer with expertise in Datadog to design and manage enterprise observability and monitoring solutions. The role focuses on improving system reliability, implementing SLO-driven practices, and driving automation across cloud and distributed environments.

Location: Hyderabad

Role Type: Full Time

Published On: 31 March 2026

Experience: 5+ Years

Role and Responsibilities

Design, implement, and maintain monitoring, logging, and distributed tracing solutions using Datadog.
Build Service Level Objective (SLO), Service Level Agreement (SLA), and status dashboards to provide real-time visibility into system health and performance.
Collaborate with engineering, infrastructure, and business teams to integrate observability practices into applications and platforms.
Identify gaps in monitoring coverage and recommend improvements to enhance visibility and reliability.
Drive automation for efficient collection, storage, and analysis of observability data.
Support incident response activities, perform root cause analysis (RCA), and contribute to problem management processes.
Establish and enforce best practices for system reliability and monitoring standards.
Balance operational support responsibilities with strategic reliability and performance improvement initiatives.
Analyze system trends to proactively prevent incidents and performance degradation.
Recommend and implement solutions to improve system reliability, scalability, and resilience.
Stay updated with industry trends in Site Reliability Engineering (SRE) and observability practices.
Mentor junior engineers and promote a culture of continuous learning and improvement.

Skills and Experience

Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.
5–8+ years of experience in Software Engineering, Site Reliability Engineering (SRE), or operations roles with a strong focus on observability.
Demonstrate strong hands-on experience with Datadog, including Application Performance Monitoring (APM), Infrastructure Monitoring, log management, Real User Monitoring (RUM), and Synthetic Monitoring.
Work with logging platforms, metrics collection systems, and distributed tracing frameworks.
Define and implement Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets effectively.
Apply strong analytical and troubleshooting skills to diagnose and resolve complex system issues.
Communicate effectively and collaborate with cross-functional teams.
Drive automation initiatives to improve system reliability and operational efficiency.
Utilize additional monitoring and observability tools such as Splunk, Grafana, AppDynamics, and Prometheus.
Work with cloud platforms including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).
Apply hands-on experience with Kubernetes and container-based observability.
Implement Infrastructure as Code (IaC) practices using tools such as Terraform and AWS CloudFormation.
Integrate observability practices into Continuous Integration / Continuous Delivery (CI/CD) pipelines.
Hold relevant certifications such as Certified Kubernetes Administrator (CKA) or Terraform Associate (preferred).