Site Reliability Engineer
Pune - Kharadi (Hybrid, 3 days/week in office)
Full-time - Creospan
Role Overview
We are seeking a highly motivated Site Reliability Engineer (SRE) with strong expertise in Dynatrace, AWS Cloud, monitoring, observability, and production support. The ideal candidate will be responsible for ensuring application availability, system reliability, performance optimization, and operational excellence across enterprise-scale environments.
This role requires hands-on experience in application monitoring, incident management, troubleshooting, automation, and collaboration with Development, DevOps, and Infrastructure teams to maintain highly available and resilient systems.
Key Responsibilities
Monitoring & Observability
- Design, develop, and maintain Dynatrace dashboards, alerts, monitoring profiles, and observability solutions.
- Configure and manage application performance monitoring (APM), infrastructure monitoring, and distributed tracing.
- Create and maintain operational dashboards, reports, and service health metrics.
- Establish proactive alerting and monitoring strategies to identify issues before they impact users.
Production Support & Incident Management
- Monitor application and infrastructure performance to identify bottlenecks, anomalies, and system issues.
- Investigate and resolve production incidents, defects, and performance-related problems.
- Participate in critical incident management and on-call support rotations.
- Perform Root Cause Analysis (RCA) and implement corrective and preventive actions.
- Ensure adherence to SLAs, SLOs, and operational excellence standards.
AWS Cloud & Infrastructure Reliability
- Support and maintain cloud-native applications hosted on AWS.
- Analyze system performance, scalability, and reliability within AWS environments.
- Collaborate with infrastructure teams to optimize cloud resources and improve system resilience.
- Support high-availability and disaster recovery strategies.
DevOps & Automation
- Support CI/CD deployments, release management activities, and production rollouts.
- Collaborate with DevOps teams to improve deployment automation and operational efficiency.
- Automate monitoring, reporting, and operational tasks using scripting and automation tools.
- Continuously improve system reliability through automation and process optimization.
Collaboration & Documentation
- Work closely with Development, QA, DevOps, and Infrastructure teams to improve application stability.
- Document troubleshooting procedures, monitoring configurations, runbooks, and standard operating procedures.
- Participate in architecture and design discussions to improve system observability and reliability.
Required Skills & Qualifications
Core SRE & Production Support
- Strong experience as a Site Reliability Engineer (SRE), Production Support Engineer, Application Support Engineer, or DevOps Engineer.
- Hands-on experience supporting business-critical production environments.
- Strong understanding of Incident Management, Problem Management, Change Management, and RCA processes.
Monitoring & Observability
- Hands-on experience with Dynatrace.
- Expertise in dashboard creation, alert configuration, performance monitoring, and observability practices.
- Experience with monitoring and troubleshooting application, infrastructure, and cloud performance issues.
- Exposure to tools such as Grafana, Splunk, CloudWatch, AppDynamics, Prometheus, or similar monitoring platforms is preferred.
AWS Cloud
- Strong hands-on experience with AWS Cloud services.
- Understanding of cloud architecture, monitoring, logging, and performance optimization.
- Experience supporting cloud-native and distributed applications.
DevOps & CI/CD
- Experience with CI/CD tools and DevOps practices.
- Familiarity with deployment pipelines, release processes, and production change management.
- Experience working in Agile and DevOps environments.
Application & System Knowledge
- Good understanding of APIs, Microservices Architecture, and Distributed Systems.
- Experience troubleshooting application performance and infrastructure-related issues.
- Understanding of networking fundamentals, web servers, and application architectures.
Scripting & Automation
- Working knowledge of Python, Bash, Java, or similar scripting/programming languages.
- Experience automating operational and monitoring tasks is preferred.
Nice to Have
- Experience with Kubernetes and Docker.
- Exposure to Infrastructure as Code (Terraform, CloudFormation).
- Knowledge of Site Reliability Engineering best practices.
- Experience supporting high-volume enterprise applications.
- ITIL Foundation certification or equivalent.