Creospan Private Limited

Site Reliability Engineer

  • Posted 23 hours ago

Job Description

Site Reliability Engineer

Pune - Kharadi (Hybrid, 3 days/week in office)

Full-time - Creospan

Role Overview

We are seeking a highly motivated Site Reliability Engineer (SRE) with strong expertise in Dynatrace, AWS Cloud, monitoring, observability, and production support. The ideal candidate will be responsible for ensuring application availability, system reliability, performance optimization, and operational excellence across enterprise-scale environments.

This role requires hands-on experience in application monitoring, incident management, troubleshooting, automation, and collaboration with Development, DevOps, and Infrastructure teams to maintain highly available and resilient systems.

Key Responsibilities

Monitoring & Observability

  • Design, develop, and maintain Dynatrace dashboards, alerts, monitoring profiles, and observability solutions.
  • Configure and manage application performance monitoring (APM), infrastructure monitoring, and distributed tracing.
  • Create and maintain operational dashboards, reports, and service health metrics.
  • Establish proactive alerting and monitoring strategies to identify issues before they impact users.

Production Support & Incident Management

  • Monitor application and infrastructure performance to identify bottlenecks, anomalies, and system issues.
  • Investigate and resolve production incidents, defects, and performance-related problems.
  • Participate in critical incident management and on-call support rotations.
  • Perform Root Cause Analysis (RCA) and implement corrective and preventive actions.
  • Ensure adherence to SLAs, SLOs, and operational excellence standards.

AWS Cloud & Infrastructure Reliability

  • Support and maintain cloud-native applications hosted on AWS.
  • Analyze system performance, scalability, and reliability within AWS environments.
  • Collaborate with infrastructure teams to optimize cloud resources and improve system resilience.
  • Support high-availability and disaster recovery strategies.

DevOps & Automation

  • Support CI/CD deployments, release management activities, and production rollouts.
  • Collaborate with DevOps teams to improve deployment automation and operational efficiency.
  • Automate monitoring, reporting, and operational tasks using scripting and automation tools.
  • Continuously improve system reliability through automation and process optimization.

Collaboration & Documentation

  • Work closely with Development, QA, DevOps, and Infrastructure teams to improve application stability.
  • Document troubleshooting procedures, monitoring configurations, runbooks, and standard operating procedures.
  • Participate in architecture and design discussions to improve system observability and reliability.

Required Skills & Qualifications

Core SRE & Production Support

  • Strong experience as a Site Reliability Engineer (SRE), Production Support Engineer, Application Support Engineer, or DevOps Engineer.
  • Hands-on experience supporting business-critical production environments.
  • Strong understanding of Incident Management, Problem Management, Change Management, and RCA processes.

Monitoring & Observability

  • Hands-on experience with Dynatrace.
  • Expertise in dashboard creation, alert configuration, performance monitoring, and observability practices.
  • Experience with monitoring and troubleshooting application, infrastructure, and cloud performance issues.
  • Exposure to tools such as Grafana, Splunk, CloudWatch, AppDynamics, Prometheus, or similar monitoring platforms is preferred.

AWS Cloud

  • Strong hands-on experience with AWS Cloud services.
  • Understanding of cloud architecture, monitoring, logging, and performance optimization.
  • Experience supporting cloud-native and distributed applications.

DevOps & CI/CD

  • Experience with CI/CD tools and DevOps practices.
  • Familiarity with deployment pipelines, release processes, and production change management.
  • Experience working in Agile and DevOps environments.

Application & System Knowledge

  • Good understanding of APIs, Microservices Architecture, and Distributed Systems.
  • Experience troubleshooting application performance and infrastructure-related issues.
  • Understanding of networking fundamentals, web servers, and application architectures.

Scripting & Automation

  • Working knowledge of Python, Bash, Java, or similar scripting/programming languages.
  • Experience automating operational and monitoring tasks is preferred.

Nice to Have

  • Experience with Kubernetes and Docker.
  • Exposure to Infrastructure as Code (Terraform, CloudFormation).
  • Knowledge of Site Reliability Engineering best practices.
  • Experience supporting high-volume enterprise applications.
  • ITIL Foundation certification or equivalent.

More Info

Job ID: 148875713
