Core Responsibilities
- Automation & Coding: Developing scripts and tools (Python, Go, Java, Bash) to automate operational tasks and eliminate manual, repetitive work.
- System Monitoring & Alerting: Using tools like Prometheus, Grafana, Datadog, or the ELK Stack to monitor system health, latency, and error rates.
- Incident Management: Responding to production incidents, performing root cause analysis (RCA), and conducting blameless post-mortems.
- Capacity Planning & Scaling: Managing infrastructure capacity and performance to ensure scalability, often using cloud platforms like AWS, GCP, or Azure.
- Collaboration: Working with development teams to improve service performance, reliability, and deployment procedures.
Required Skills and Qualifications
- BE/B.Tech with 10+ years of experience as an SRE
- Willing to take a contract role in Pune with rotational shifts (4 AM, 2 PM)
- Programming: Proficiency in at least one scripting or programming language, such as Python, Go, or Ruby.
- Infrastructure & Tools: Experience with Kubernetes, Docker, and Infrastructure as Code (IaC) tools like Terraform or Ansible.
- System Administration: Strong knowledge of Linux/Unix operating systems and networking protocols (TCP/IP, DNS).
- Background: A degree in Computer Science or equivalent experience is typical, often combined with a background in software development or system administration.
Typical SRE Job Profile Summary
- Role: Site Reliability Engineer
- Industry: Technology, Software Development, Finance
- Experience Level: Mid-Senior Level
- Key Focus: Reliability, Automation, Performance