Key Responsibilities
- Ensure high availability, performance, and reliability of production systems
- Define and manage SLIs, SLOs, and error budgets
- Lead incident response, root cause analysis (RCA), and post-incident reviews
- Proactively identify and mitigate reliability risks
Software Engineering & Automation
- Develop software solutions to automate operational tasks
- Build and maintain tools, frameworks, and platforms for deployment, monitoring, and reliability
- Reduce toil through automation and self-service systems
- Write clean, maintainable, and testable code
Infrastructure & Cloud
- Design and manage cloud-native infrastructure (AWS, Azure, or GCP)
- Implement Infrastructure as Code (IaC) using tools such as Terraform or CloudFormation
- Support containerized workloads using Docker and Kubernetes
- Optimize systems for scalability and cost efficiency
Observability & Performance
- Implement monitoring, logging, and tracing solutions
- Analyze system metrics to improve performance and capacity planning
- Establish dashboards and alerts aligned with business impact
CI/CD & DevOps Practices
- Build and maintain CI/CD pipelines
- Collaborate with development teams to improve release safety
- Promote best practices in testing, deployment, and rollback strategies
Collaboration & Culture
- Partner with product and engineering teams to design reliable solutions that adhere to architectural best practices
- Advocate for SRE best practices and a reliability-first mindset
- Contribute to documentation and knowledge sharing
Required Qualifications
- Bachelor's degree in Computer Science or equivalent
- Strong programming skills in Python, Java, Go, or similar languages
- Experience with distributed systems and microservices
- Hands-on experience with Linux/Unix systems
- Familiarity with cloud platforms (AWS, Azure, or GCP)
- Experience with containers and orchestration (Docker, Kubernetes)
- Knowledge of CI/CD tools (GitHub Actions, Jenkins, GitLab CI, etc.)
Preferred Qualifications
- Experience in an SRE or DevOps role
- Knowledge of service meshes, load balancing, and traffic management
- Experience with chaos engineering or resilience testing
- Background in security best practices (IAM, secrets management)
- Experience supporting regulated or mission-critical systems