Key Responsibilities
- Software & Automation: Develop, test, and maintain high-quality software, frameworks, and automation tools to improve system reliability and reduce manual effort.
- System Design: Collaborate with product and engineering teams to design scalable and resilient infrastructure solutions.
- Incident Management: Lead incident response efforts, troubleshoot issues, and participate in on-call rotations. Create and maintain runbooks.
- Observability & Monitoring: Implement observability strategies using tools like Grafana, Splunk, and other APM/monitoring solutions. Define SLIs/SLOs for applications and infrastructure.
- CI/CD & DevOps: Design and maintain CI/CD pipelines using GitHub Actions. Promote DevOps best practices across teams.
- Infrastructure Management: Build and optimize cloud and/or on-prem infrastructure with a focus on scalability, performance, and reliability.
- Security & Compliance: Ensure systems are secure and comply with industry standards. Collaborate with security teams to implement necessary controls.
- Collaboration & Reviews: Participate in code reviews, sprint ceremonies, and design discussions to ensure high-quality deliverables.
- Knowledge Sharing: Document processes and systems comprehensively. Mentor junior engineers and contribute to knowledge-sharing initiatives.
Required Skills & Qualifications
- Programming/Scripting: Proficiency in one or more languages like Python.
- Cloud Expertise: Good working knowledge of at least one major cloud platform (Microsoft Azure or GCP).
- CI/CD: Hands-on experience with GitHub Actions and version control using Git.
- Observability: Experience with monitoring tools such as Grafana, Splunk, and Prometheus.
- Agile Methodology: Solid understanding of Agile/Scrum processes.
- Incident & Problem Management: Willingness to work in operations and manage live issues.
- Certifications: Azure Fundamentals (AZ-900) or equivalent required.
- Soft Skills: Strong problem-solving, communication, and collaboration skills.