Job Details:
Job Title: Lead Site Reliability Engineer (SRE)
Duration: Contract to Hire (On the Payroll of Datum Technology Group)
Location: Chennai, Mumbai, or Gurugram
Interview Process: Virtual (2 rounds), plus 1 technical screening.
Job Description:
- We are seeking a highly skilled and experienced Lead Site Reliability Engineer (SRE) to drive reliability, scalability, and performance across our cloud infrastructure, with a strong emphasis on cloud security, compliance, networking, and operating systems expertise.
- This role blends reliability engineering with security best practices to ensure our cloud infrastructure is not only scalable and resilient but also secure and compliant.
Responsibilities:
- Develop and maintain Infrastructure as Code (IaC) using Terraform, including advanced module design and best practices for highly complex environments.
- Design and optimize CI/CD pipelines with a focus on automation, scalability, and deployment efficiency, drawing on pipeline optimizations implemented in prior roles.
- Collaborate with development teams to integrate security and observability tools into CI/CD pipelines, automating security checks.
- Troubleshoot and debug networking issues across cloud and hybrid environments, applying a deep understanding of networking layers, components, and configurations.
- Administer and optimize Linux-based operating systems, including troubleshooting, performance tuning, and implementing best practices for security and reliability.
- Address vulnerabilities in code libraries and infrastructure (e.g., OS packages) through patching and remediation.
- Partner with application teams to resolve specific security findings and improve overall system resilience.
Requirements:
- 9+ years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Engineering.
- Some experience leading or managing a team of engineers.
- Deep knowledge of networking fundamentals, Linux operating systems, and CI/CD optimization strategies.
- Very strong expertise in writing complex Terraform code, including advanced module design and best practices for large-scale, highly complex environments.
- Proficiency in scripting or programming languages (e.g., Python, Bash, Go).
- Hands-on experience with the Azure cloud platform.
Bonus/Preferred Skills:
- Experience with Docker and Kubernetes for containerization and orchestration.