We at Innovaccer are looking for a Site Reliability Engineer II to build the most amazing product experience. You'll get to work with other engineers to build delightful feature experiences that understand and solve our customers' pain points. In this role you will be responsible for building and automating secure cloud infrastructure (Infrastructure as Code, IaC) across various pillars: Cost, Reliability, Scalability, Performance, etc.
A Day in the Life
- In this role you will design and architect solutions across various SRE domains.
- You will collaborate extensively with different teams and drive various initiatives and the adoption of SRE best practices.
- You will be responsible for building and automating secure cloud infrastructure (Infrastructure as Code, IaC) across various pillars: Cost, Reliability, Scalability, Performance, etc.
- Build the CICD stack in collaboration with the Dev and QA/Automation teams, and drive the organization to a new level of (daily/hourly) continuous delivery and deployment.
- Security is paramount to everything we do. You will work closely with the CISO and Dev team(s) to make security a first-class citizen. Develop S-CICD (Secure CICD), enable various security tool chains, and deliver vulnerability reports to developers via automation.
- Observability is critical at the scale of our systems, both for finding insights into system behavior and for detecting problems and failures. We are looking for leads to drive this charter across logs, metrics, mesh, tracing, etc.
- Collaborate closely with the Dev and QA teams to bring each initiative to closure, and increase adoption of DevOps practices and tool chains.
- Apply strong analytical skills to understand production system metrics, drive change, optimize system utilization and drive cost efficiency.
- Automatically scale the platform up and down to handle peak-season scenarios.
- Ensure that the platform is secured as per the guidelines established by the CISO, e.g., protect against DDoS attacks by implementing a WAF, manage vulnerabilities and patches, and install required security agents.
- Lead the adoption of least-privilege RBAC for various production services and tool chains.
- Build and execute the Disaster Recovery plan.
- Participate as a key stakeholder in Incident Response (IR).
What You Need
- 5+ years of experience as a DevOps/SRE Engineer.
- Solid experience with at least one cloud platform (AWS, Azure, or GCP) with an automation focus. Certification is an advantage.
- Hands-on experience with Kubernetes along with Linux.
- Programming experience with scripting languages, e.g., Python.
- Experience building and deploying scalable CICD architectures and solutions is preferred.
- Experience building an observability stack spanning logs, metrics, traces, service mesh, and data observability is preferred.
- Skilled at writing and structuring documentation for consumption by various dev teams.
- Cloud security experience is a major advantage and a highly preferred skill.
- Hands-on experience with a few of these is preferred: Kafka, Postgres, Snowflake, etc.
Preferred Skills
- Multi Cloud: AWS, Azure, GCP
- Distributed Compute: Kubernetes (EKS/AKS), Containerization
- Persistence Stores: Postgres, MongoDB
- Data Warehousing: Snowflake, Databricks
- Messaging: Kafka
- CICD: Jenkins, ArgoCD, GitOps
- Observability: Elasticsearch, Prometheus, Jaeger, New Relic, etc.