Job Description
Project Role : Quality Engineering Lead
Project Role Description : Leads a team of quality engineers through multi-disciplinary team planning and ecosystem integration to accelerate delivery and drive quality across the application lifecycle. Applies business and functional knowledge to develop end-to-end testing strategies through the use of quality processes and methodologies. Applies testing methodologies, principles and processes to define and implement key metrics to manage and assess the testing process including test execution and defect resolution.
Must have skills : Cloud Resilience Quality Engineering
Good to have skills : NA
Minimum 5 years of experience is required
Educational Qualification : 15 years of full-time education
Summary:
As a Senior Resilience & Chaos Engineering Specialist, you will be responsible for validating and strengthening system reliability by introducing controlled failure experiments within modern DevOps delivery pipelines. You will work closely with DevOps, SRE, and platform teams to design chaos experiments, validate system behavior under failure scenarios, and ensure applications remain resilient, scalable, and recoverable under real-world conditions. Your role will focus on embedding resilience validation into CI/CD pipelines and improving system reliability metrics such as MTTR, availability, and fault tolerance.
Roles & Responsibilities:
Act as SME for Resilience Engineering and Chaos Testing practices within delivery teams.
Conduct resilience assessments to identify failure modes, dependencies, and system vulnerabilities.
Design and execute chaos experiments across application, infrastructure, and Kubernetes environments.
Integrate chaos testing into CI/CD pipelines to validate resilience continuously during deployments.
Validate failover behavior, recovery time (MTTR), and system stability under fault conditions.
Analyze system behavior using observability tools (metrics, logs, traces) to identify reliability improvements.
Collaborate with DevOps, SRE, and platform engineering teams to improve reliability architecture.
Build and maintain reusable chaos experiment libraries, resilience playbooks, and automation frameworks.
Drive adoption of resilience engineering best practices across projects.
Professional & Technical Skills:
Must Have Skills: Chaos Engineering tools (LitmusChaos, Gremlin, Harness Chaos Engineering, Chaos Toolkit).
Strong experience with Kubernetes, Docker, and container-based architectures.
Experience integrating chaos testing within CI/CD pipelines using Jenkins, GitHub Actions, GitLab, or Harness.
Hands-on experience with observability platforms such as Prometheus, Grafana, OpenTelemetry, Dynatrace, Datadog, AppDynamics, or New Relic.
Working knowledge of cloud platforms (AWS / Azure / GCP) and distributed system architecture.
Experience with performance testing tools (JMeter, k6, Gatling, or Locust) to validate system behavior under load.
Programming or scripting experience in Python, Go, or Java.
Understanding of SRE practices, reliability metrics, and incident analysis.
Additional Information:
Minimum of 8 years of experience in DevOps, SRE, Reliability Engineering, or Performance Engineering.
Certifications in Cloud (AWS/Azure/GCP), Kubernetes (CKA/CKAD), Chaos Engineering, or Observability are preferred.
This position is based at the Bangalore office only.
15 years of full-time education is required.