Roles and Responsibilities :
- Own the infrastructure and APM, and work with developers and systems engineers to build, release, monitor and run services reliably, exceeding the agreed SLAs.
- Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python
- Write automation to reduce toil and eliminate manual tasks that are repeatable.
- Work with Ansible, Puppet, Chef, Terraform or another config management / orchestration suite, know where it is broken, work towards fixing it, and explore new alternatives
- Define and accelerate implementation of support processes, tools and best practices
- Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability
- Handle cross-team performance issues end to end: identify the cause, determine the areas of improvement and drive those actions to closure
- Baseline the performance and maturity of systems, tool maturity and coverage, metrics, technology and engineering practices
- Define, measure and improve reliability metrics (SLO/SLI), observability (monitoring, logging and tracing solutions) and ops processes (incident and problem management), and streamline and automate release management.
- Build dashboards to provide visibility into performance of the applications.
- Create chaos in the production environment purposefully and in a controlled manner to validate the reliability of systems.
- Mentor and coach other SREs in the organization
- Provide written and verbal updates to executives and the stakeholders of the application in the organization.
- Understand the current processes and system setup, and propose the improvements needed in processes and technology so that the application exceeds the desired Service Level Objective.
- Be a strong believer in automation as a way to bring sustained continuous improvement: automate toil and runbooks, and improve the ability of applications to auto-heal, leading to improved reliability
Must Have Skills :
The successful candidate will have the following attributes/qualifications :
- 15+ years of experience in Development and Operations of applications/services in production that have uptime over 99.9%.
- 8+ years of experience as an SRE handling web-scale applications
- Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, Bash, etc.
- Good understanding of observability (monitoring, logging, tracing, metrics) and chaos engineering concepts.
- Proficiency in using observability tools (for example, New Relic, Datadog, etc.) for monitoring, logging and tracing.
- Expert-level hands-on knowledge of a public cloud platform (AWS and/or Google Cloud Platform). A professional-level certification on one of the public clouds is highly desirable.
- Must have hands-on experience in using configuration management systems such as Ansible or SaltStack and infrastructure automation tools like Terraform or CloudFormation.
- Should have used alerting systems such as PagerDuty.
- Should have implemented solutions around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for services.
- SLI/SLO measurement should have covered both individual systems and behavior across systems in a distributed environment
- Should have supported Production Incidents (PIs) on a company's critical applications, including troubleshooting, debugging and diagnosing operational issues and driving them to closure.
- Understanding of software delivery life cycles, particularly Agile/Lean & DevOps
- Proven experience in handling large-scale and growing infrastructure across Data Centers and heterogeneous Cloud platforms
- Experience as a service owner in managing a large, geographically diverse group of stakeholders
- Ability to work with a creative, fast-growing engineering team and motivate its members to deliver their best work
- History of driving innovation.
Good to Have Skills :
- Familiarity with handling :
- Containerization: Kubernetes, Docker, Rancher, etc.
- Kafka, YARN, Elasticsearch, etc.
- Source code management and implementation of security best practices.
- Tech Stack - Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce.
- Networking knowledge
- Understanding of software delivery life cycles, particularly Agile/Lean & DevOps
- Contribution to open source community
- Qualification : Master's or Bachelor's degree in Computer Science Engineering, or a related technical degree.