Work with development teams to ensure that applications have scalability and reliability built in from day one - agile is second nature to you, and you're excited to work in scrum teams and represent the SRE perspective
Design and enhance software architecture to improve scalability, service reliability, cost, and performance - you've helped create services that are critical to their customers' success
Deploy automation for provisioning and operating infrastructure at large scale - you are experienced in Infrastructure as Code concepts and have put them into production
Partner with teams to improve CI/CD processes and technology - helping teams deliver value early is what you strive for
Mentor members of the staff on large-scale cloud deployments - you're an expert in deploying in the cloud and can bring a teaching mindset to help others benefit from your experience
Drive the adoption of observability practices and a data-driven mindset - you love metrics, graphs, and gaining a deep understanding of why things happen in a system, and you help others gain visibility into the things they build
Set up processes such as on-call rotations, postmortems, and runbooks to continue supporting the infrastructure owned by the SRE team while finding ways to reduce time to resolution and improve the reliability of services
Support, optimize, and deploy mission-critical front-end and back-end production systems
Improve site performance, monitoring, and the overall stability of our infrastructure
Qualifications
Your Experience
Bachelor's/Master's degree in Computer Science or a related field, or equivalent military experience, required
5+ years of industry experience in engineering
Fluent scripting skills, preferably in Python or Bash
3+ years of working with microservices architectures on Kubernetes
Hands-on experience with container-native tools like Helm, Istio, and Vault for managing workloads running in Kubernetes
Experience with public cloud (AWS, GCP/Google Cloud, or Azure) at medium to large scale
Proficient in CI/CD platforms like GitLab CI, Jenkins, CircleCI, etc.
In-depth knowledge of operating systems (processes, threads, concurrency, etc.)
Extensive experience working with Unix/Linux systems from kernel to shell and beyond
Drive enhancement of observability by implementing distributed tracing, logging standards, dashboard standardization, profiling, and other relevant practices to meet current Service Level Objectives (SLOs)
Hands-on experience with monitoring tools such as Prometheus and Grafana
Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
Experience with RabbitMQ, Kafka, and Postgres tuning and performance is a huge plus
Lead the long-term strategy on critical components like Kafka, Elasticsearch, Postgres, MongoDB, etc., evaluating options for either reliable self-hosted or managed solutions - hands-on production experience with at least one of these is required
An exceptional communicator within and across teams who takes the lead