Job Responsibilities
- Engage in and improve the lifecycle of services from conception to End-of-Life (EOL), including system design consulting and capacity planning.
- Define and implement standards and best practices for system architecture, service delivery, metrics, and the automation of operational tasks.
- Support services, product, and engineering teams by providing common tooling and frameworks to deliver increased availability and improved incident response.
- Improve system performance, application delivery, and efficiency through automation, process refinement, postmortem reviews, and in-depth configuration analysis.
- Collaborate closely with engineering professionals within the organization to deliver reliable services.
- Increase the operational efficiency, effectiveness, and quality of services by treating operational challenges as software engineering problems, with the aim of reducing toil.
- Guide junior team members and serve as a champion for Site Reliability Engineering.
- Actively participate in incident response, including on-call responsibilities.
- Partner with stakeholders to influence and help drive the best possible technical and business outcomes.
Required Qualifications
- Engineering degree, or a related technical discipline, or equivalent work experience.
- Experience coding in higher-level languages (e.g., Python, JavaScript, C++, or Java).
- Knowledge of cloud-based applications and containerization technologies.
- Demonstrated understanding of best practices in metric generation and collection, log aggregation pipelines, time-series databases, and distributed tracing.
- Working experience with industry-standard tools such as Terraform and Ansible.
- Demonstrable fundamentals in two of the following: computer science, cloud architecture, security, or network design.
(Experience, Education, Certification, License and Training)
- Must have at least 5 years of hands-on experience working in engineering or cloud.
- Minimum 5 years of experience with public cloud platforms (e.g., GCP, AWS, Azure).
- Minimum 3 years of experience in configuration and maintenance of applications and/or systems infrastructure for a large-scale customer-facing company.
- Experience with distributed system design and architecture.