Job Description: Site Reliability Engineer
For this position, we're looking for talented and experienced engineers who have a passion for infrastructure and automation.
As a Site Reliability Engineer (SRE), you will work within the development team, combining software and systems engineering to run large-scale distributed systems. You will also maintain the capacity and performance of the client's systems.
Responsibilities
- Taking part in architecture-level discussions, design, planning, and implementation.
- Researching alternatives to ensure that what we are building remains the best path forward.
- Documenting each project to facilitate integration for users.
- Driving proofs of concept and minimum viable products for demonstration.
- Designing and delivering Infrastructure as Code.
- Developing and implementing automation for routine tasks, including alerting, system monitoring, and response mechanisms.
- Developing and maintaining dashboards for monitoring and observability.
- Supporting multiple services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Managing incidents and participating in the on-call rotation.
Education And Experience
- To succeed in this role, candidates must have strong foundational knowledge of, and demonstrated proficiency in, Linux/Unix (Talos).
- At least five years of experience as an SRE or in a similar role, such as DevOps or Software Engineer.
- At least two years of programming experience in a conventional programming language.
- Kubernetes knowledge is required. Experience with bare-metal or non-managed Kubernetes is a plus.
- Experience in Python and other scripting languages.
- Experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Helm, Puppet, or Chef).
- Networking and cloud computing platform experience.
- Proficiency in scripting and programming languages (e.g., Bash, Python, Go, Node, Java, or similar).
- Familiarity with monitoring, logging, and alerting tools (e.g., Prometheus, Grafana, ELK Stack, or similar).
- Experience with Grafana Mimir.
- Familiarity with CI/CD tools and SDLC practices.
- You have strong problem-solving and communication skills.
- You can work independently as well as collaboratively in a remote team environment.
You are friendly, collaborative, humble, honest, and always s