Software Engineer (Site Reliability Engineer)

2-4 Years

Job Description

Roles & Responsibilities:

Drive architectural and technical service improvements that cut across all areas, working with different teams.
Call out major reliability risks and issues, using data to make informed decisions and drive mitigation plans with different teams.
Design and build features and tools to support performance and capacity planning.
Monitor and continually improve the capacity of our production infrastructure in line with application performance.
Identify and define SLOs, SLAs, and SLIs from a reliability perspective, and ensure an automation- and analytics-driven approach.
Identify sources to gather diagnostic information and provide solutions.
Improve engineering standards, tooling, and processes, working closely with multiple teams.
Challenge the status quo and show the determination to lead through change with a bold, fail-fast mentality.

Expectations from the candidate:

2 - 3 years of experience
Professional experience with any of the major cloud environments (Azure, AWS, or Google Cloud)
Experience ensuring the resilience of scalable cloud-native solutions
Experience working with infrastructure-as-code tools such as Terraform, Ansible, Puppet, and Chef
Experience working with CI/CD tools such as Jenkins, Git, and GitHub
Experience working with SQL and NoSQL databases
Exposure to centralized logging solutions such as Graylog, Splunk, and ELK
Exposure to active monitoring solutions such as Grafana, New Relic, Datadog, and Prometheus
Strong scripting skills (Bash, PowerShell)
Strong knowledge of container technologies such as Docker and Kubernetes
Knowledge of networking principles & understanding of IT security best practices [firewalls, load balancing, routing and switching]
Knowledge of horizontal and vertical scaling best practices
Good interpersonal, communication and organizational skills

Must have Skills:

Good to have Skills:

Knowledge of networking principles & understanding of IT security best practices [firewalls, load balancing, routing and switching]
Exposure to active monitoring solutions such as Grafana, New Relic, Datadog, and Prometheus
Knowledge of horizontal and vertical scaling best practices