Site Reliability Architect - Configuration Management

15-17 Years
  • Posted 3 days ago
  • Over 50 applicants

Job Description

Roles and Responsibilities :

  • Own the infrastructure and APM, and work with developers and systems engineers to build, release, monitor and run services reliably, exceeding the agreed SLAs.
  • Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python
  • Write automation to reduce toil and eliminate manual tasks that are repeatable.
  • Work with Ansible, Puppet, Chef, Terraform or another config management / orchestration suite, know where it is broken, work towards fixing the gaps and explore new alternatives
  • Define and accelerate implementation of support processes, tools and best practices
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability
  • Handle cross-team performance issues end to end: identify the cause, determine the areas of improvement and drive the resulting actions to closure
  • Baseline the performance and maturity of systems, tool maturity and coverage, metrics, technology and engineering practices
  • Define, measure and improve reliability metrics (SLO/SLI), observability (monitoring, logging and tracing solutions) and ops processes (incident and problem management), and streamline and automate release management.
  • Build dashboards to provide visibility into performance of the applications.
  • Purposefully create chaos in the production environment in a controlled manner to validate the reliability of systems.
  • Mentor and coach other SREs in the organization
  • Provide written and verbal updates to executives and the stakeholders of the application in the organization.
  • Understand the current processes and system setup, and propose the process and technology improvements needed for the application to exceed the desired Service Level Objective.
  • Be a strong believer in automation as a source of sustained continuous improvement: automate toil and runbooks, and improve the applications' ability to auto-heal, leading to improved reliability

Must Have Skills :

The successful candidate will have the following attributes/qualifications:

  • 15+ years of experience in development and operations of production applications/services with uptime over 99.9%.
  • 8+ years of experience as an SRE handling web-scale applications
  • Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, Bash, etc.
  • Good understanding of observability (monitoring, logging, tracing, metrics) and chaos engineering concepts.
  • Proficiency in using observability tools (for example, New Relic, Datadog, etc.) for monitoring, logging and tracing.
  • Expert-level hands-on knowledge of the public cloud platforms AWS and/or Google Cloud Platform. A professional-level certification on one of the public clouds is highly desirable.
  • Must have hands-on experience in using configuration management systems such as Ansible or SaltStack and infrastructure automation tools like Terraform or CloudFormation.
  • Should have used alerting systems such as PagerDuty.
  • Should have implemented solutions around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for services.
  • SLI/SLO measurement should have covered both individual systems and interactions across systems in a distributed environment
  • Should have supported Production Incidents (PIs) on critical applications of a company. Troubleshoot, debug, and diagnose operational issues and drive them to closure.
  • Understanding of software delivery life cycles, particularly Agile/Lean & DevOps
  • Proven experience in handling large-scale and growing infrastructure across data centers and heterogeneous cloud platforms
  • Experience as a service owner managing large, geographically diverse groups of stakeholders
  • Ability to work with a creative, fast-growing engineering team and motivate its members to deliver their best work
  • History of driving innovation.

Good to Have Skills :

  • Familiarity with handling :
  • Containerization: Kubernetes, Docker, Rancher, etc.
  • Kafka, YARN, Elasticsearch, etc.
  • Source code management and implementation of security best practices.
  • Tech stack: Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce.
  • Networking knowledge
  • Understanding of software delivery life cycles, particularly Agile/Lean & DevOps
  • Contribution to open source community
  • Qualification: Master's or Bachelor's degree in Computer Science Engineering, or a related technical degree.

More Info

Open to candidates from: Indian

About Company

We are Nomiso, a software co-engineering company and your partner in solving complex business problems with technology. We collaborate with our clients to understand business- and industry-specific challenges, and engage with them to push the boundaries of what's possible and deliver impact at pace and scale.
At Nomiso, we are not just delivering innovative solutions; we are Co-engineering Excellence.

Job ID: 120558757