Site Reliability Architect - Configuration Management

Nomiso

Remote

15-17 Years

Posted 3 days ago
Over 50 applicants

Job Description

Roles and Responsibilities :

Own the infrastructure and APM, and work with developers and systems engineers to build, release, monitor and run services reliably, exceeding the agreed SLAs.
Write software to automate API-driven tasks at scale and contribute to the product codebase in Java, JS, React, Node, Go and Python
Write automation to reduce toil and eliminate manual tasks that are repeatable.
Work with Ansible, Puppet, Chef, Terraform or another config management / orchestration suite, know where it is broken, work towards fixing those gaps, and explore new alternatives
Define and accelerate implementation of support processes, tools and best practices
Maintain services once they are live by measuring and monitoring availability, latency and overall system reliability
Handle cross-team performance issues end to end: identify the cause, determine the areas of improvement and drive those actions to closure
Baseline the performance and maturity of systems, tool maturity and coverage, metrics, technology and engineering practices
Define, measure and improve reliability metrics (SLO/SLI), observability (monitoring, logging and tracing solutions) and ops processes (incident and problem management), and streamline and automate release management.
Build dashboards to provide visibility into performance of the applications.
Create chaos in the production environment purposefully, in a controlled manner, to validate the reliability of systems.
Mentor and coach other SREs in the organization
Provide written and verbal updates to executives and the stakeholders of the application in the organization.
Understand the current process and system setup, and propose the improvements needed in processes and technology so that the application exceeds the desired Service Level Objective.
Be a strong believer in automation as a way to sustain continuous improvement: automate toil and runbooks, and improve the applications' ability to auto-heal, leading to improved reliability

Must Have Skills :

The successful candidate will have the following attributes/qualifications :
15+ years of experience in development and operations of applications/services in production that have uptime over 99.9%.
8+ years of experience as an SRE handling web-scale applications
Strong hands-on coding experience in one or more programming languages such as Python, Golang, Java, Bash, etc.
Good understanding of observability (monitoring, logging, tracing, metrics) and chaos engineering concepts.
Proficiency in using observability tools (for example, New Relic, Datadog, etc.) for monitoring, logging and tracing.
Expert-level hands-on knowledge of the public cloud platforms AWS and/or Google Cloud Platform. A professional-level certification on one of the public clouds is highly desirable.
Must have hands-on experience in using configuration management systems such as Ansible or SaltStack and infrastructure automation tools like Terraform or CloudFormation.
Should have used alerting systems such as PagerDuty.
Should have implemented solutions around Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for services.
These measurements should have covered both individual systems and interactions across systems in a distributed environment.
Should have supported Production Incidents (PIs) on critical applications of a company. Troubleshoot, debug, and diagnose operational issues and drive them to closure.
Understanding of software delivery life cycles, particularly Agile/Lean & DevOps
Proven experience in handling large-scale and growing infrastructure across data centers and heterogeneous cloud platforms
Experience as a service owner in managing large geographically diverse stakeholders
Ability to work with a creative, fast-growing engineering team and motivate it to deliver its best work
History of driving innovation.

Good to Have Skills :

Familiarity with handling :
Containerization: Kubernetes, Docker, Rancher, etc.
Kafka, YARN, Elasticsearch, etc.
Source code management and implementation of security best practices.
Tech stack: Python, Falcon, Elasticsearch, MongoDB, AWS (SQS, S3), MapReduce.
Networking knowledge
Understanding of software delivery life cycles, particularly Agile/Lean & DevOps
Contribution to open source community
Qualification : Master's or Bachelor's degree in Computer Science Engineering, or a related technical degree.

More Info

Job Type:

Permanent Job, Work From Home

Industry:

Software

Role:

Software Engineer / Programmer

Function:

Employment Type:

Full time

Open to candidates from:

Indian

About Company

Nomiso

We are Nomiso, a software co-engineering company and your partner in solving complex business problems with technology. We collaborate with our clients to understand business- and industry-specific challenges and engage with them to push the boundaries of what's possible to deliver impact at pace and scale.
At Nomiso, we are not just delivering innovative solutions; we are Co-engineering Excellence.

Job ID: 120558757

Last Updated: 18-03-2026 04:09:30 AM
