Senior Staff Site Reliability Engineer

Movius

Bengaluru, India

8-10 Years

Job Description

Job Title: Senior Staff Site Reliability Engineer

Location: Bangalore

About Movius

At Movius, we solve a critical gap companies face with employee-to-client communication over voice and messaging. We are the leading global provider of Secure Communication as a Service (SCaaS). Our flagship solution, MultiLine, enhances workflows, resolves compliance gaps and unifies cross-channel messaging. Movius AI-powered solutions enable businesses to build strong and lasting relationships with their customers in a company-owned, controllable system. Welcome to Phone 3.0.

Headquartered in Alpharetta, GA, with offices in Silicon Valley, New York, London, and Bangalore (India), Movius partners with leading global wireless carriers such as T-Mobile, Vodafone, TELUS, BT, and Singtel. To learn more about Movius, visit www.movius.ai.

Your Opportunity

We are looking for a Senior Staff Site Reliability Engineer (SRE) with strong technical expertise in distributed systems, cloud infrastructure, observability, and automation.

In this role, you will be responsible for improving the reliability, scalability, and performance of our production and pre-production systems. You will work hands-on to design and implement SRE frameworks, automate key reliability workflows, and build a culture of operational excellence.

You will also work closely with product engineering, QA, and DevOps teams to define SLOs/SLIs, enhance monitoring and alerting, and strengthen our overall reliability practices.

What You'll Do

Reliability Engineering & Architecture

Design and maintain highly available, fault-tolerant systems on AWS.
Implement service reliability models based on SLOs, SLIs, and error budgets.
Continuously improve system performance, scalability, and resilience.

Automation & Infrastructure-as-Code (IaC)

Build and maintain automation pipelines using Terraform, Ansible, Bitbucket, and Jenkins.
Develop reusable IaC modules for multi-account and multi-environment AWS setups.
Automate operational processes for provisioning, scaling, monitoring, and recovery.

Observability & Monitoring

Define observability standards and create dashboards using Elastic Stack, Grafana, or Prometheus.
Implement intelligent alerting using AIOps and anomaly detection tools.
Work with development teams to ensure proper telemetry and trace coverage.

Incident Management & RCA

Lead major incident response and ensure quick service restoration.
Conduct blameless post-incident reviews and implement preventive actions.
Create and maintain runbooks, escalation matrices, and reliability playbooks.

Performance & Capacity Planning

Analyse performance bottlenecks and propose tuning or optimization strategies.
Lead capacity forecasting and ensure the system can handle growth demands.

Collaboration & Mentorship

Partner with development, QA, and DevOps teams to embed SRE principles.
Coach and mentor junior engineers on reliability engineering and automation.

Documentation & Knowledge Management

Maintain detailed architecture diagrams, design documents, and operational procedures.
Document SLOs, automation workflows, and change management reports.

Technical Leadership

Lead technical discussions, reliability reviews, and performance retrospectives.
Promote a code-driven, automation-first reliability culture across teams.

What You Bring

Education

Bachelor's degree in Computer Science or Information Technology, or equivalent experience.

Experience

8+ years in SRE or DevOps roles managing large-scale distributed systems.
Proven hands-on experience in cloud operations (AWS preferred), automation, and CI/CD pipelines.
Experience in the Telecom domain is an added advantage.

Technical Skills

Deep knowledge of AWS (EKS, EC2, RDS, IAM, VPC, CloudWatch, API Gateway, Lambda, WAF, KMS) and Kafka.
Strong Linux administration and networking fundamentals.
Skilled in Terraform, Jenkins, Git, and scripting (Python, Bash).
Solid understanding of observability tools (Grafana, Elastic Stack, Prometheus).
Experience with container orchestration (Kubernetes) and microservices-based systems.

Certifications (Preferred)