DevOps and Site Reliability Engineer

Lyzr AI

Bengaluru, India

2-5 Years

Posted 5 hours ago

Job Description

Experience: 2-5 Years

Location: Onsite - Hybrid

About Lyzr

At Lyzr, we aren't just building infrastructure; we're architecting the backbone of the GenAI revolution. We are looking for a Cloud Reliability & DevOps Architect who thrives at the intersection of automation and operational excellence.

Role Overview

We are looking for a high-agility Cloud Reliability & DevOps Architect to join our engineering team. This hybrid role combines infrastructure automation (DevOps) with operational excellence (SRE).

You will be responsible for architecting resilient multi-cloud environments, automating complex delivery pipelines, and ensuring the absolute reliability and cost-efficiency of our production systems. From writing modular Terraform code to leading deep-dive Root Cause Analysis (RCA), you will own the entire lifecycle of our infrastructure.

Key Responsibilities

IaC & Automation Architecture
Advanced Development: Architect and maintain complex infrastructure using Terraform (multi-cloud) and AWS CloudFormation.
Modular Design: Create reusable, version-controlled modules to standardize deployments and eliminate code duplication.
Eliminate Toil: Apply SRE principles to automate repetitive operational tasks and manual provisioning through Python, Bash, or Go.

Multi-Cloud Operations & Connectivity
Core Management: Optimize production environments across AWS (EC2, EKS, Lambda, VPC) and Azure (VMs, VNet, Functions).
Cross-Cloud Networking: Design secure connectivity solutions between disparate cloud providers and on-premises systems.

System Reliability & Observability
End-to-End Ownership: Own the health of production systems, ensuring High Availability (HA) and meeting strict SLOs/SLIs.
Incident Management: Lead the RCA process for outages and implement architectural changes to prevent recurrence.
Observability Frameworks: Build and maintain comprehensive monitoring and alerting (Prometheus, Grafana, ELK Stack, CloudWatch) for early anomaly detection.

Security, Compliance & FinOps
Security by Design: Build infrastructure with strict IAM roles, secrets management (HashiCorp Vault/KMS), and automated compliance checks (SOC 2/ISO).
Cost Optimization: Actively drive FinOps initiatives: rightsizing instances, managing Reserved/Spot instances, and identifying idle resources to reduce waste.
Disaster Recovery: Design and lead periodic DR failover drills to ensure business continuity.

CI/CD & Performance Tuning
Pipeline Ownership: Design end-to-end CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) for seamless delivery.
Self-Healing Systems: Implement auto-remediation workflows to resolve common system issues without human intervention.

Technical Qualifications

Must-Have Skills:

Experience: 2-5 years in SRE, DevOps, or Cloud Engineering roles.
Cloud Mastery: Hands-on experience managing production workloads in AWS (Expert level) and Azure.
IaC Proficiency: Expert-level knowledge of Terraform (State management, Modules) and CloudFormation.
Scripting: Strong automation skills in Python and Bash.
Monitoring: Hands-on experience with Grafana, Prometheus, or Datadog.

Preferred Qualifications

Containers: Experience with Kubernetes (EKS/AKS) and orchestration.
Certifications: HashiCorp Certified: Terraform Associate or AWS/Azure DevOps Professional.
Data: Understanding of database administration (PostgreSQL, MySQL, or DynamoDB).

Work Environment & Soft Skills

Global Flexibility: We support clients across IST, GMT, and EST. You must be flexible with working hours for deployments and on-call rotations.
Detective Mindset: You are relentless in debugging and won't stop until you find the root cause of a distributed system issue.
Financial Awareness: You treat cloud resources as real money and take pride in running a lean, efficient infrastructure.
Tech Agility: You are not married to one tool; you use the best tool for the job and pivot as technology evolves.