Lyzr AI

DevOps and Site Reliability Engineer

  • Posted 10 hours ago

Job Description

DevOps and Site Reliability Engineer

Experience: 2-5 Years

Location: Onsite - Hybrid

About Lyzr

At Lyzr, we aren't just building infrastructure; we're architecting the backbone of the GenAI revolution. We are looking for a Cloud Reliability & DevOps Architect who thrives at the intersection of automation and operational excellence.

Role Overview

We are looking for a Cloud Reliability & DevOps Architect to join our engineering team. This is a hybrid role that combines infrastructure automation (DevOps) with operational excellence (SRE).

You will be responsible for architecting resilient multi-cloud environments, automating complex delivery pipelines, and ensuring the absolute reliability and cost-efficiency of our production systems. From writing modular Terraform code to leading deep-dive Root Cause Analysis (RCA), you will own the entire lifecycle of our infrastructure.

Key Responsibilities

  • IaC & Automation Architecture
      • Advanced Development: Architect and maintain complex infrastructure using Terraform (multi-cloud) and AWS CloudFormation.
      • Modular Design: Create reusable, version-controlled modules to standardize deployments and eliminate code duplication.
      • Eliminate Toil: Apply SRE principles to automate repetitive operational tasks and manual provisioning through Python, Bash, or Go.
  • Multi-Cloud Operations & Connectivity
      • Core Management: Optimize production environments across AWS (EC2, EKS, Lambda, VPC) and Azure (VMs, VNet, Functions).
      • Cross-Cloud Networking: Design secure connectivity solutions between disparate cloud providers and on-premise systems.
  • System Reliability & Observability
      • End-to-End Ownership: Own the health of production systems, ensuring High Availability (HA) and meeting strict SLOs/SLIs.
      • Incident Management: Lead the RCA process for outages and implement architectural changes to prevent recurrence.
      • Observability Frameworks: Build and maintain comprehensive monitoring and alerting (Prometheus, Grafana, ELK Stack, CloudWatch) for early anomaly detection.
  • Security, Compliance & FinOps
      • Security by Design: Build infrastructure with strict IAM roles, secret management (HashiCorp Vault/KMS), and automated compliance checks (SOC2/ISO).
      • Cost Optimization: Actively drive FinOps initiatives: rightsizing instances, managing Reserved/Spot instances, and identifying idle resources to reduce waste.
      • Disaster Recovery: Design and lead periodic DR failover drills to ensure business continuity.
  • CI/CD & Performance Tuning
      • Pipeline Ownership: Design end-to-end CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) for seamless delivery.
      • Self-Healing Systems: Implement auto-remediation workflows to resolve common system issues without human intervention.

Technical Qualifications

Must-Have Skills:

  • Experience: 2-5 years in SRE, DevOps, or Cloud Engineering roles.
  • Cloud Mastery: Hands-on experience managing production workloads in AWS (Expert level) and Azure.
  • IaC Proficiency: Expert-level knowledge of Terraform (State management, Modules) and CloudFormation.
  • Scripting: Strong automation skills in Python and Bash.
  • Monitoring: Hands-on experience with Grafana, Prometheus, or Datadog.

Preferred Qualifications

  • Containers: Experience with Kubernetes (EKS/AKS) and orchestration.
  • Certifications: HashiCorp Certified: Terraform Associate or AWS/Azure DevOps Professional.
  • Data: Understanding of database administration (PostgreSQL, MySQL, or DynamoDB).

Work Environment & Soft Skills

  • Global Flexibility: We support clients across IST, GMT, and EST. You must be flexible with working hours for deployments and on-call rotations.
  • Detective Mindset: You are relentless in debugging and won't stop until you find the root cause of a distributed system issue.
  • Financial Awareness: You treat cloud resources as real money and take pride in running a lean, efficient infrastructure.
  • Tech Agility: You are not married to one tool; you use the best tool for the job and pivot as technology evolves.

Job ID: 144631693