DevOps and Site Reliability Engineer
Experience: 2-5 Years
Location: Onsite - Hybrid
About Lyzr
At Lyzr, we aren't just building infrastructure; we're architecting the backbone of the GenAI revolution. We are looking for a Cloud Reliability & DevOps Architect who thrives at the intersection of automation and operational excellence.
Role Overview
This is a hybrid role on our engineering team for an adaptable professional who combines infrastructure automation (DevOps) with operational excellence (SRE).
You will be responsible for architecting resilient multi-cloud environments, automating complex delivery pipelines, and ensuring the reliability and cost-efficiency of our production systems. From writing modular Terraform code to leading deep-dive Root Cause Analysis (RCA), you will own the entire lifecycle of our infrastructure.
Key Responsibilities
- IaC & Automation Architecture
- Advanced Development: Architect and maintain complex infrastructure using Terraform (multi-cloud) and AWS CloudFormation.
- Modular Design: Create reusable, version-controlled modules to standardize deployments and eliminate code duplication.
- Eliminate Toil: Apply SRE principles to automate repetitive operational tasks and manual provisioning through Python, Bash, or Go.
- Multi-Cloud Operations & Connectivity
- Core Management: Optimize production environments across AWS (EC2, EKS, Lambda, VPC) and Azure (VMs, VNet, Functions).
- Cross-Cloud Networking: Design secure connectivity solutions between disparate cloud providers and on-premise systems.
- System Reliability & Observability
- End-to-End Ownership: Own the health of production systems, ensuring High Availability (HA) and meeting strict SLOs/SLIs.
- Incident Management: Lead the RCA process for outages and implement architectural changes to prevent recurrence.
- Observability Frameworks: Build and maintain comprehensive monitoring and alerting (Prometheus, Grafana, ELK Stack, CloudWatch) for early anomaly detection.
- Security, Compliance & FinOps
- Security by Design: Build infrastructure with strict IAM roles, secret management (HashiCorp Vault/KMS), and automated compliance checks (SOC2/ISO).
- Cost Optimization: Drive FinOps initiatives such as rightsizing instances, managing Reserved/Spot instances, and identifying idle resources to reduce waste.
- Disaster Recovery: Design and lead periodic DR failover drills to ensure business continuity.
- CI/CD & Performance Tuning
- Pipeline Ownership: Design end-to-end CI/CD pipelines (GitHub Actions, GitLab CI, or Jenkins) for seamless delivery.
- Self-Healing Systems: Implement auto-remediation workflows to resolve common system issues without human intervention.
Technical Qualifications
Must-Have Skills
- Experience: 2-5 years in SRE, DevOps, or Cloud Engineering roles.
- Cloud Mastery: Hands-on experience managing production workloads in AWS (Expert level) and Azure.
- IaC Proficiency: Expert-level knowledge of Terraform (state management, modules) and CloudFormation.
- Scripting: Strong automation skills in Python and Bash.
- Monitoring: Hands-on experience with Grafana, Prometheus, or Datadog.
Preferred Qualifications
- Containers: Experience with container orchestration on Kubernetes (EKS/AKS).
- Certifications: HashiCorp Certified: Terraform Associate, AWS Certified DevOps Engineer - Professional, or Microsoft Certified: DevOps Engineer Expert.
- Data: Understanding of database administration (PostgreSQL, MySQL, or DynamoDB).
Work Environment & Soft Skills
- Global Flexibility: We support clients across IST, GMT, and EST. You must be flexible with working hours for deployments and on-call rotations.
- Detective Mindset: You are relentless in debugging and won't stop until you find the root cause of a distributed system issue.
- Financial Awareness: You treat cloud resources as real money and take pride in running a lean, efficient infrastructure.
- Tech Agility: You are not married to one tool; you use the best tool for the job and pivot as technology evolves.