
We are seeking a highly skilled Infrastructure Reliability & Operations Engineer with strong private cloud experience and a minimum of 5 years in infrastructure reliability, operations, or site reliability engineering. The ideal candidate will be responsible for designing, implementing, and maintaining fault-tolerant infrastructure while driving automation, observability, and reliability across mission-critical systems.
You will collaborate with DevOps, development, and security teams to ensure seamless deployments, optimize performance, and uphold the highest standards of security and compliance. This role requires a proactive mindset, technical expertise, and a passion for building resilient systems.
Key Responsibilities:
Design and maintain highly available, scalable, and secure infrastructure
Lead incident response, root cause analysis, and post-incident reviews
Develop automation tools and apply Infrastructure as Code (Terraform, Ansible, CloudFormation)
Build self-healing systems and streamline operational workflows
Support CI/CD pipelines and containerized platforms (Docker, Kubernetes, OpenShift)
Implement monitoring, logging, and alerting systems (Prometheus, Grafana, ELK, Datadog)
Define and track SLIs, SLOs, and SLAs for system reliability
Collaborate with security teams on vulnerability management and compliance
Required Skills & Qualifications:
Strong experience in Linux/Unix system administration
Proficiency in Python, Go, Bash, or similar programming and scripting languages
Hands-on experience with AWS, Azure, or GCP
Expertise in containerization & orchestration technologies
Solid understanding of networking concepts (DNS, TCP/IP, load balancing, firewalls)
Experience with monitoring, logging, and alerting tools
Job ID: 144184433