About the Role
We are looking for a highly skilled DevOps Engineer / Site Reliability Engineer (SRE) to build, manage, and scale reliable cloud infrastructure and deployment systems across our platforms. This role requires hands-on ownership of AWS environments, CI/CD pipelines, observability, incident response, security, and platform reliability.
You will work closely with engineering teams to build scalable infrastructure, improve deployment efficiency, strengthen security practices, and ensure high availability across production systems.
Key Responsibilities
CI/CD & Release Engineering
- Design, build, and manage end-to-end CI/CD pipelines from source commit to production deployment
- Implement automated testing, security gates, and controlled rollout strategies
- Standardize release processes including versioning, approvals, rollback triggers, and artifact management
- Implement blue/green and canary deployment strategies
- Integrate security scanning into deployment pipelines (SAST, DAST, dependency scans, secrets detection)
AWS Infrastructure & Platform Operations
- Provision and manage AWS environments across Dev, QA, Staging, and Production
- Build and maintain Infrastructure as Code using Terraform (preferred)
- Manage IAM, networking, secrets management, cost tagging, and configuration management
- Maintain environment parity between staging and production systems
Observability & Reliability Engineering
- Implement monitoring and alerting using Datadog and/or AWS CloudWatch
- Build dashboards, alerts, logging, and tracing systems
- Define and manage SLIs, SLOs, error budgets, and reliability standards
- Create runbooks, incident response documentation, and root cause analysis (RCA) processes
- Participate in production support and incident management
Backup, BCM & Disaster Recovery
- Manage backup strategies, restore drills, retention policies, and access controls
- Design and test Business Continuity Management and Disaster Recovery (BCM/DR) plans
- Define and maintain RTO/RPO standards and failover procedures
Security & DevSecOps
- Integrate security into the complete deployment lifecycle
- Implement least-privilege IAM and secure secrets management practices
- Improve platform security without slowing deployment velocity
- Ensure audit-friendly infrastructure and deployment standards
Required Skills & Experience
Must Have
- Strong hands-on AWS production experience (Compute, Networking, IAM, Monitoring)
- End-to-end CI/CD pipeline ownership and implementation
- Infrastructure as Code experience (Terraform preferred)
- Monitoring & observability using Datadog and/or CloudWatch
- Experience with SLI/SLO operations, RCA, and incident management
- Backup, restore, BCM, and DR planning experience
- Strong Linux and scripting skills (Bash and/or Python)
Good to Have
- Docker, ECS, EKS, or Kubernetes experience
- Blue/green deployments, canary releases, and rollback strategies
- Secrets management using AWS Secrets Manager / Parameter Store
- APM, distributed tracing, and centralized logging
- FinOps / AWS cost optimization exposure
- Experience with GitLab CI, Jenkins, or GitHub Actions
- Datadog advanced features (APM, RUM, synthetics)
- Experience working in regulated or high-availability environments
Experience & Qualification
- 6–10+ years of experience in DevOps / SRE / Cloud Operations / Platform Engineering
- B.E. / B.Tech / B.Sc. in Engineering or Computer Science, or equivalent practical experience
What You'll Work On
- Building scalable CI/CD and deployment systems
- Improving platform reliability and operational standards
- Managing AWS infrastructure and observability systems
- Strengthening security and automation practices
- Supporting production systems and incident response
- Driving infrastructure standardization across teams