Cloud Operations & Reliability Lead

elevarae

Kolkata, India

5-8 Years

This job is no longer accepting applications

Posted 2 months ago

Job Description

We are seeking a Cloud Operations & Reliability Lead to support a leading IT R&D organization in Kolkata. This role ensures the stability, performance, and security of cloud-based systems while driving operational excellence through proactive monitoring, incident management, automation, and capacity planning. You will lead cross-functional teams, optimize cloud resources for cost efficiency, and champion automation to reduce manual effort and improve reliability.

Key Responsibilities

Cloud Operations & Reliability

Manage day-to-day operations across production, staging, and development cloud environments within an R&D context.
Ensure high availability of services through robust monitoring, alerting, and incident response processes.
Lead root cause analyses (RCA) and post-mortem reviews to drive continuous improvement.
Implement observability practices including logging, tracing, and metrics for proactive issue detection.
Oversee patch management and maintenance to ensure systems remain secure and up to date.

Automation & Optimization

Develop and maintain automation scripts for provisioning, scaling, and monitoring cloud resources.
Optimize cloud usage through rightsizing, reserved instances, and cost governance (FinOps).
Standardize operational runbooks and playbooks to streamline processes and reduce manual effort.

Security & Compliance

Enforce security baselines, including IAM, encryption, and network segmentation across cloud services.
Collaborate with security teams to implement cloud-native security tools and respond to threats.
Ensure compliance with regulatory standards and audits (SOC 2, ISO 27001, GDPR, HIPAA where applicable).

Team Leadership & Collaboration

Lead, mentor, and develop a team of cloud operations engineers.
Promote a culture of SRE/DevOps best practices, automation, and operational reliability.
Partner with application, DevOps, and networking teams to support business-critical R&D initiatives.
Act as the escalation point for critical incidents and operational challenges.

Vendor & Stakeholder Management

Manage relationships with cloud providers (AWS, Azure, GCP) and monitoring tool vendors.
Provide operational metrics and status updates to senior leadership.
Collaborate with finance to align cloud cost forecasts and budget planning.

Required Qualifications

Education & Experience

Bachelor's degree in Computer Science, IT, or a related field.
5-8 years of experience in cloud operations, SRE, or IT infrastructure.
2+ years in a leadership role managing operational teams, preferably in an R&D environment.

Technical Skills

Expertise in at least one major cloud platform (AWS, Azure, GCP).
Hands-on experience with monitoring and observability tools (CloudWatch, Datadog, New Relic, Prometheus).
Strong knowledge of Infrastructure as Code (Terraform, CloudFormation, ARM templates).
Experience with incident management frameworks and tooling (ITIL, SRE principles, PagerDuty, on-call rotations).
Familiarity with container orchestration (Kubernetes, ECS, AKS, GKE) and CI/CD pipelines.
Understanding of cloud security best practices and compliance frameworks.

Soft Skills

Proven ability to lead and inspire teams in a fast-paced R&D environment.
Strong problem-solving, decision-making, and communication skills.
Collaborative mindset to work effectively with technical and business stakeholders.

Preferred Qualifications

Cloud certifications (AWS SysOps, Azure Administrator, Google Cloud DevOps Engineer, or equivalent).
Experience managing multi-cloud environments.
Knowledge of FinOps and cost governance frameworks.
Familiarity with ITIL processes or formal service management frameworks.

Key Success Metrics

System Uptime: Meet or exceed availability SLAs (>99.9%).
Incident Response: Reduce MTTR (Mean Time to Resolution) for critical incidents.
Cost Efficiency: Optimize resource utilization and achieve measurable cloud cost savings.
Automation: Increase automation coverage for operational tasks year over year.
Team Performance: Maintain high team engagement and development.
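
For candidates who want a concrete feel for the uptime and MTTR targets above, the arithmetic can be sketched as follows. This is a minimal illustration, not the organization's actual reporting method: the function names, the 30-day measurement window, and the sample incident durations are all assumptions made for the example. Under those assumptions, a 99.9% availability target leaves roughly 43 minutes of allowable downtime per 30-day window.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime allowed within `period` while still meeting the SLA.

    Hypothetical helper: the posting only states the >99.9% target,
    not how or over what window it is measured.
    """
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period * (1 - sla_percent / 100)


def mean_time_to_resolution(durations: list[timedelta]) -> timedelta:
    """Average resolution time across a set of incidents (MTTR)."""
    if not durations:
        raise ValueError("at least one incident duration is required")
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    # Assumed 30-day window; real SLAs may use calendar months or quarters.
    budget = downtime_budget(99.9, timedelta(days=30))
    print("Allowed downtime (minutes):", budget.total_seconds() / 60)

    # Made-up incident durations, for illustration only.
    incidents = [
        timedelta(minutes=30),
        timedelta(minutes=60),
        timedelta(minutes=90),
    ]
    print("MTTR:", mean_time_to_resolution(incidents))
```

Tracking the same two figures period over period is one simple way to show whether the uptime and incident-response metrics are trending in the right direction.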