elevarae

Cloud Operations & Reliability Lead

  • Posted 7 days ago

Job Description

We are seeking a Cloud Operations Lead to support a leading IT R&D organization in Kolkata. This role ensures the stability, performance, and security of cloud-based systems while driving operational excellence through proactive monitoring, incident management, automation, and capacity planning. You will lead cross-functional teams, optimize cloud resources for cost efficiency, and champion automation to reduce manual effort and improve reliability.

Key Responsibilities

Cloud Operations & Reliability

  • Manage day-to-day operations across production, staging, and development cloud environments within an R&D context.
  • Ensure high availability of services through robust monitoring, alerting, and incident response processes.
  • Lead root cause analyses (RCA) and post-mortem reviews to drive continuous improvement.
  • Implement observability practices including logging, tracing, and metrics for proactive issue detection.
  • Oversee patch management and maintenance to ensure systems remain secure and up to date.

Automation & Optimization

  • Develop and maintain automation scripts for provisioning, scaling, and monitoring cloud resources.
  • Optimize cloud usage through rightsizing, reserved instances, and cost governance (FinOps).
  • Standardize operational runbooks and playbooks to streamline processes and reduce manual effort.

Security & Compliance

  • Enforce security baselines, including IAM, encryption, and network segmentation across cloud services.
  • Collaborate with security teams to implement cloud-native security tools and respond to threats.
  • Ensure compliance with regulatory standards and audits (SOC 2, ISO 27001, GDPR, HIPAA where applicable).

Team Leadership & Collaboration

  • Lead, mentor, and develop a team of cloud operations engineers.
  • Promote a culture of SRE/DevOps best practices, automation, and operational reliability.
  • Partner with application, DevOps, and networking teams to support business-critical R&D initiatives.
  • Act as the escalation point for critical incidents and operational challenges.

Vendor & Stakeholder Management

  • Manage relationships with cloud providers (AWS, Azure, GCP) and monitoring tool vendors.
  • Provide operational metrics and status updates to senior leadership.
  • Collaborate with finance to align cloud cost forecasts and budget planning.

Required Qualifications

Education & Experience

  • Bachelor's degree in Computer Science, IT, or a related field.
  • 5–8 years of experience in cloud operations, SRE, or IT infrastructure.
  • 2+ years in a leadership role managing operational teams, preferably in an R&D environment.

Technical Skills

  • Expertise in at least one major cloud platform (AWS, Azure, GCP).
  • Hands-on experience with monitoring and observability tools (CloudWatch, Datadog, New Relic, Prometheus).
  • Strong knowledge of Infrastructure as Code (Terraform, CloudFormation, ARM templates).
  • Experience with incident management frameworks (ITIL, SRE principles, PagerDuty/On-Call rotations).
  • Familiarity with container orchestration (Kubernetes, ECS, AKS, GKE) and CI/CD pipelines.
  • Understanding of cloud security best practices and compliance frameworks.

Soft Skills

  • Proven ability to lead and inspire teams in a fast-paced R&D environment.
  • Strong problem-solving, decision-making, and communication skills.
  • Collaborative mindset to work effectively with technical and business stakeholders.

Preferred Qualifications

  • Cloud certifications (AWS SysOps, Azure Administrator, Google Cloud DevOps Engineer, or equivalent).
  • Experience managing multi-cloud environments.
  • Knowledge of FinOps and cost governance frameworks.
  • Familiarity with ITIL processes or formal service management frameworks.

Key Success Metrics

  • System Uptime: Meet or exceed availability SLAs (>99.9%).
  • Incident Response: Reduce MTTR (mean time to resolution) for critical incidents.
  • Cost Efficiency: Optimize resource utilization and achieve measurable cloud cost savings.
  • Automation: Increase automation coverage for operational tasks year over year.
  • Team Performance: Maintain high team engagement and development.

Job ID: 136721127