We are seeking a Cloud Operations Lead to support a leading IT R&D organization in Kolkata. This role ensures the stability, performance, and security of cloud-based systems while driving operational excellence through proactive monitoring, incident management, automation, and capacity planning. You will lead cross-functional teams, optimize cloud resources for cost efficiency, and champion automation to reduce manual effort and improve reliability.
Key Responsibilities
Cloud Operations & Reliability
- Manage day-to-day operations across production, staging, and development cloud environments within an R&D context.
- Ensure high availability of services through robust monitoring, alerting, and incident response processes.
- Lead root cause analyses (RCA) and post-mortem reviews to drive continuous improvement.
- Implement observability practices including logging, tracing, and metrics for proactive issue detection.
- Oversee patch management and maintenance to ensure systems remain secure and up to date.
Automation & Optimization
- Develop and maintain automation scripts for provisioning, scaling, and monitoring cloud resources.
- Optimize cloud usage through rightsizing, reserved instances, and cost governance (FinOps).
- Standardize operational runbooks and playbooks to streamline processes and reduce manual effort.
Security & Compliance
- Enforce security baselines, including IAM, encryption, and network segmentation across cloud services.
- Collaborate with security teams to implement cloud-native security tools and respond to threats.
- Ensure compliance with regulatory standards (SOC 2, ISO 27001, GDPR, and HIPAA where applicable) and support the related audits.
Team Leadership & Collaboration
- Lead, mentor, and develop a team of cloud operations engineers.
- Promote a culture of SRE/DevOps best practices, automation, and operational reliability.
- Partner with application, DevOps, and networking teams to support business-critical R&D initiatives.
- Act as the escalation point for critical incidents and operational challenges.
Vendor & Stakeholder Management
- Manage relationships with cloud providers (AWS, Azure, GCP) and monitoring tool vendors.
- Provide operational metrics and status updates to senior leadership.
- Collaborate with finance to align cloud cost forecasts and budget planning.
Required Qualifications
Education & Experience
- Bachelor's degree in Computer Science, IT, or a related field.
- 5-8 years of experience in cloud operations, SRE, or IT infrastructure.
- 2+ years in a leadership role managing operational teams, preferably in an R&D environment.
Technical Skills
- Expertise in at least one major cloud platform (AWS, Azure, GCP).
- Hands-on experience with monitoring and observability tools (CloudWatch, Datadog, New Relic, Prometheus).
- Strong knowledge of Infrastructure as Code (Terraform, CloudFormation, ARM templates).
- Experience with incident management frameworks and tooling (ITIL, SRE principles, PagerDuty, on-call rotations).
- Familiarity with container orchestration (Kubernetes, ECS, AKS, GKE) and CI/CD pipelines.
- Understanding of cloud security best practices and compliance frameworks.
Soft Skills
- Proven ability to lead and inspire teams in a fast-paced R&D environment.
- Strong problem-solving, decision-making, and communication skills.
- Collaborative mindset to work effectively with technical and business stakeholders.
Preferred Qualifications
- Cloud certifications (AWS SysOps, Azure Administrator, Google Cloud DevOps Engineer, or equivalent).
- Experience managing multi-cloud environments.
- Knowledge of FinOps and cost governance frameworks.
- Familiarity with ITIL processes or formal service management frameworks.
Key Success Metrics
- System Uptime: Meet or exceed availability SLAs (>99.9%).
- Incident Response: Reduce MTTR (Mean Time to Resolution) for critical incidents.
- Cost Efficiency: Optimize resource utilization and achieve measurable cloud cost savings.
- Automation: Increase automation coverage for operational tasks year over year.
- Team Performance: Maintain high team engagement and development.