Datavail

Senior Associate Cloud SRE

  • Posted a day ago

Job Description

Job Title: Senior Associate Cloud SRE

Education: Any Graduate

Experience: 4 to 8 years

Location: Mumbai (Hybrid Model)

Employment Type: Full-time

Overview:

We are seeking a Site Reliability Engineer to deliver tier two cloud operations managed services support for AWS environments. This role combines advanced troubleshooting and operational excellence with proactive reliability engineering, focusing on maintaining 24x7x365 service availability while continuously improving automation and operational efficiency.

Role Summary:

As a Site Reliability Engineer supporting AWS infrastructure, you will manage complex operational challenges and escalations while implementing reliability best practices across production systems. You will work collaboratively with customer teams and senior engineers to ensure system stability, automate operational workflows, and maintain comprehensive observability. This is a delivery-focused role requiring both advanced technical execution and operational ownership.

Primary Responsibilities:

Tier 2 Cloud Operations & Managed Services:

  • Provide 24x7x365 tier two support and escalation handling for AWS environments
  • Execute complex operational tasks including:
      • Patching and managing Amazon Machine Images (AMIs)
      • Creating and configuring EC2 instances and RDS databases
      • Managing IAM roles, users, and policies
      • Configuring S3 bucket policies and Access Control Lists (ACLs)
      • Opening and managing network routes
      • Restoring snapshots and database backups to lower environments
      • Increasing disk sizes and managing storage optimization
      • Implementing proper tagging for environment identification and cost allocation
      • Managing log archiving and retention policies
  • Handle escalations from tier one support with deep technical analysis
  • Provide root cause analysis for complex incidents and recurring issues

Reliability & Incident Management:

  • Implement and maintain Service Level Indicators (SLIs) and Service Level Objectives (SLOs) in collaboration with senior engineers and customer stakeholders
  • Lead tier two incident response, performing advanced troubleshooting and resolution
  • Conduct thorough post-incident analysis with actionable remediation plans
  • Reduce reactive work by improving runbooks, alert configurations, and standard operating procedures
  • Apply reliability engineering best practices with oversight and review
  • Mentor tier one engineers during incident response

Automation & Infrastructure as Code:

  • Build and maintain CI/CD pipelines for infrastructure and application deployments
  • Automate complex operational tasks including patching, backups, and environment provisioning
  • Develop infrastructure automation using Terraform or equivalent IaC tools
  • Create sophisticated scripts and tooling to eliminate manual toil and improve operational efficiency
  • Follow established patterns and contribute continuous improvements
  • Document automation processes for knowledge sharing

Containerization & Deployment:

  • Deploy and operate containerized workloads using Docker on AWS services (ECS, EKS, or other managed container platforms)
  • Support container reliability through proper health checks, autoscaling configurations, and resource management
  • Implement safe deployment patterns (canary deployments, blue/green deployments)
  • Troubleshoot complex containerization and orchestration issues
  • Follow and enhance established containerization standards

Observability & Performance:

  • Configure and maintain comprehensive monitoring, logging, and alerting systems
  • Leverage observability data to identify issues and lead root cause analysis
  • Contribute to performance tuning and cost optimization initiatives
  • Ensure proper instrumentation and telemetry across AWS environments
  • Identify patterns and trends to prevent future incidents
  • Build custom dashboards and reports for operational insights

Collaboration & Customer Engagement:

  • Work closely with customer development and operations teams to improve system operability
  • Participate in design reviews and reliability assessments
  • Communicate technical concepts, tradeoffs, and recommendations clearly to stakeholders
  • Provide regular operational updates and service reports
  • Act as technical liaison between customers and internal engineering teams

Required Qualifications:

Experience:

  • 3 to 5 years of hands-on experience in DevOps, SRE, or production operations roles
  • Proven experience operating production systems in AWS environments
  • Demonstrated experience managing containerized applications in production
  • Experience delivering managed services or supporting customer-facing infrastructure
  • Track record of handling complex technical escalations

Technical Skills:

  • AWS Services: Strong working knowledge of EC2, RDS, S3, IAM, VPC, CloudWatch, and related services
  • Containerization: Hands-on experience with Docker and container orchestration platforms (ECS, EKS, or managed Kubernetes)
  • Infrastructure as Code: Proficiency with Terraform or equivalent tools
  • CI/CD: Experience building and maintaining automated deployment pipelines
  • Scripting/Programming: Proficiency in Python, Go, Bash, or similar languages
  • Monitoring & Logging: Experience with observability tools (CloudWatch, Datadog, Splunk, ELK, or similar)
  • Version Control: Proficiency with Git and collaborative development workflows
  • Troubleshooting: Advanced diagnostic and problem-solving capabilities

Operational Capabilities:

  • Experience with 24x7 operations and tier two escalation support
  • Strong troubleshooting and root cause analysis skills
  • Understanding of networking concepts, security best practices, and compliance requirements
  • Familiarity with backup/restore procedures and disaster recovery planning
  • Ability to work under pressure during critical incidents

Preferred Qualifications:

  • AWS certifications (Solutions Architect Associate, SysOps Administrator, or DevOps Engineer Professional)
  • Experience with Kubernetes in production environments
  • Prior consulting or managed services provider experience
  • Multi-cloud experience (Azure, AWS)
  • Experience with configuration management tools (Ansible, Chef, Puppet)
  • Knowledge of security and compliance frameworks (HIPAA, SOC 2, PCI-DSS)
  • Cloud-agnostic certifications (Terraform Associate, CKA, or SRE Foundation)
  • Experience in healthcare, finance, or other regulated industries

About Us

Datavail is a leading provider of data management, application development, analytics, and cloud services, with more than 1,000 professionals helping clients build and manage applications and data via a world-class tech-enabled delivery platform and software solutions across all leading technologies. For more than 17 years, Datavail has worked with thousands of companies spanning different industries and sizes, and is an AWS Advanced Tier Consulting Partner, a Microsoft Solutions Partner for Data & AI and Digital & App Innovation (Azure), an Oracle Partner, and a MySQL Partner.

About The Team

Datavail's Team of Cloud Experts Can Save You Time and Money

Our Cloud experts help clients manage everything from databases, analytics, reporting, migrations, and upgrades to monitoring and overall data management.

You can free up your IT resources to focus on growing your business rather than fighting fires. Our Cloud experts can guide you through strategic initiatives or support routine database management.

Cloud Managed Services

Datavail's business focuses on helping you use your data to drive business results through cost-saving services. The success of your business depends on how well you understand and manage your data. Our managed cloud services give you the power to unleash your organization's potential. We provide comprehensive and technically advanced support for cloud operations to ensure that your infrastructure is safe, secure, and managed with the utmost care.

Our delivery performance in data management leads the industry. We offer highly trained Cloud administrators via a 24x7, always-on, always-available global delivery model.

The combination of a proven delivery model and top-notch experience ensures that Datavail will remain the on-demand Cloud experts you need. Datavail's flexible, client-focused services add value to your organization.

Job ID: 142269367