ROLE AND RESPONSIBILITIES:
A Site Reliability Engineer (SRE) is expected to own the operational stability and performance of hybrid cloud infrastructure (Nutanix, AWS/GCP). This involves leading automation efforts, architecting for reliability, and acting as the final escalation point for critical incidents to ensure the platform is scalable and efficient.
Nutanix Platform Management
- Design, deploy, and maintain enterprise-scale Nutanix AHV clusters and Prism Central for multi-cluster management
- Apply expert-level proficiency with the Nutanix CLIs (nCLI and aCLI) for advanced operations, troubleshooting, and automation
- Develop automation scripts using Nutanix REST APIs, Python SDK, PowerShell, and Terraform for infrastructure-as-code
- Create and manage VM templates, golden images, and standardized deployment catalogs for consistent provisioning
- Design disaster recovery solutions using Leap, Protection Domains, cross-cluster replication, and metro clustering
- Implement network micro-segmentation using Nutanix Flow and configure RBAC, encryption, and security hardening
- Lead L3 troubleshooting using advanced diagnostics, log analysis (CVM, Genesis), NCC health checks, and resolution of cluster service issues
- Configure high availability, VM affinity rules, and QoS policies, and optimize performance for mission-critical workloads
- Manage AHV networking with OVS bridges, VLANs, bonds, and LACP, and implement resource reservations and workload balancing
- Design, deploy, and maintain hybrid cloud infrastructure across Nutanix HCI, AWS, and GCP platforms
- Architect and implement multi-cloud solutions ensuring high availability, scalability, and disaster recovery
Cloud Platform Engineering
- Architect and deploy enterprise-scale, highly available multi-cloud solutions across AWS and GCP with multi-region/multi-account strategies
- Apply expert-level proficiency with the AWS CLI, gcloud CLI, cloud SDKs (including boto3), and Python for advanced automation and infrastructure orchestration
- Design AWS Organizations and GCP Organization hierarchies with consolidated billing, IAM policies, and centralized governance
- Configure and manage AWS Systems Manager (SSM) including Session Manager, Run Command, State Manager, and Automation for centralized fleet operations
- Implement centralized logging using CloudWatch/CloudTrail and GCP Cloud Logging with S3/Cloud Storage aggregation
- Integrate AWS and GCP with Splunk using HEC, CloudWatch subscriptions, Pub/Sub, Dataflow, and cloud-specific add-ons for SIEM correlation
- Design and deploy advanced load balancing solutions with AWS Elastic Load Balancing (ALB/NLB) and GCP Cloud Load Balancing, including SSL/TLS termination and auto-scaling
- Develop infrastructure-as-code using Terraform, CloudFormation, and CDK for repeatable multi-cloud deployments and CI/CD pipelines
- Configure AWS IAM Identity Center (formerly AWS SSO), cross-account IAM roles, GCP Workload Identity, and federated access for centralized identity management
- Design VPC architectures with AWS Transit Gateway/PrivateLink and GCP Shared VPC/VPC peering for hybrid connectivity
- Manage containerized workloads using EKS, GKE, ECS, and Cloud Run with service mesh, observability, and security best practices
- Implement disaster recovery using AWS Backup, Cross-Region Replication, GCP snapshots, and multi-region failover strategies
- Troubleshoot issues using CloudWatch Insights, GCP Cloud Trace, VPC Flow Logs, X-Ray, and vendor support escalation
- Perform cost optimization through Reserved Instances, Committed Use Discounts, rightsizing, and automated resource lifecycle management
System Administration
- Administer and support Windows Server and Unix/Linux environments in production and non-production settings
- Perform OS-level hardening, patch management, and security compliance across heterogeneous systems
- Automate routine administrative tasks using PowerShell, Bash, Python, or similar scripting languages
- Manage GitHub organization settings, user permissions, repository access controls, and monitor GitHub Actions workflows and repository health across multiple teams
- Configure Splunk forwarders (including heavy forwarders) and other integrations for data ingestion from cloud and on-premises sources
PERSONAL AND PROFESSIONAL QUALIFICATIONS:
- 3+ years of infrastructure experience with Nutanix HCI or another virtualization platform (e.g., VMware) and enterprise cloud (AWS/GCP)
- Expert-level skills in Python, PowerShell, Bash scripting, infrastructure-as-code (Terraform/CloudFormation), and container orchestration (Kubernetes, EKS/GKE)
- Proven experience managing enterprise-scale environments, hybrid cloud migrations, disaster recovery, and L3 critical incident management
- Solid knowledge of networking (TCP/IP, VLANs, routing, VPN), security hardening, and service management and compliance frameworks (e.g., ITIL)
- Self-motivated continuous learner committed to staying current with evolving cloud technologies and automation opportunities
- Available for on-call rotations, with strong documentation skills and a customer-service orientation
- Certifications (a plus): Nutanix NCP/NCAP, AWS Solutions Architect Professional, AWS DevOps Professional, GCP Professional Cloud Architect, Terraform
EDUCATION:
- Bachelor's or master's degree in computer science/IT