Proglite

Senior Site Reliability Engineer

Job Description

ROLE AND RESPONSIBILITIES:

A Site Reliability Engineer (SRE) is expected to own the operational stability and performance of hybrid cloud infrastructure (Nutanix, AWS/GCP). This involves leading automation efforts, architecting for reliability, and acting as the final escalation point for critical incidents to ensure the platform is scalable and efficient.

Nutanix Platform Management

  • Design, deploy, and maintain enterprise-scale Nutanix AHV clusters and Prism Central for multi-cluster management
  • Apply expert-level proficiency with the Nutanix CLIs (nCLI and aCLI) for advanced operations, troubleshooting, and automation
  • Develop automation scripts using Nutanix REST APIs, Python SDK, PowerShell, and Terraform for infrastructure-as-code
  • Create and manage VM templates, golden images, and standardized deployment catalogs for consistent provisioning
  • Design disaster recovery solutions using Leap, Protection Domains, cross-cluster replication, and metro clustering
  • Implement network micro-segmentation using Nutanix Flow and configure RBAC, encryption, and security hardening
  • Lead L3 troubleshooting using advanced diagnostics, log analysis (CVM, Genesis), NCC health checks, and cluster service resolution
  • Configure high availability, VM affinity rules, QoS policies, and optimize performance for mission-critical workloads
  • Manage AHV networking with OVS bridges, VLANs, bonds, and LACP, and implement resource reservations and workload balancing
  • Design, deploy, and maintain hybrid cloud infrastructure across Nutanix HCI, AWS, and GCP platforms
  • Architect and implement multi-cloud solutions ensuring high availability, scalability, and disaster recovery

Cloud Platform Engineering

  • Architect and deploy enterprise-scale, highly available multi-cloud solutions across AWS and GCP with multi-region/multi-account strategies
  • Apply expert-level proficiency with the AWS CLI, GCP CLI (gcloud), SDKs, boto3, and Python for advanced automation and infrastructure orchestration
  • Design AWS Organizations and GCP Organization hierarchies with consolidated billing, IAM policies, and centralized governance
  • Configure and manage AWS Systems Manager (SSM) including Session Manager, Run Command, State Manager, and Automation for centralized fleet operations
  • Implement centralized logging using CloudWatch/CloudTrail and GCP Cloud Logging with S3/Cloud Storage aggregation
  • Integrate AWS and GCP with Splunk using HEC, CloudWatch subscriptions, Pub/Sub, Dataflow, and cloud-specific add-ons for SIEM correlation
  • Design and deploy advanced load balancing solutions with AWS ALB/NLB/ELB and GCP Cloud Load Balancing including SSL termination and auto-scaling
  • Develop infrastructure-as-code using Terraform, CloudFormation, and CDK for repeatable multi-cloud deployments and CI/CD pipelines
  • Configure AWS SSO (now IAM Identity Center), cross-account IAM roles, GCP Workload Identity, and federated access for centralized identity management
  • Design VPC architectures with AWS Transit Gateway/PrivateLink and GCP Shared VPC/VPC peering for hybrid connectivity
  • Manage containerized workloads using EKS, GKE, ECS, and Cloud Run with service mesh, observability, and security best practices
  • Implement disaster recovery using AWS Backup, Cross-Region Replication, GCP snapshots, and multi-region failover strategies
  • Troubleshoot issues using CloudWatch Logs Insights, GCP Cloud Trace, VPC Flow Logs, X-Ray, and vendor support escalation
  • Perform cost optimization through Reserved Instances, Committed Use Discounts, rightsizing, and automated resource lifecycle management

System Administration

  • Administer and support Windows Server and Unix/Linux environments in production and non-production settings
  • Perform OS-level hardening, patch management, and security compliance across heterogeneous systems
  • Automate routine administrative tasks using PowerShell, Bash, Python, or similar scripting languages
  • Manage GitHub organization settings, user permissions, repository access controls, and monitor GitHub Actions workflows and repository health across multiple teams
  • Configure Splunk forwarders, heavy forwarders and other integrations for data ingestion from cloud and on-premises sources

PERSONAL AND PROFESSIONAL QUALIFICATIONS:

  • 3+ years of infrastructure experience with Nutanix HCI or other virtualization platforms (VMware) and enterprise cloud (AWS/GCP)
  • Expert-level skills in Python, PowerShell, Bash scripting, infrastructure-as-code (Terraform/CloudFormation), and container orchestration (Kubernetes, EKS/GKE)
  • Proven experience managing enterprise-scale environments, hybrid cloud migrations, disaster recovery, and L3 critical incident management
  • Good knowledge of networking (TCP/IP, VLANs, routing, VPN), security hardening, and compliance frameworks (ITIL)
  • Self-motivated continuous learner committed to staying current with evolving cloud technologies and automation opportunities
  • Available for on-call rotations, with strong documentation skills and a customer service orientation
  • Certifications (a plus): Nutanix NCP/NCAP, AWS Solutions Architect Professional, AWS DevOps Professional, GCP Professional Cloud Architect, Terraform

EDUCATION:

  • Bachelor's or master's degree in computer science/IT

Job ID: 139011535