Overview:
We are looking for a Senior Site Reliability Engineer to join our Engineering Infrastructure team. In this role, you will own the reliability, performance, and operational excellence of Generac's cloud-native software platforms. You will bridge the gap between development and operations by embedding SRE practices across engineering squads, driving automation, and ensuring our systems meet the highest availability and performance standards. You will report directly to the Sr. Manager, Site Reliability Engineering.
Responsibilities:
1. Incident Response & On-Call Management
- Own and lead incident response for production outages, coordinating cross-functional teams to drive rapid resolution and minimize customer impact.
- Maintain and evolve on-call runbooks, escalation paths, and post-mortem processes to build a culture of blameless learning.
- Conduct thorough root cause analysis (RCA) and implement preventive measures to reduce mean time to recovery (MTTR) and increase mean time between failures (MTBF).
- Define, track, and report on SLOs, SLIs, and error budgets, using Grafana dashboards to surface real-time reliability signals to engineering leadership.
- Champion proactive alerting strategies, eliminating alert fatigue and ensuring actionable notifications reach the right teams at the right time.
2. Infrastructure Automation & IaC
- Design, build, and maintain infrastructure-as-code (IaC) using Terraform and Ansible to provision and manage cloud resources across AWS (primary), GCP, and Azure.
- Automate repeatable operational tasks, reducing toil and enabling engineering teams to move faster with confidence.
- Lead Kubernetes cluster management and lifecycle operations, including upgrades, scaling, networking, and security hardening across environments.
- Manage and optimize GitHub Actions CI/CD pipelines, ensuring reliable, fast, and secure software delivery from code commit to production.
- Establish standards and best practices for environment consistency, secret management, and infrastructure drift detection.
3. Performance & Capacity Planning
- Lead capacity planning initiatives for multi-cloud infrastructure (AWS primary, GCP, Azure legacy), ensuring systems scale efficiently to meet business demand.
- Develop load testing frameworks and performance benchmarking strategies to identify bottlenecks before they impact customers.
- Analyze trends in system resource utilization and provide data-driven recommendations for cost optimization and right-sizing.
- Collaborate with engineering leadership on architecture reviews to ensure systems are designed with scalability and reliability as first-class concerns.
- Build and maintain Grafana dashboards and alerting rules that provide end-to-end visibility into system performance and capacity headroom.
4. Developer Tooling & Platform Engineering
- Build and maintain internal developer platforms that improve engineering velocity, standardize observability, and reduce operational complexity.
- Partner with software engineering teams to embed reliability practices early in the SDLC, shifting left on reliability, security, and performance.
- Provide SRE consultation to product squads on service architecture, deployment patterns, and observability instrumentation.
- Evangelize and implement best practices around feature flags, canary deployments, blue/green strategies, and rollback mechanisms.
- Contribute to a shared services model that enables development teams to self-serve infrastructure needs safely and efficiently.
Required Qualifications:
- 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
- Deep expertise in AWS (EC2, EKS, RDS, Lambda, S3, CloudWatch, IAM, VPC, Route 53); familiarity with GCP and Azure is a plus.
- Hands-on experience with Kubernetes administration, including cluster upgrades, RBAC, networking (CNI plugins), and storage.
- Proficiency in infrastructure-as-code tools, especially Terraform; experience with Ansible or similar configuration management tools.
- Experience designing and managing GitHub Actions CI/CD pipelines at scale.
- Strong observability skills, including experience with Grafana, Prometheus, or equivalent monitoring and alerting stacks.
- Solid programming or scripting skills in Python, Go, Bash, or similar languages for automation and tooling.
- Demonstrated ability to lead incident response and drive structured post-mortem processes.
- Experience defining and managing SLOs, SLIs, and error budgets in production environments.
- Excellent communication skills, with the ability to translate complex technical concepts for both engineering and business stakeholders.
Preferred Skills:
- Experience working in a multi-cloud environment, particularly managing legacy workloads in GCP or Azure alongside AWS.
- Familiarity with service mesh technologies (Istio, Linkerd) and advanced Kubernetes networking.
- Experience with chaos engineering tools (Chaos Monkey, Gremlin) and fault-injection testing.
- Background in platform engineering or internal developer portal (IDP) development.
- Knowledge of FinOps practices for cloud cost optimization and right-sizing.
- AWS certifications (Solutions Architect, DevOps Engineer) or CKA/CKAD are advantageous.