RSM US LLP

Site Reliability Engineer Senior 1

  • Posted 6 days ago

Job Description

The Senior Platform Site Reliability Engineer ensures the reliability, scalability, and availability of NAS AI Ecosystem platforms. This role combines software engineering and operations to automate platform operations, improve observability, and maintain stable production environments for AI, data, and backend services.

Job Profile Responsibilities

  • Implement reliability engineering practices for AI and data platforms
  • Define and monitor SLIs, SLOs, and SLAs
  • Automate operational processes to reduce manual effort
  • Manage monitoring, logging, and alerting systems
  • Perform incident response and root cause analysis
  • Improve scalability, resilience, and disaster recovery capabilities
  • Partner with engineering teams to embed reliability into system design
  • Maintain CI/CD pipelines and deployment strategies
  • Ensure security and compliance across infrastructure
  • Participate in production support and on-call rotations

Requirements & Qualifications

Minimum Requirements

  • Experience in Site Reliability Engineering, DevOps, or Platform Engineering
  • Proficiency in Python, Go, or Bash
  • Experience with Azure, AWS, or GCP
  • Hands-on experience with Docker and Kubernetes
  • Experience with Prometheus, Grafana, Azure Monitor, or ELK
  • Experience with Terraform, ARM, or CloudFormation
  • Strong understanding of networking and distributed systems

Preferred Requirements

  • Experience supporting AI/ML or data platforms
  • Knowledge of chaos engineering and resiliency testing
  • Cloud or Kubernetes certifications
  • Experience with high-availability, multi-region systems

Educational Requirements

  • Bachelor's degree

Job ID: 144016943