Job Description
The Senior Platform Site Reliability Engineer ensures the reliability, scalability, and availability of NAS AI Ecosystem platforms. This role combines software engineering and operations expertise to automate platform tasks, improve observability, and maintain stable production environments for AI, data, and backend services.
Job Profile Responsibilities
- Implement reliability engineering practices for AI and data platforms
- Define and monitor SLIs, SLOs, and SLAs
- Automate operational processes to reduce manual effort
- Manage monitoring, logging, and alerting systems
- Perform incident response and root cause analysis
- Improve scalability, resilience, and disaster recovery capabilities
- Partner with engineering teams to embed reliability into system design
- Maintain CI/CD pipelines and deployment strategies
- Ensure security and compliance across infrastructure
- Participate in production support and on-call rotations
Requirements & Qualifications
Minimum Requirements
- Experience in Site Reliability Engineering, DevOps, or Platform Engineering
- Proficiency in Python, Go, or Bash
- Experience with Azure, AWS, or GCP
- Hands-on experience with Docker and Kubernetes
- Experience with Prometheus, Grafana, Azure Monitor, or the ELK Stack
- Experience with Terraform, ARM templates, or CloudFormation
- Strong understanding of networking and distributed systems
Preferred Requirements
- Experience supporting AI/ML or data platforms
- Knowledge of chaos engineering and resiliency testing
- Cloud or Kubernetes certifications
- Experience with high-availability, multi-region systems
Educational Requirements