Platform Site Reliability Engineer (SRE)

CirrusLabs

Bengaluru, India

3-6 Years

Job Description

We are CirrusLabs. Our vision is to become the world's most sought-after niche digital transformation company that helps customers realize value through innovation. Our mission is to co-create success with our customers, partners and community. Our goal is to enable employees to dream, grow and make things happen. We are committed to excellence. We are a dependable partner organization that delivers on commitments. We strive to maintain integrity with our employees and customers. Every action we take is driven by value. Our well-knit teams and employees are the core of who we are. You are the core of a values-driven organization.

You have an entrepreneurial spirit. You enjoy working as part of well-knit teams. You value the team over the individual. You welcome diversity at work and within the greater community. You aren't afraid to take risks. You appreciate a growth path, developed with your leadership team, that maps out how you can grow inside and outside of the organization. You thrive on company-sponsored continuing education programs that strengthen your skills and help you become a thought leader ahead of the industry curve.

You are excited about creating change because your skills can help the greater good of every customer, industry and community. We are hiring a talented Platform Site Reliability Engineer (SRE) to join our team. If you're excited to be part of a winning team, CirrusLabs (http://www.cirruslabs.io) is a great place to grow your career.

Experience: 3-6 years

Shift Time: 2 PM - 11 PM IST

We are seeking a Platform Site Reliability Engineer (SRE) to support the reliability, observability, and day-2 operations of modern AI platform environments running performance-sensitive workloads. This role is suited for someone with hands-on experience in production support, monitoring, alerting, incident response, Linux troubleshooting, and operational automation across platform and infrastructure layers.

The ideal candidate has experience with Prometheus, Grafana, and logging/metrics platforms, and can work across compute, platform, DevOps, storage, and network teams to improve service health, reduce alert noise, speed up incident resolution, and strengthen overall platform reliability.

Key Responsibilities

Support reliability and day-2 operations for production platform environments.
Build and maintain monitoring, alerting, dashboards, and operational reporting across infrastructure and platform services.
Use tools such as Prometheus, Grafana, and related observability platforms to track health, availability, capacity, and performance.
Troubleshoot issues across Linux hosts, containers, platform services, and infrastructure dependencies.
Support incident detection, triage, root cause analysis, and post-incident improvements.
Tune alerts and service checks to improve signal quality and reduce false positives.
Partner with platform, compute, storage, DevOps, and network teams to isolate and resolve production issues.
Automate repetitive operational tasks using Bash, Python, Ansible, or similar tools.
Maintain runbooks, monitoring standards, alert documentation, and operational procedures.
Contribute to continuous improvement through standardization, automation, and reliability best practices.

Must Have Skills (3–6 years)

Strong Linux administration and troubleshooting skills
Experience supporting production environments with focus on uptime and operational stability
Experience writing automated tests or synthetic checks for infrastructure/platform validation
Experience with Kubernetes, containers, and distributed platform environments
Hands-on experience with monitoring and alerting in production systems
Experience with Prometheus, Grafana, or similar observability tools
Ability to troubleshoot issues across host, service, infrastructure, and platform layers
Experience with incident triage, support operations, and runbook-driven response
Basic scripting or automation experience using Bash, Python, or Ansible
Strong collaboration skills across platform, infrastructure, DevOps, and support teams
Experience creating or maintaining dashboards, alerts, SOPs, and operational documentation
Hands-on experience with NVIDIA GPU plugins for Kubernetes
Solid understanding of fault-tolerant distributed computing and storage systems, including key reliability, health, and performance metrics to monitor
Hands-on experience using BMC or similar tools for system reboot, diagnostics, and hardware management
Experience patching system software in production or pre-production platform environments
Strong adherence to Agile or Kanban ways of working, including delivering work within defined cadences or flow-based priorities and providing consistent, proactive status updates on progress, risks, and blockers to ensure transparency and predictability

Nice to Have Skills

Experience with ELK, Loki, OpenSearch, or similar logging tools
Experience with NVIDIA GPU infrastructure (DCGM, GPU Operator, NVAIE)
Exposure to hardware-level telemetry (BMC/IPMI, firmware health, thermal/power monitoring, and other lower-level data points)
Exposure to telemetry, exporters, instrumentation, and service health checks
Experience with capacity monitoring, trend analysis, and performance reporting
Familiarity with RCA, postmortems, SLI/SLO concepts, and reliability improvement practices
Exposure to CI/CD pipelines and Git-based operational workflows