
CirrusLabs

Platform Site Reliability Engineer (SRE)

  • Posted 10 hours ago

Job Description

We are CirrusLabs. Our vision is to become the world's most sought-after niche digital transformation company that helps customers realize value through innovation. Our mission is to co-create success with our customers, partners and community. Our goal is to enable employees to dream, grow and make things happen. We are committed to excellence. We are a dependable partner organization that delivers on commitments. We strive to maintain integrity with our employees and customers. Every action we take is driven by value. Our well-knit teams and employees are the core of who we are. You are the core of a values-driven organization.

You have an entrepreneurial spirit. You enjoy working as part of well-knit teams. You value the team over the individual. You welcome diversity at work and within the greater community. You aren't afraid to take risks. You appreciate a growth path with your leadership team that maps out how you can grow inside and outside of the organization. You thrive on company-sponsored continuing education programs that strengthen your skills and help you become a thought leader ahead of the industry curve.

You are excited about creating change because your skills can help the greater good of every customer, industry and community. We are hiring a talented Platform Site Reliability Engineer (SRE) to join our team. If you're excited to be part of a winning team, CirrusLabs (http://www.cirruslabs.io) is a great place to grow your career.

Experience: 3-6 years

Shift Time: 2 PM - 11 PM IST

We are seeking a Platform Site Reliability Engineer (SRE) to support the reliability, observability, and day-2 operations of modern AI platform environments running performance-sensitive workloads. This role is suited for someone with hands-on experience in production support, monitoring, alerting, incident response, Linux troubleshooting, and operational automation across platform and infrastructure layers.

The ideal candidate has experience with Prometheus, Grafana, and logging/metrics platforms, and can work across compute, platform, DevOps, storage, and network teams to improve service health, reduce alert noise, speed up incident resolution, and strengthen overall platform reliability.

Key Responsibilities

  • Support reliability and day-2 operations for production platform environments.
  • Build and maintain monitoring, alerting, dashboards, and operational reporting across infrastructure and platform services.
  • Use tools such as Prometheus, Grafana, and related observability platforms to track health, availability, capacity, and performance.
  • Troubleshoot issues across Linux hosts, containers, platform services, and infrastructure dependencies.
  • Support incident detection, triage, root cause analysis, and post-incident improvements.
  • Tune alerts and service checks to improve signal quality and reduce false positives.
  • Partner with platform, compute, storage, DevOps, and network teams to isolate and resolve production issues.
  • Automate repetitive operational tasks using Bash, Python, Ansible, or similar tools.
  • Maintain runbooks, monitoring standards, alert documentation, and operational procedures.
  • Contribute to continuous improvement through standardization, automation, and reliability best practices.

Must Have Skills (3–6 years)

  • Strong Linux administration and troubleshooting skills
  • Experience supporting production environments with focus on uptime and operational stability
  • Experience writing automated tests or synthetic checks for infrastructure/platform validation
  • Experience with Kubernetes, containers, and distributed platform environments
  • Hands-on experience with monitoring and alerting in production systems
  • Experience with Prometheus, Grafana, or similar observability tools
  • Ability to troubleshoot issues across host, service, infrastructure, and platform layers
  • Experience with incident triage, support operations, and runbook-driven response
  • Basic scripting or automation experience using Bash, Python, or Ansible
  • Strong collaboration skills across platform, infrastructure, DevOps, and support teams
  • Experience creating or maintaining dashboards, alerts, SOPs, and operational documentation
  • Hands-on experience with NVIDIA GPU plugins for Kubernetes
  • Solid understanding of fault-tolerant distributed computing and storage systems, including key reliability, health, and performance metrics to monitor
  • Hands-on experience using BMC or similar tools for system reboot, diagnostics, and hardware management
  • Experience patching system software in production or pre-production platform environments
  • Strong adherence to Agile or Kanban ways of working, including delivering work within defined cadences or flow-based priorities and providing consistent, proactive status updates on progress, risks, and blockers to ensure transparency and predictability.

Nice to Have Skills

  • Experience with ELK, Loki, OpenSearch, or similar logging tools
  • Experience with NVIDIA GPU infrastructure (DCGM, GPU Operator, NVAIE)
  • Exposure to hardware-level telemetry and other lower-level data points (BMC/IPMI, firmware health, thermal/power monitoring)
  • Exposure to telemetry, exporters, instrumentation, and service health checks
  • Experience with capacity monitoring, trend analysis, and performance reporting
  • Familiarity with RCA, postmortems, SLI/SLO concepts, and reliability improvement practices
  • Exposure to CI/CD pipelines and Git-based operational workflows


Job ID: 149082739

Similar Jobs

Bengaluru, India

Skills:

Prometheus, Grafana, Terraform, Helm, Kubernetes, Python, OPA, Crossplane, Go, Kyverno, GitHub Actions, ArgoCD