Grab (Grab a Grub Services Ltd)

Senior Site Reliability Engineer

  • Posted a day ago

Job Description

About the Role

We are seeking a highly skilled and proactive Senior Site Reliability Engineer (SRE) to
join our dynamic team. In this role, you will be the cornerstone of our application
reliability and performance, working across the entire technology stack. You will bridge
the gap between development and operations, taking ownership of our systems' health,
from deep-dive debugging and incident response to proactive optimization and
preventive engineering.

Your primary mission will be to build, maintain, and improve highly scalable and
reliable systems, ensuring an exceptional experience for our users.

Key Responsibilities

System Reliability & Performance:

  • Design, implement, and maintain highly available, scalable, and fault-tolerant systems.
  • Ensure performance, quality, and responsiveness of applications.
  • Automate operational processes to improve efficiency and reduce manual toil.

Incident Management & Response:

  • Lead the response to, and resolution of, critical incidents and outages.
  • Participate in an on-call rotation, serving as an escalation point for complex system issues.
  • Work under pressure to diagnose and mitigate service disruptions.

Root Cause Analysis & Preventive Measures:

  • Conduct thorough post-incident reviews and Root Cause Analysis (RCA).
  • Drive the implementation of corrective and preventive actions to avoid problem recurrence.
  • Champion a culture of blameless postmortems and continuous improvement.

Application Maintenance & Support:

  • Provide ongoing support, maintenance, and optimization for applications throughout their lifecycle.
  • Debug complex issues across the entire technology stack, from front-end to back-end and database layers.
  • Collaborate with development teams to improve code deployment, monitoring, and operational readiness.

Monitoring & Observability:

  • Utilize New Relic and other tools to build comprehensive monitoring, alerting, and dashboards.
  • Analyze performance data to identify trends, predict capacity needs, and pinpoint bottlenecks before they impact users.

Qualifications & Technical Skills (What We're Looking For)

Must-Have:

  • 4-6 years of experience in a Site Reliability Engineering, DevOps, or similar software engineering role with a focus on operations.
  • Strong hands-on experience in debugging and supporting applications built on:
      • PHP and Node.js
      • MySQL and MongoDB
  • Proven expertise in using New Relic (or similar APM tools such as ELK or Splunk) for deep-dive performance analysis and application monitoring.
  • Demonstrable experience in leading incident management, from detection to resolution, and conducting formal RCAs.
  • Solid understanding of Linux/Unix operating systems and networking fundamentals.

Good-to-Have:

  • Proficiency with containerization and orchestration technologies (e.g., Docker, Kubernetes).
  • Experience with CI/CD pipelines (e.g., Jenkins, GitLab CI, GitHub Actions).
  • Knowledge of cloud platforms (e.g., AWS, GCP, Azure).

More Info

Job ID: 143765223
