
Job Summary
Responsibilities
Lead efforts to improve observability, including dashboards, alerting, metrics, and log monitoring.
Drive reliability enhancements across systems, with focus on reducing downtime, improving MTTR, and managing SLOs.
Identify, quantify, and eliminate toil through automation and process improvements.
Troubleshoot and resolve complex, escalated technical issues across application, database, infrastructure, and network layers.
Analyse logs, traces, metrics, and distributed system behaviour to isolate root causes.
Reproduce issues in staging environments to validate and verify fixes.
Work closely with Development, Product, QA, and Support (L1/L2/L3) teams to ensure end-to-end service reliability.
Document resolutions, write internal runbooks, contribute to knowledge base articles, and suggest system and process improvements.
Participate in 24x7 on-call rotations, ensuring high-quality incident response.
Monitor and optimise the performance, cost, and availability of customer-facing AWS infrastructure.
Must Have
Bachelor's degree in Computer Science, IT, or related field, or equivalent practical experience.
2-4 years of experience in SRE, L3 Technical Support, Reliability Engineering, or a similar hands-on technical operations role.
Hands-on experience with AWS services such as EC2, ECS, S3, RDS, and CloudWatch.
Proficient in log analysis, SQL queries, tracing tools, and debugging REST APIs.
Strong analytical, problem-solving, and structured troubleshooting skills.
Excellent communication skills and a customer-first mindset for handling escalations.
Ability to multitask and prioritise tasks in a fast-paced engineering environment.
Willingness to work in rotational shifts and on-call schedules (24x7).
Experience working in public cloud environments: AWS (preferred), GCP, or Azure.
Preferred Skills
Hands-on experience with observability platforms such as ELK Stack, Datadog, CloudWatch, or similar, including building dashboards and alert rules.
Strong scripting skills (Python preferred; Bash acceptable).
Experience working in SaaS product companies.
Experience with CRM and workflow tools (Salesforce, Zoho, HubSpot).
Familiarity with JIRA, Confluence, Git, and engineering collaboration tools.
Working knowledge of CI/CD concepts and pipelines.
Experience with IaC tools like Terraform.
At SMS Magic, we believe that the growth of our people is directly linked to the growth of our company. Our culture fosters high-performance teaming, allowing individuals to reach their full potential while contributing to a world-class CRM messaging company.
We offer:
The freedom and flexibility to manage your role in a way that works best for you.
Exposure to a dynamic and expanding global business environment.
Access to innovative and cutting-edge technology and tools.
Opportunities to apply analytical capabilities and make a significant impact on business teams.
A competitive compensation package, with rewards based on performance and contributions.
A work environment that promotes balance, ensuring employees maintain an active, healthy, and fulfilling life inside and outside of work.
Whenever you join, however long you stay, the exceptional SMS Magic experience lasts a lifetime.
Job ID: 139212487