Job Description
We are seeking a proactive and technically strong Site Reliability Engineer (SRE) who can ensure the stability, performance, and resilience of our production systems. The ideal candidate is passionate about reliability, experienced in monitoring complex systems, and skilled in automation and incident management. They should be comfortable working in fast-paced environments, collaborating across teams, and participating in an on-call rotation to maintain 24/7 system availability.
Responsibilities
- Monitoring & Alerting
Review, refine, and implement monitoring tools and dashboards to track system performance and key operational metrics.
Continuously monitor logs and alerts to detect issues and anomalies.
- Incident Management & Response
Prioritize incidents based on severity and business impact.
Lead incident resolution efforts, applying fixes and mitigations to restore normal operations swiftly.
Participate in 24/7 on-call rotations for incident response.
Troubleshoot alerts and coordinate with the NOC and Engineering teams during critical events.
Support post-incident reviews, identifying root causes and recommending preventive actions.
- Automation & Tooling
Develop and enhance automation scripts and tooling to improve operational efficiency and reduce manual tasks.
Maintain and update monitoring, deployment, and incident-management tools.
- Performance Optimization
Analyze application and system performance using profiling tools.
Identify bottlenecks and drive improvements through optimizations, infrastructure upgrades, or architectural changes.
- Capacity Planning & Scaling
Forecast and plan resource requirements based on usage trends.
Scale infrastructure components (e.g., servers, databases) to support growth.
Automate sizing and scaling workflows to ensure efficiency and consistency.
- Disaster Recovery & Redundancy
Adhere to disaster recovery policies and procedures to maintain business continuity.
Implement redundancy, failover, and resilience strategies to minimize downtime.
- Knowledge Sharing & Documentation
Create and maintain documentation for processes, configurations, incident resolutions, and best practices.
Promote knowledge sharing within the team through training and regular knowledge-sharing sessions.
- Continuous Improvement
Incorporate lessons learned from incidents and feedback loops to drive continuous process improvements.
Enhance tools, workflows, and systems to increase reliability and stability.
- Collaboration & Communication
Work closely with Product, Engineering, and Core teams to align operational priorities.
Maintain transparent and timely communication with stakeholders during incidents and ongoing initiatives.
Qualifications & Skills
- 2-3 years of relevant experience.
- Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
- Strong understanding of Linux/Unix systems and cloud infrastructure.
- Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog, ELK, CloudWatch).
- Hands-on experience with incident management and on-call rotations.
- Proficiency in scripting languages such as Python, Bash, or Go.
- Knowledge of containerization and orchestration (Docker, Kubernetes).
- Strong troubleshooting and analytical skills.
- Familiarity with CI/CD pipelines and DevOps practices.
- Ability to work under pressure during high-severity incidents.
Preferred Abilities
- Experience implementing automation for scaling, failover, and recovery processes.
- Knowledge of distributed systems, microservices, and high-availability architectures.
- Exposure to disaster recovery planning and execution.
- Experience with message queues, caching layers, and database performance tuning.
- Strong communication and cross-team collaboration skills.
- Ability to document complex systems clearly and comprehensively.