Job Overview:
As a Site Reliability Engineer (SRE), you will be responsible for ensuring the reliability, availability, and performance of critical systems and services. You will combine strong technical skills with operational excellence to proactively monitor, troubleshoot, and resolve issues. Your expertise in observability will help maintain robust monitoring, alerting, and incident response processes, ensuring seamless operations around the clock. This role requires working 24/7 shifts on a monthly rotation.
Main Responsibilities:
- Monitor production systems and services using observability tools (logs, metrics, traces, dashboards).
- Respond to incidents, alerts, and outages in real time, ensuring rapid resolution and minimal impact.
- Participate in a rotating on-call schedule, providing support during nights, weekends, and holidays.
- Design, implement, and maintain observability solutions (e.g., Prometheus, Grafana, and the ELK stack).
- Develop and refine dashboards, alerts, and automated health checks for critical infrastructure and applications.
- Analyze system performance and reliability data to identify trends and prevent future incidents.
- Collaborate with development, infrastructure, application, and security teams to ensure system reliability and scalability.
- Automate operational tasks and incident response processes using scripting and configuration management tools.
- Document procedures, runbooks, and incident reports for knowledge sharing and continuous improvement.
- Conduct post-incident reviews and root cause analysis to drive improvements in reliability and response.
Key Requirements:
- Bachelor's degree in Information Technology, Computer Science, Business Administration, or a related field.
- 2-5 years of experience in cloud engineering and operations.
- Proven experience with Azure services; experience with AWS and GCP is an advantage.
- Hands-on experience with Infrastructure-as-Code (IaC) tools such as Terraform.
- Strong scripting skills in Python, Bash, or PowerShell for automation tasks.
- Familiarity with GitLab CI/CD and experience integrating it with Azure.
- Proficiency in monitoring and logging tools, such as native cloud monitoring tools, OpenMetrics, and OpenTelemetry.
Nice to Have:
- Master's degree or relevant certifications.
Other Details:
This position offers the flexibility of a hybrid work environment. Gain valuable experience in cloud and AI technology while being part of a highly motivated team. Enjoy a competitive remuneration package while charting your own course for career advancement.