We are looking for an experienced Senior Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our production systems. The ideal candidate will have strong troubleshooting skills, hands-on experience with messaging queues, in-memory queues, Kubernetes, and deployment automation, along with expertise in Infrastructure as Code and microservices architecture.
Key Responsibilities
- Application Troubleshooting: Diagnose and resolve complex application issues in production environments.
- Queue Management: Work with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis) to maintain system performance.
- Deployment & Automation: Manage deployments using CI/CD pipelines and automation tools.
- Kubernetes Administration: Maintain and optimize Kubernetes clusters for high availability and scalability.
- Production Support: Provide support for critical production systems, ensuring uptime and reliability.
- Monitoring & Alerting: Implement and maintain monitoring solutions (Prometheus, Grafana, ELK stack).
- Incident Management: Lead root cause analysis and post-mortem reviews for production incidents.
Must-Have Skills
- Strong experience in troubleshooting application issues in distributed systems.
- Hands-on experience with messaging queues (Kafka, RabbitMQ) and in-memory queues (Redis).
- Proficiency in Kubernetes and container orchestration.
- Experience with CI/CD pipelines and deployment automation.
- Solid understanding of Linux systems, networking, and cloud platforms (AWS, Azure, or GCP).
- Infrastructure as Code experience (Terraform, Ansible).
- Knowledge of microservices architecture.
- Strong scripting and automation skills (Python, Bash, or similar).
- Database expertise: working experience with MySQL, Oracle, or MongoDB.
Nice-to-Have
- Experience integrating with WhatsApp Business Messaging APIs.
- Experience with security best practices in production environments.
- Familiarity with observability tools and performance tuning.