Job Purpose
Analyse, troubleshoot, and design vital services, platforms, and infrastructure on GCP, with a constant focus on reliability, scalability, resilience, security, and performance.
Job Responsibilities (JR):
- Help build a Site Reliability Engineering culture by sharing best practices, approaches, documentation, and code with other engineering teams
- Apply automation and software to any tasks or parts of the system that are currently performed manually
- Troubleshoot complex, cross-platform issues spanning OS, networking, and databases in a cloud-based SaaS environment, and handle live production incidents
- Monitor application performance, take steps to improve overall application performance and stability, and follow through with implementation
- Design, write, ship, and motivate the creation of software and systems to increase observability, product reliability, and organizational efficiency
- Conduct system analysis and configuration management, and develop improvements for system software performance, availability, and reliability
Key Skills:
- Experience in monitoring and analyzing infrastructure performance using standard performance monitoring tools
- Demonstrable experience in containerization (Docker) and orchestration (Kubernetes)
- Experience with Infrastructure as Code (Terraform, CloudFormation, Ansible)
- Knowledge of and proven hands-on experience with large-scale databases and distributed technologies, such as Kafka and Confluent Platform
- Basic programming and scripting skills