- Run the production environment by monitoring availability and taking a holistic view of system health.
- Provide predictive insights into the health of the system and suggest measures to optimize and safeguard against future abnormalities.
- Build software and systems to manage platform infrastructure and applications.
- Improve the reliability, quality, and time-to-market of our suite of cloud and on-prem software solutions.
- Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating for continual improvement.
- Provide primary operational support and engineering for multiple large-scale distributed infrastructure and related applications.
Must-Have Skills:
- 5+ years of experience and a proven track record of maintaining and supporting large-scale infrastructure and cloud systems.
- Gather and analyze metrics from operating systems as well as applications to assist in performance tuning and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, platform management, and capacity planning.
- Create sustainable systems and services through automation and uplifts.
- Balance feature development speed and reliability with well-defined service-level objectives.
- In-depth and hands-on knowledge of automation technologies with extensive expertise in Terraform or Ansible.
- In-depth and hands-on knowledge of Linux and MySQL, plus programming and scripting in Bash, Python, or an alternative language.
- In-depth knowledge of maintaining on-prem cloud solutions such as OpenStack / CloudStack / OpenNebula / vCloud.
- In-depth and hands-on knowledge of containers and container orchestration using Kubernetes.
- In-depth and hands-on knowledge of any monitoring system (Prometheus / Nagios / Zabbix / SolarWinds / ManageEngine etc.). Experience implementing correlation and predictive analysis in system monitoring.
- Extensive hands-on experience implementing and maintaining high-availability systems, including ensuring backups and seamless business continuity.
- Thorough conceptual knowledge of distributed systems, storage, networking, SDN, SDS.
Good-to-Have Skills:
- Knowledge of CloudStack/Citrix CloudPlatform and involvement as an administrator / maintainer / committer / tester / support engineer.
- Data centre or ISP experience in a similar role.
- Knowledge of GPU-based systems, Nvidia BCM, GPU virtualisation techniques.
- Experience supporting AI/ML workloads.
Qualifications and Experience:
- Relevant bachelor's degree.