This role focuses on implementing and managing cloud-based infrastructure to support High-Performance Computing (HPC) environments, specifically those enabling data science workloads like AI/ML and image analysis. You'll collaborate with data scientists and ML engineers to deploy scalable machine learning models into production, ensuring the security, scalability, and reliability of cloud HPC systems while optimizing resources for cost-effectiveness. This position, based in Hyderabad, requires you to stay current with cloud services and industry standards, provide technical leadership, and maintain CI/CD pipelines for multi-cloud deployments.
Key Responsibilities:
- Implement and manage cloud-based infrastructure that supports HPC environments (e.g., AI/ML workflows, Image Analysis).
- Collaborate with data scientists and ML engineers to deploy scalable machine learning models into production.
- Ensure the security, scalability, and reliability of HPC systems in the cloud.
- Optimize cloud resources for cost-effective and efficient use.
- Stay current with the latest cloud services and industry-standard practices.
- Provide technical leadership and guidance in cloud and HPC systems management.
- Develop and maintain CI/CD pipelines for deploying resources to multi-cloud environments.
- Monitor and troubleshoot cluster operations, applications, and cloud environments.
- Document system design and operational procedures.
What We Expect of You
We are all different, yet we all use our unique contributions to serve patients. We seek a proactive, technically adept professional with the following qualifications.
Basic Qualifications:
- Master's degree with 4-6 years of experience in Computer Science, IT, or a related field with hands-on HPC administration; OR
- Bachelor's degree with 6-8 years of experience in Computer Science, IT, or a related field with hands-on HPC administration; OR
- Diploma with 10-12 years of experience in Computer Science, IT, or a related field with hands-on HPC administration
- Demonstrable experience in cloud computing (preferably AWS) and cloud architecture.
- Experience with containerization technologies (Singularity, Docker) and cloud-based HPC solutions.
- Experience with infrastructure-as-code (IaC) tools such as Terraform, CloudFormation, Packer, Ansible, and Git.
- Expertise in scripting (Python or Bash) and Linux/Unix system administration (preferably Red Hat or Ubuntu).
- Proficiency with job scheduling and resource management tools (SLURM, PBS, LSF, etc.).
- Knowledge of storage architectures and distributed file systems (Lustre, GPFS, Ceph).
- Understanding of networking architecture and security best practices.
Preferred Qualifications:
- Experience supporting research in healthcare life sciences.
- Experience with Kubernetes (EKS) and service mesh architectures.
- Knowledge of AWS Lambda and event-driven architectures.
- Exposure to multi-cloud environments (Azure, GCP).
- Familiarity with machine learning frameworks (TensorFlow, PyTorch) and data pipelines.
- Certifications in cloud architecture (AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, etc.).
- Experience in an Agile development environment.
- Prior work with distributed computing and big data technologies (Hadoop, Spark).
Professional Certifications (Preferred):
- Red Hat Certified Engineer (RHCE) or Linux Professional Institute Certification (LPIC)
- AWS Certified Solutions Architect - Associate or Professional
Soft Skills:
- Strong analytical and problem-solving skills.
- Ability to work effectively with global, virtual teams.
- Effective communication and collaboration with cross-functional teams.
- Ability to work in a fast-paced, cloud-first environment.