- Implement and manage cloud-based infrastructure that supports HPC environments for data science (e.g., AI/ML workflows, Image Analysis).
- Collaborate with data scientists and ML engineers to deploy scalable machine learning models into production.
- Ensure the security, scalability, and reliability of HPC systems in the cloud.
- Optimize cloud resources for cost-effective and efficient use.
- Stay current with the latest cloud services and industry best practices.
- Provide technical leadership and guidance in cloud and HPC systems management.
- Develop and maintain CI/CD pipelines for deploying resources to multi-cloud environments.
- Monitor and troubleshoot cluster operations, applications, and cloud environments.
- Document system design and operational procedures.
Must-Have Skills:
- Expertise in Linux/Unix system administration (RHEL, CentOS, Ubuntu, etc.).
- Proficiency with job scheduling and resource management tools (SLURM, PBS, LSF, etc.).
- Solid understanding of parallel computing (MPI, OpenMP) and GPU acceleration (CUDA, ROCm).
- Knowledge of storage architectures and distributed file systems (Lustre, GPFS, Ceph).
- Experience with containerization technologies (Singularity, Docker) and cloud-based HPC solutions.
- Expertise in scripting languages (Python, Bash) and container orchestration (Kubernetes).
- Familiarity with automation tools (Ansible, Puppet, Chef) for system provisioning and maintenance.
- Understanding of networking protocols, high-speed interconnects, and security best practices.
- Demonstrable experience in cloud computing (AWS, Azure, GCP) and cloud architecture.
- Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation and Git.
What we expect of you
- We are all different, yet we all use our unique contributions to serve patients.
- Expert knowledge in large Linux environments, networking, storage, and cloud-related technologies.
- Expertise in root-cause analysis and remediation while working with teams and stakeholders.
- Excellent communication and documentation skills are required.
- Proficiency in Python, Bash, and YAML is expected.
Good-to-Have Skills:
- Experience with Kubernetes (EKS) and service mesh architectures.
- Knowledge of AWS Lambda and event-driven architectures.
- Familiarity with AWS CDK, Ansible, or Packer for cloud automation.
- Exposure to multi-cloud environments (Azure, GCP).
Basic Qualifications:
- Bachelor's degree in computer science, IT, or a related field, with 6-8 years of hands-on experience in HPC administration or a related discipline.
Additional Skills:
- Experience supporting research in healthcare life sciences.
- Extensive experience with High-Performance Computing (HPC) and cluster management.
- Familiarity with machine learning frameworks (TensorFlow, PyTorch) and data pipelines.
- Certifications in cloud architecture (AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, etc.).
- Experience in an Agile development environment.
- Prior work with distributed computing and big data technologies (Hadoop, Spark).
Professional Certifications (preferred):
- Red Hat Certified Engineer (RHCE) or Linux Professional Institute Certification (LPIC).
- AWS Certified Solutions Architect Associate or Professional.
Preferred Qualifications:
Soft Skills:
- Strong analytical and problem-solving skills.
- Ability to work effectively with global, virtual teams.
- Effective communication and collaboration with cross-functional teams.
- Ability to work in a fast-paced, cloud-first environment.