- Implement and manage cloud-based infrastructure that supports HPC environments for data science (e.g., AI/ML workflows, Image Analysis).
- Collaborate with data scientists and ML engineers to deploy scalable machine learning models into production.
- Ensure the security, scalability, and reliability of HPC systems in the cloud.
- Optimize cloud resources for cost-effective and efficient use.
- Stay current with the latest cloud services and industry best practices.
- Provide technical leadership and guidance in cloud and HPC systems management.
- Develop and maintain CI/CD pipelines for deploying resources to multi-cloud environments.
- Monitor and troubleshoot cluster operations, applications, and cloud environments.
- Document system design and operational procedures.
Must-Have Skills:
- Expertise in Linux/Unix system administration (RHEL, CentOS, Ubuntu, etc.).
- Proficiency with job scheduling and resource management tools (SLURM, PBS, LSF, etc.).
- Solid understanding of parallel computing (MPI, OpenMP) and GPU acceleration (CUDA, ROCm).
- Knowledge of storage architectures and distributed file systems (Lustre, GPFS, Ceph).
- Experience with containerization technologies (Singularity, Docker) and cloud-based HPC solutions.
- Expertise in scripting languages (Python, Bash) and container orchestration (Kubernetes).
- Familiarity with automation tools (Ansible, Puppet, Chef) for system provisioning and maintenance.
- Understanding of networking protocols, high-speed interconnects, and security best practices.
- Demonstrable experience in cloud computing (AWS, Azure, GCP) and cloud architecture.
- Experience with infrastructure as code (IaC) tools like Terraform or CloudFormation and Git.
What we expect of you
- We are all different, yet we all use our unique contributions to serve patients.
- Expert knowledge in large Linux environments, networking, storage, and cloud-related technologies.
- Expertise in root-cause analysis and remediation while working with teams and stakeholders.
- Excellent communication and documentation skills are required.
- Proficiency in Python, Bash, and YAML is expected.
Good-to-Have Skills:
- Experience with Kubernetes (EKS) and service mesh architectures.
- Knowledge of AWS Lambda and event-driven architectures.
- Familiarity with AWS CDK, Ansible, or Packer for cloud automation.
- Exposure to multi-cloud environments (Azure, GCP).
Basic Qualifications:
- Bachelor's degree in computer science, IT, or a related field, with 6-8 years of hands-on experience in HPC administration or a related discipline.
Additional Skills:
- Experience supporting research in healthcare life sciences.
- Extensive experience with High-Performance Computing (HPC) and cluster management.
- Familiarity with machine learning frameworks (TensorFlow, PyTorch) and data pipelines.
- Certifications in cloud architecture (AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect, etc.).
- Experience in an Agile development environment.
- Prior work with distributed computing and big data technologies (Hadoop, Spark).
Professional Certifications (preferred):
- Red Hat Certified Engineer (RHCE) or Linux Professional Institute Certification (LPIC).
- AWS Certified Solutions Architect Associate or Professional.
Preferred Qualifications:
Soft Skills:
- Strong analytical and problem-solving skills.
- Ability to work effectively with global, virtual teams.
- Effective communication and collaboration with cross-functional teams.
- Ability to work in a fast-paced, cloud-first environment.