We are seeking a Senior High-Performance Computing (HPC) Engineer to deploy, maintain, and support HPC infrastructure in a multi-cloud environment. This hands-on role requires deep technical expertise in HPC technology and is vital for supporting data science, AI/ML workflows, and image analysis. The ideal candidate will have expert knowledge of large Linux environments, networking, storage, and cloud technologies, with a proven ability to perform root-cause analysis.
Roles & Responsibilities
- Infrastructure Management: Implement and manage cloud-based infrastructure that supports HPC environments. Ensure the security, scalability, and reliability of these systems.
- Collaboration & Optimization: Work closely with data scientists and ML engineers to deploy scalable machine learning models. Optimize cloud resources for cost-effective and efficient use.
- Automation & Monitoring: Develop and maintain CI/CD pipelines for deploying resources to multi-cloud environments. Monitor and troubleshoot cluster operations and cloud environments.
- Technical Leadership: Provide technical leadership and guidance in cloud and HPC systems management. Document system design and operational procedures.
Qualifications
- A Bachelor's degree in Computer Science, IT, or a related field, plus hands-on experience in HPC administration.
- Expert Linux/Unix system administration experience (RHEL, CentOS, Ubuntu, etc.).
- Proficiency with job scheduling and resource management tools (SLURM, PBS, LSF).
- Good understanding of parallel computing (MPI, OpenMP) and GPU acceleration (CUDA, ROCm).
- Knowledge of storage architectures and distributed file systems (Lustre, GPFS, Ceph).
- Expertise in scripting languages (Python, Bash) and containerization technologies (Docker, Kubernetes).
- Experience with Infrastructure as Code (IaC) tools such as Terraform or CloudFormation, and with Git for version control.
- Experience in cloud computing (AWS, Azure, GCP) and a strong understanding of cloud architecture.
- Red Hat Certified Engineer (RHCE) or AWS Certified Solutions Architect certification is preferred.
Skills & Competencies
- Problem-Solving: Strong analytical and problem-solving skills, with expertise in root-cause analysis and troubleshooting.
- Communication: Excellent communication and documentation skills are essential.
- Collaboration: The ability to work effectively with global, virtual, and cross-functional teams in a fast-paced, cloud-first environment.
- Technical: Experience with multi-cloud environments, machine learning frameworks (TensorFlow, PyTorch), and distributed computing technologies is a plus.
- Onsite & On-Call: This position is onsite and includes a 24/5 and weekend on-call rotation, with the possibility of working later shifts.