Job Summary
We are looking for an experienced Solution Architect specialising in High-Performance Computing (HPC) and Network Infrastructure. The successful candidate will design, deploy, and optimise large-scale HPC environments that support advanced workloads such as artificial intelligence, data analytics, and scientific computing.
Key Responsibilities
- Design and architect HPC cluster solutions, including compute, storage, and high-speed networking
- Develop and implement AI/ML and data-intensive computing solutions using modern technologies
- Deploy, configure, and manage HPC environments using tools such as OpenHPC, ROCKS, Bright Cluster Manager, and Kubernetes
- Manage cluster resource managers and schedulers such as Slurm, PBS, LSF, and Torque
- Design and maintain high-performance storage systems (Lustre, GPFS, NFS) and network interconnects (InfiniBand, Ethernet)
- Perform benchmarking, performance tuning, and optimisation of HPC systems
- Collaborate with stakeholders to gather requirements and deliver scalable technical solutions
- Support installation, automation, and OS deployment using tools such as Ansible, Puppet, and Chef
- Lead troubleshooting of complex issues related to Linux OS, cluster hardware, GPU computing, and networking
- Prepare technical documentation, RFPs, and PRDs, and evaluate hardware vendors and OEMs
- Work closely with research and technical teams to optimise scientific and AI workloads
- Provide guidance on HPC architecture, cloud integration (AWS, Azure, GCP), and virtualisation