
HPC System Administrator
Job Summary
We are seeking an experienced High-Performance Computing (HPC) System Administrator to manage, maintain, and optimize large-scale HPC clusters and infrastructure. This role focuses on ensuring reliable system operations, implementing robust monitoring solutions, managing user environments, and maintaining high availability of compute resources for research and production workloads.
Key Responsibilities
Install, configure, and maintain HPC cluster hardware and software components
Manage job scheduling systems (SLURM, PBS, LSF) and optimize queue configurations
Monitor system performance, resource utilization, and cluster health using tools such as Nagios, Zabbix, or Ganglia
Administer user accounts, permissions, and resource allocations across compute nodes
Deploy and maintain software stacks, compilers, libraries, and scientific applications
Implement and maintain backup strategies and disaster recovery procedures
Troubleshoot hardware failures, network issues, and software conflicts
Apply regular system updates and security patches, and plan and run scheduled maintenance windows
Manage storage systems including parallel file systems (Lustre, GPFS, BeeGFS)
Coordinate with vendors for hardware support and warranty services
Create and maintain system documentation and operational procedures
Required Qualifications
Bachelor's degree in Computer Science, Information Technology, or related field
6+ years of experience administering Linux-based HPC systems
Strong knowledge of Linux system administration (RHEL, CentOS, Ubuntu)
Experience with job scheduling systems (SLURM preferred)
Proficiency in shell scripting (Bash) and system automation
Knowledge of networking concepts including InfiniBand and Ethernet fabrics
Experience with configuration management tools (Ansible, Puppet, Chef)
Understanding of parallel file systems and storage technologies
Familiarity with HPC interconnects and high-speed networking
Experience with system monitoring tools (Nagios, Zabbix, Ganglia)
Preferred Skills
Experience with container technologies (Singularity, Docker) in HPC environments
Knowledge of virtualization technologies (KVM, VMware)
Familiarity with cloud computing platforms and hybrid cloud deployments
Experience with GPU computing and CUDA environments
Understanding of MPI, OpenMP, and other parallel programming models
Knowledge of security best practices for multi-user HPC environments
Experience with database administration (MySQL, PostgreSQL)
Familiarity with ticketing systems and user support workflows
Certification in relevant technologies (Red Hat, CompTIA, vendor-specific)
Job ID: 131144261