
HPC System Administrator
Job Summary
We are seeking an experienced High-Performance Computing (HPC) System Administrator to manage, maintain, and optimize large-scale HPC clusters and infrastructure. This role focuses on ensuring reliable system operations, implementing robust monitoring solutions, managing user environments, and maintaining high availability of compute resources for research and production workloads.
Key Responsibilities
· Install, configure, and maintain HPC cluster hardware and software components
· Manage job scheduling systems (SLURM, PBS, LSF) and optimize queue configurations
· Monitor system performance, resource utilization, and cluster health
· Administer user accounts, permissions, and resource allocations across compute nodes
· Deploy and maintain software stacks, compilers, libraries, and scientific applications
· Implement and maintain backup strategies and disaster recovery procedures
· Troubleshoot hardware failures, network issues, and software conflicts
· Apply regular system updates and security patches, and schedule maintenance windows
· Manage storage systems including parallel file systems (Lustre, GPFS, BeeGFS)
· Coordinate with vendors for hardware support and warranty services
· Create and maintain system documentation and operational procedures
Required Qualifications
· Bachelor's degree in Computer Science, Information Technology, or a related field
· 6+ years of experience administering Linux-based HPC systems
· Strong knowledge of Linux system administration (RHEL, CentOS, Ubuntu)
· Experience with job scheduling systems (SLURM preferred)
· Proficiency in shell scripting (Bash) and system automation
· Knowledge of networking concepts including InfiniBand and Ethernet fabrics
· Experience with configuration management tools (Ansible, Puppet, Chef)
· Understanding of parallel file systems and storage technologies
· Familiarity with HPC interconnects and high-speed networking
· Experience with system monitoring tools (Nagios, Zabbix, Ganglia)
Preferred Skills
· Experience with container technologies (Singularity, Docker) in HPC environments
· Knowledge of virtualization technologies (KVM, VMware)
· Familiarity with cloud computing platforms and hybrid cloud deployments
· Experience with GPU computing and CUDA environments
· Understanding of MPI, OpenMP, and other parallel programming models
· Knowledge of security best practices for multi-user HPC environments
· Experience with database administration (MySQL, PostgreSQL)
· Familiarity with ticketing systems and user support workflows
· Certification in relevant technologies (Red Hat, CompTIA, vendor-specific)
Job ID: 131144261