Infobell IT

HPC System Administrator (6+ years)

  • Posted a month ago

Job Description

Job Summary

We are seeking an experienced High-Performance Computing (HPC) System Administrator to manage, maintain, and optimize large-scale HPC clusters and infrastructure. This role focuses on ensuring reliable system operations, implementing robust monitoring solutions, managing user environments, and maintaining high availability of compute resources for research and production workloads.

Key Responsibilities

Install, configure, and maintain HPC cluster hardware and software components

Manage job scheduling systems (SLURM, PBS, LSF) and optimize queue configurations

Monitor system performance, resource utilization, and cluster health using monitoring tools

Administer user accounts, permissions, and resource allocations across compute nodes

Deploy and maintain software stacks, compilers, libraries, and scientific applications

Implement and maintain backup strategies and disaster recovery procedures

Troubleshoot hardware failures, network issues, and software conflicts

Apply regular system updates and security patches, and plan and run maintenance windows

Manage storage systems including parallel file systems (Lustre, GPFS, BeeGFS)

Coordinate with vendors for hardware support and warranty services

Create and maintain system documentation and operational procedures

Required Qualifications

Bachelor's degree in Computer Science, Information Technology, or related field

6+ years of experience administering Linux-based HPC systems

Strong knowledge of Linux system administration (RHEL, CentOS, Ubuntu)

Experience with job scheduling systems (SLURM preferred)

Proficiency in shell scripting (Bash) and system automation

Knowledge of networking concepts including InfiniBand and Ethernet fabrics

Experience with configuration management tools (Ansible, Puppet, Chef)

Understanding of parallel file systems and storage technologies

Familiarity with HPC interconnects and high-speed networking

Experience with system monitoring tools (Nagios, Zabbix, Ganglia)

Preferred Skills

Experience with container technologies (Singularity, Docker) in HPC environments

Knowledge of virtualization technologies (KVM, VMware)

Familiarity with cloud computing platforms and hybrid cloud deployments

Experience with GPU computing and CUDA environments

Understanding of MPI, OpenMP, and other parallel programming models

Knowledge of security best practices for multi-user HPC environments

Experience with database administration (MySQL, PostgreSQL)

Familiarity with ticketing systems and user support workflows

Certification in relevant technologies (Red Hat, CompTIA, vendor-specific)

Job ID: 131144261