SISL Global

Linux System Administrator

  • Posted 2 days ago

Job Description

HPC Engineer (HPC with SLURM, CPU & GPU Clusters)

Position Overview

We are seeking a skilled HPC Engineer to design, deploy, manage, and optimize our on-premises High Performance Computing (HPC) environment, which consists of SLURM-managed CPU and GPU clusters. The ideal candidate will have a strong understanding of HPC architecture, Linux systems, job scheduling, and cluster operations. Experience with parallel file systems and enterprise storage solutions such as WekaFS or Scality is preferred but not required.

Key Responsibilities

1. HPC Infrastructure & Operations

Manage day-to-day operations of on-premises HPC clusters, including CPU and GPU compute nodes.

Monitor cluster health, performance, and utilization, ensuring high availability and efficiency.

Implement and maintain best practices for HPC operations, user management, and resource administration.

Troubleshoot cluster-related issues, including networking, node failures, job failures, and performance bottlenecks.

Support users with job submission, resource usage, and HPC workflows.

2. SLURM Workload Manager (Mandatory)

Install, configure, and manage the SLURM workload manager across multiple clusters.

Handle queue creation, partition configuration, node allocation, fair-share policies, and job prioritization.

Perform SLURM upgrades, migrations, and service maintenance with hands-on expertise.

Work with SLURM APIs and integrations to support automation and custom workflows.

Optimize scheduling policies for mixed CPU/GPU workloads.

3. Linux System Administration

Manage Linux-based compute nodes, head nodes, and administration servers.

Perform OS updates, package installations, security patching, and system tuning.

Knowledge of scripting (Bash, Python) for automation and HPC tooling workflows.

4. Parallel Computing & Cluster Architecture

Understanding of parallel computing concepts: MPI, OpenMP, distributed execution.

Familiarity with HPC building blocks: interconnect networks (InfiniBand/100G), storage tiers, resource managers, monitoring tools.

Ability to analyze and troubleshoot performance issues in parallel workloads.

5. Storage (Optional but Preferred)

A. WEKA (WekaFS) (Optional)

Knowledge of parallel file systems and performance tuning.

Diagnose and resolve issues related to WekaFS with minimal downtime.

Provide guidance to internal teams on WekaFS usage and best practices.

Stay updated with Weka ecosystem advancements and propose improvements.

B. Scality (Optional)

Troubleshoot and maintain Scality RING and ARTESCA environments.

Monitor, tune, and optimize Scality-based storage for high availability and reliability.

Create and maintain documentation for Scality configuration and SOPs.

Recommend performance improvements based on new Scality enhancements.

Qualifications & Skills

Mandatory Skills

Experience managing HPC clusters with SLURM in production environments.

Good understanding of Linux (RHEL) administration.

Knowledge of parallel computing concepts and HPC architecture.

Strong troubleshooting and diagnostic skills.

Ability to work in complex, multi-node distributed environments.

Preferred/Optional Skills

Experience with WekaFS, Scality RING, or other parallel/distributed file systems.

Exposure to GPU computing (CUDA, NVIDIA drivers, GPU scheduling).

Familiarity with monitoring tools (Grafana, Prometheus).

Job ID: 143248023