Distributed Systems Engineer

5-15 Years
  • Posted a month ago

Job Description

Cluster Management & Distributed Computing Engineer

Job Summary:

We are seeking a Cluster Management & Distributed Computing Engineer to work on large‑scale, multi‑node systems. The role involves developing and maintaining cluster management software, distributed device discovery, and virtualized multi‑tenant platforms. The ideal candidate will have strong expertise in Linux systems, distributed computing concepts, and low‑level debugging.

Key Responsibilities:

  • Design, develop, and optimize cluster management software for large‑scale distributed systems
  • Work with distributed device discovery and coordination technologies such as PAXOS, NCCL, RCCL, or similar
  • Develop and maintain multi‑tenant, virtualized distributed computing environments
  • Work on the Linux network protocol stack, network interfaces, and system‑level integration
  • Collaborate closely with hardware, firmware, and platform teams for system bring‑up and optimization
  • Perform hardware/software debugging and root‑cause analysis on complex distributed systems

Required Skills & Qualifications:

  • Strong experience with the Linux network protocol stack, networking, and interface handling
  • Hands‑on experience in distributed computing systems and cluster‑based architectures
  • Proficiency in C/C++ programming in Linux environments
  • Experience building or supporting multi‑tenant virtualized systems
  • Working knowledge of Python and Shell scripting for automation and tooling
  • Expertise in HW/SW debugging tools such as T32, JTAG, and other low‑level debugging tools

Good to Have:

  • Experience with GPU or accelerator cluster environments
  • Exposure to control plane, resource management, or scheduler frameworks
  • Familiarity with high‑performance computing (HPC) or AI/ML clusters

Keywords / Core Skills:

Cluster Management, Distributed Computing, Linux Networking, Device Discovery, PAXOS, NCCL, RCCL, C++, Linux OS, Virtualization, Multi‑tenant Systems, Python, Shell Scripting, T32, JTAG, HW/SW Debugging

More Info

Open to candidates from: Indian

Job ID: 145554361
