Cluster Management & Distributed Computing Engineer
Job Summary:
We are seeking a Cluster Management & Distributed Computing Engineer to work on large‑scale, multi‑node systems. The role involves developing and maintaining cluster management software, distributed device discovery, and virtualized multi‑tenant platforms. The ideal candidate will have strong expertise in Linux systems, distributed computing concepts, and low‑level debugging.
Key Responsibilities:
- Design, develop, and optimize cluster management software for large‑scale distributed systems
- Work with distributed device discovery, consensus protocols such as Paxos, collective communication libraries such as NCCL and RCCL, or similar technologies
- Develop and maintain multi‑tenant, virtualized distributed computing environments
- Work on the Linux network protocol stack, network interfaces, and system‑level integration
- Collaborate closely with hardware, firmware, and platform teams for system bring‑up and optimization
- Perform hardware/software debugging and root‑cause analysis on complex distributed systems
Required Skills & Qualifications:
- Strong experience with the Linux network protocol stack and network interface handling
- Hands‑on experience in distributed computing systems and cluster‑based architectures
- Proficiency in C/C++ programming in Linux environments
- Experience building or supporting multi‑tenant virtualized systems
- Working knowledge of Python and Shell scripting for automation and tooling
- Expertise in HW/SW debugging tools such as T32, JTAG, and other low‑level debugging tools
Good to Have:
- Experience with GPU or accelerator cluster environments
- Exposure to control plane, resource management, or scheduler frameworks
- Familiarity with high‑performance computing (HPC) or AI/ML clusters
Keywords / Core Skills:
Cluster Management, Distributed Computing, Linux Networking, Device Discovery, Paxos, NCCL, RCCL, C/C++, Linux OS, Virtualization, Multi‑tenant Systems, Python, Shell Scripting, T32, JTAG, HW/SW Debugging