Basic Qualifications:
. BS or MS degree in CS or a related engineering or science field, with 3+ years of relevant experience.
. Experience with benchmarking and with troubleshooting or optimizing system performance.
. Experience with coding, scripting, and automation.
. Background in networking.
. General Linux skills.
. Demonstrated ability to lead complex projects, independently resolve ambiguity, collaborate with stakeholders across teams, and communicate effectively.
Desired Qualifications:
. Experience working on clusters, e.g., running HPC/AI workloads, or maintaining an HPC/AI system.
. Experience troubleshooting or tuning performance on distributed systems.
. Familiarity with elements of the AI/HPC software stack, such as job schedulers (e.g., Slurm), NCCL, RCCL, MPI, or ML frameworks.
. Experience with RDMA networking, e.g., RoCE or InfiniBand.
. Experience architecting or developing solutions on a public cloud platform.
Responsibilities:
. Carry out performance studies on GPU clusters, with a focus on AI/ML workload performance, network performance, and tuning.
. Design and code solutions for performance benchmarking.
. Troubleshoot performance problems on RDMA clusters and perform cluster performance validation, including on novel systems that are not yet fully understood.
. Document new tools and procedures to a high standard.
. Write whitepapers to disseminate findings of performance studies.
. Participate in architecture design and review and in code review, and contribute to roadmap development.
. Mentor junior engineers.
. Participate in operational rotations.
Career Level - IC3