
JD - ML Platform (AI Runtime & MLOps Stack)
Team: AI Platform Engineering
About the AI Platform
We are building a next-generation AI platform to power intelligent, AI-driven
experiences across our global marketplace. Our platform supports the full lifecycle of large-scale
foundation models—from distributed pretraining on high-performance GPU clusters to
high-throughput production inference—enabling commerce intelligence for hundreds of millions
of users worldwide.
We focus on building state-of-the-art AI runtime infrastructure leveraging vLLM and
TensorRT-LLM as pluggable inference engines behind a standardized AI runtime layer,
alongside Megatron-LM and DeepSpeed for distributed training—integrated with provisioned
throughput management, a distributed KV cache, prefill/decode disaggregation, and a robust
MLOps stack spanning experiment management, fine-tuning automation, and production
observability.
About the Role
We are looking for an experienced Software Engineer specializing in AI runtimes and MLOps to
design and operate the systems that bring the company's foundation models from research to
production. You will own the inference runtime stack, the distributed training infrastructure, and
the MLOps tooling that ties them together—enabling ML researchers and Applied Scientists to
move fast without sacrificing reliability or performance.
You will work on production LLM/VLM inference serving with vLLM and TensorRT-LLM via a
standardized AI runtime layer; implement distributed inference optimizations including
prefill/decode disaggregation, distributed KV cache management, and LLM-aware request
routing; develop large-scale distributed training pipelines using Megatron-LM and DeepSpeed
on high-performance GPU clusters; and build the MLOps stack that automates the end-to-end
model lifecycle.
Key Responsibilities
● Build and operate production AI inference runtimes using vLLM and TensorRT-LLM
behind a standardized AI runtime layer.
● Implement and optimize distributed inference architectures with prefill/decode
disaggregation.
● Design and optimize a distributed KV cache system across nodes.
● Develop and optimize large-scale distributed training pipelines using Megatron-LM and
DeepSpeed.
● Profile and resolve distributed training bottlenecks using NVIDIA and PyTorch
performance tools.
● Implement inference optimizations such as quantization, speculative decoding,
continuous batching, and FlashAttention.
● Build and operate an Inference Request Router for authentication, routing, and
throughput management.
● Develop and operate multi-LoRA adapter hosting with hot-swap routing and lifecycle
management.
● Build and maintain the MLOps stack, including experiment tracking, model versioning,
automated evaluation, and CI/CD.
● Develop and operate fine-tuning pipelines such as SFT, RLHF, DPO, and LoRA.
● Build fault-tolerant distributed training infrastructure with checkpointing, failure detection,
and recovery.
● Build regression testing and benchmarking systems to improve training and inference
performance.
What We're Looking For
● Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
● 5+ years of experience building distributed systems or ML platform infrastructure.
● Strong programming skills in Python and/or C++.
● Familiarity with CUDA or Triton kernel development is a plus.
● Hands-on experience deploying and operating LLM inference engines such as vLLM,
TensorRT-LLM, NVIDIA Triton, or SGLang.
● Deep understanding of LLM inference internals, including KV cache management,
PagedAttention, continuous batching, and request routing.
● Experience building or optimizing distributed training pipelines using Megatron-LM,
DeepSpeed, FSDP, or equivalent frameworks.
● Strong understanding of model parallelism strategies and their trade-offs.
● Proficiency with NVIDIA tooling such as NCCL, DCGM, Nsight Systems, and PyTorch
Profiler.
● Experience implementing inference optimizations including quantization, speculative
decoding, FlashAttention, and multi-LoRA serving.
● Experience building MLOps workflows including experiment tracking, model registry,
evaluation automation, and CI/CD.
● Experience developing fine-tuning pipelines such as SFT, RLHF, DPO, or LoRA at scale.
● Strong expertise in Kubernetes and containerized GPU environments.
● Strong debugging and performance optimization skills across CUDA runtimes,
distributed training, and ML serving systems.
Job ID: 147185393