AI Runtime and MLOps Engineer


Job Description

JD - ML Platform (AI Runtime & MLOps Stack)

Team: AI Platform Engineering

About the AI Platform

We are building a next-generation AI platform to power intelligent, AI-driven experiences across our global marketplace. Our platform supports the full lifecycle of large-scale foundation models, from distributed pretraining on high-performance GPU clusters to high-throughput production inference, enabling commerce intelligence for hundreds of millions of users worldwide.

We focus on building state-of-the-art AI runtime infrastructure, leveraging vLLM and TensorRT-LLM as pluggable inference engines behind a standardized AI runtime layer alongside Megatron-LM and DeepSpeed for distributed training, integrated with provisioned throughput management, a distributed KV cache, prefill/decode disaggregation, and a robust MLOps stack spanning experiment management, fine-tuning automation, and production observability.
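To make the "pluggable engines behind a standardized runtime layer" idea concrete, here is a minimal Python sketch under stated assumptions: the InferenceEngine protocol, VLLMEngine adapter, and serve helper are hypothetical names invented for illustration; only the vllm imports are real library entry points.

```python
# Minimal sketch of a pluggable-engine abstraction (hypothetical interface).
from typing import Protocol


class InferenceEngine(Protocol):
    """Hypothetical runtime-layer interface; engines plug in behind it."""
    def generate(self, prompt: str, max_tokens: int) -> str: ...


class VLLMEngine:
    """Adapter backing the interface with vLLM (real library entry points)."""
    def __init__(self, model: str) -> None:
        from vllm import LLM, SamplingParams
        self._llm = LLM(model=model)
        self._sampling_params_cls = SamplingParams

    def generate(self, prompt: str, max_tokens: int) -> str:
        params = self._sampling_params_cls(max_tokens=max_tokens)
        outputs = self._llm.generate([prompt], params)
        return outputs[0].outputs[0].text


def serve(engine: InferenceEngine, prompt: str) -> str:
    # Callers depend only on the interface, so vLLM and TensorRT-LLM
    # backends can be swapped without touching serving code.
    return engine.generate(prompt, max_tokens=128)
```

A TensorRT-LLM adapter would implement the same two-method surface, which is what keeps engines interchangeable behind the runtime layer.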

About the Role

We are looking for an experienced Software Engineer specializing in AI runtimes and MLOps to design and operate the systems that bring the company's foundation models from research to production. You will own the inference runtime stack, the distributed training infrastructure, and the MLOps tooling that ties them together, enabling ML researchers and Applied Scientists to move fast without sacrificing reliability or performance.

You will work on production LLM/VLM inference serving with vLLM and TensorRT-LLM via a standardized AI runtime layer; implement distributed inference optimizations, including prefill/decode disaggregation, distributed KV cache management, and LLM-aware request routing; develop large-scale distributed training pipelines using Megatron-LM and DeepSpeed on high-performance GPU clusters; and build the MLOps stack that automates the end-to-end model lifecycle.
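As a rough illustration of the experiment-tracking slice of such an MLOps stack, here is a short sketch using MLflow as a stand-in tracker (the posting does not name a specific tool); the experiment name, run name, and logged values are all illustrative.

```python
# Sketch: logging a fine-tuning run to an experiment tracker (MLflow as stand-in).
import mlflow

mlflow.set_experiment("llm-finetune-sft")  # hypothetical experiment name

with mlflow.start_run(run_name="sft-baseline"):
    # Record the knobs that make runs comparable later.
    mlflow.log_params({"base_model": "llama-3.1-8b", "lr": 2e-5, "epochs": 3})
    # Stand-in training loop; a real pipeline would log from the trainer.
    for step, loss in enumerate([2.1, 1.7, 1.4]):
        mlflow.log_metric("train_loss", loss, step=step)
    # Automated evaluation would publish its scores to the same run.
    mlflow.log_metric("eval_win_rate", 0.62)  # illustrative metric
```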

Key Responsibilities

● Build and operate production AI inference runtimes using vLLM and TensorRT-LLM behind a standardized AI runtime layer.
● Implement and optimize distributed inference architectures with prefill/decode disaggregation.
● Design and optimize a distributed KV cache system across nodes.
● Develop and optimize large-scale distributed training pipelines using Megatron-LM and DeepSpeed.
● Profile and resolve distributed training bottlenecks using NVIDIA and PyTorch performance tools.
● Implement inference optimizations such as quantization, speculative decoding, continuous batching, and FlashAttention.
● Build and operate an Inference Request Router for authentication, routing, and throughput management (see the routing sketch after this list).
● Develop and operate multi-LoRA adapter hosting with hot-swap routing and lifecycle management.
● Build and maintain the MLOps stack, including experiment tracking, model versioning, automated evaluation, and CI/CD.
● Develop and operate fine-tuning pipelines such as SFT, RLHF, DPO, and LoRA.
● Build fault-tolerant distributed training infrastructure with checkpointing, failure detection, and recovery.
● Build regression testing and benchmarking systems to improve training and inference performance.
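As a sketch of what "LLM-aware request routing" can mean in practice, the snippet below routes each request to the replica most likely to still hold its prompt prefix in the KV cache, falling back to the least-loaded replica; every name in it (Replica, route) is a hypothetical illustration, not a real library API.

```python
# Hypothetical LLM-aware router: prefer KV-cache affinity, then least load.
from dataclasses import dataclass, field


@dataclass
class Replica:
    url: str
    inflight: int = 0  # requests currently being served
    cached_prefixes: set[str] = field(default_factory=set)


def _prefix_key(prompt: str, n: int = 64) -> str:
    # Coarse prefix bucket: requests sharing a long prompt prefix can
    # reuse each other's KV cache blocks on the same replica.
    return prompt[:n]


def route(prompt: str, replicas: list[Replica]) -> Replica:
    key = _prefix_key(prompt)
    # 1) Cache affinity: a replica that recently served this prefix is warm.
    warm = [r for r in replicas if key in r.cached_prefixes]
    pool = warm or replicas
    # 2) Load balancing within the chosen pool.
    target = min(pool, key=lambda r: r.inflight)
    target.cached_prefixes.add(key)
    target.inflight += 1
    return target
```

A production router would also cover the authentication and throughput-management concerns listed above; this sketch shows only the placement decision.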

What We're Looking For

● Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
● 5+ years of experience building distributed systems or ML platform infrastructure.
● Strong programming skills in Python and/or C++.
● Familiarity with CUDA or Triton kernel development is a plus.
● Hands-on experience deploying and operating LLM inference engines such as vLLM, TensorRT-LLM, NVIDIA Triton, or SGLang.
● Deep understanding of LLM inference internals, including KV cache management, PagedAttention, continuous batching, and request routing.
● Experience building or optimizing distributed training pipelines using Megatron-LM, DeepSpeed, FSDP, or equivalent frameworks.
● Strong understanding of model parallelism strategies and their trade-offs.
● Proficiency with NVIDIA tooling such as NCCL, DCGM, Nsight Systems, and PyTorch Profiler.
● Experience implementing inference optimizations including quantization, speculative decoding, FlashAttention, and multi-LoRA serving (a serving sketch follows this list).
● Experience building MLOps workflows including experiment tracking, model registry, evaluation automation, and CI/CD.
● Experience developing fine-tuning pipelines such as SFT, RLHF, DPO, or LoRA at scale.
● Strong expertise in Kubernetes and containerized GPU environments.
● Strong debugging and performance optimization skills across CUDA runtimes, distributed training, and ML serving systems.
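For the multi-LoRA serving point above, here is a minimal example using vLLM's LoRA support; the LLM, SamplingParams, and LoRARequest entry points are real vLLM APIs (exact signatures vary by version), while the model name, adapter names, and paths are illustrative.

```python
# Minimal multi-LoRA serving sketch with vLLM (adapter names/paths illustrative).
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_lora=True)
params = SamplingParams(max_tokens=64)

# Each adapter gets a name, a unique integer id, and a path to its weights.
sql_lora = LoRARequest("sql-adapter", 1, "/adapters/sql")     # hypothetical path
chat_lora = LoRARequest("chat-adapter", 2, "/adapters/chat")  # hypothetical path

# Requests on the same base model can target different adapters; vLLM applies
# the requested adapter per batch instead of loading a separate model copy.
sql_out = llm.generate(["Translate to SQL: list all orders"], params, lora_request=sql_lora)
chat_out = llm.generate(["Hi there!"], params, lora_request=chat_lora)
print(sql_out[0].outputs[0].text)
print(chat_out[0].outputs[0].text)
```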

Job ID: 147185393