Overview
Deep learning and generative models are reshaping how people discover and engage with ads. Microsoft Ads runs large-scale recommender systems that serve billions of requests under tight latency, cost, and reliability constraints.
Delivering relevant ads efficiently at this scale requires innovation across the full stack: models, kernels, serving systems, and GPU/accelerator infrastructure. We are looking for a Senior Applied Scientist with strong foundations in systems and machine learning, and with experience in one or more of the following areas:
- Large-scale inference and serving architectures
- Ads retrieval, ranking, and recommendation models optimized for online performance
- GPU / accelerator programming and kernel optimization
Success in this role is measured primarily by latency, cost, and revenue in production; the role also helps shape the technical direction of Ads recommendation and inference.
Responsibilities
- Design and optimize end-to-end Ads inference models and workflows for retrieval and ranking that meet strict p99 latency and throughput goals.
- Invent and implement efficiency techniques such as dynamic batching, routing, scheduling, caching, sequence packing, quantization, and speculative decoding to improve utilization and tail latency.
- Develop and tune GPU kernels and operators (e.g., kernel fusion, memory-aware layouts, sparsity).
- Use profiling and diagnostic tools to analyze GPU utilization, memory bandwidth, and kernel performance.
- Design and evolve serving architectures for multi-tenant workloads, including policies for placement, parallelism, autoscaling, and safe rollout under real-world SLOs.
- Build and optimize caching layers and KV-cache management (feature/result caches, request deduplication, paging/offload) to improve both latency and efficiency.
- Co-design model architectures that are inference-friendly while preserving or improving quality metrics.
Qualifications
- Bachelor's or Master's degree in Computer Science, Electrical/Computer Engineering, or a related field, with 6+ years of related experience.
- Strong programming skills in C++ or Python (at least one is required; both are a plus).
- Hands-on experience in one or more of the following:
  - Implementing and deploying deep learning models for online inference.
  - Building and operating latency-sensitive online services at scale.
  - GPU/accelerator programming and performance optimization.
- Experience with deep learning frameworks such as PyTorch, TensorFlow, or JAX.
- Ability to design experiments, analyze results, and make data-driven decisions in complex systems.
- Strong communication and collaboration skills, with experience working across ML, systems, and product or business stakeholders.
Preferred Qualifications
- 3+ years of experience in kernel programming and inference optimization.
- Experience with inference serving frameworks (for example: vLLM, Triton Inference Server, TensorRT-LLM or similar).
- Deep understanding of inference efficiency techniques for LLM/SLM (paged KV cache, continuous batching/sequence packing, speculative decoding, quantization, adapters/LoRA, sparsity).
- Familiarity with compiler and auto-tuning techniques, automated kernel/code generation, or ML-based performance optimization.
- Background in cost/performance modeling, capacity planning, and autoscaling for large fleets of GPUs or accelerators.
- Experience in Ads, search, recommendations, or similar large-scale ranking systems where latency, cost, and relevance are jointly optimized (strong plus, but not required).
- Track record of impact via research publications, patents, or shipping large-scale systems in ML, systems, or recommendation domains.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.