
Microsoft

Senior Applied Scientist

  • Posted 3 days ago

Job Description

Overview

Microsoft Ads powers experiences at global scale through large-scale machine learning systems that operate under strict latency, reliability, freshness, and cost constraints. As Ads expands the use of advanced ML and LLM-based systems, inference has become a core production challenge across low-latency online serving, near-real-time decisioning, and large-scale batch workflows.

We are looking for a Senior Applied Scientist / Machine Learning Engineer to optimize end-to-end inference workflows for large-scale Ads models. This role is ideal for someone who is deeply technical, hands-on, and excited to work at the intersection of machine learning and systems.

In this role, you will partner closely with applied scientists and engineers to translate model innovation into efficient, reliable, and cost-effective production systems. You will work across the inference stack, including runtime optimization, batching, scheduling, routing, caching, parallelism, observability, and resource management, with the goal of improving production impact across Ads scenarios. The role also includes supporting emerging agentic workloads that rely on multi-turn reasoning, tool use, structured generation, and long-context inference.

Responsibilities

Design and optimize end-to-end ML/LLM inference workflows across online low-latency serving, near-real-time inference, and large-scale batch inference scenarios.

Build scalable serving and execution systems for large-scale models, including scheduling, batching, routing, admission control, and resource-aware execution.

Improve inference performance and efficiency across compute, memory, storage, network, and concurrency dimensions, with strong focus on latency, throughput, reliability, and cost.

Develop and apply modern serving techniques such as continuous or dynamic batching, prefix caching, KV-cache optimization, request shaping, tail-latency reduction, and runtime-level performance tuning.
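For illustration, the core of dynamic batching is a loop that blocks for one request, then gathers more until the batch is full or a short wait window closes. This is a minimal sketch, not Microsoft's implementation; the function name and parameters are hypothetical:

```python
import queue
import time

def collect_batch(request_q, max_batch=8, max_wait_s=0.005):
    """One step of a dynamic batcher: block for the first request,
    then gather more until the batch is full or the wait window closes."""
    batch = [request_q.get()]  # block until at least one request arrives
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait window closed; ship a partial batch
        try:
            batch.append(request_q.get(timeout=remaining))
        except queue.Empty:
            break  # no straggler arrived in time
    return batch
```

Production systems such as vLLM go further with continuous batching, admitting and retiring requests between decode steps rather than per batch, but the latency/throughput tradeoff governed by `max_batch` and `max_wait_s` is the same.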

Optimize systems for key generative inference metrics such as time to first token, inter-token latency, throughput, accelerator utilization, and cost per request.
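As a sketch of how the metrics above relate, the following assumes only per-token completion timestamps; the function and field names are illustrative, not part of any specific serving stack:

```python
def measure_generation_metrics(token_times, request_start):
    """Compute basic generative-inference latency metrics from
    per-token completion timestamps (seconds).

    token_times: completion time of each generated token, ascending.
    request_start: when the request was received.
    """
    ttft = token_times[0] - request_start  # time to first token
    # inter-token latency: mean gap between consecutive tokens
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_start
    throughput = len(token_times) / total  # tokens per second
    return {"ttft": ttft, "itl": itl, "throughput_tok_s": throughput}
```

TTFT is dominated by queueing plus prefill, while inter-token latency reflects the decode loop, which is why the two are tuned with different techniques.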

Work on runtime and serving optimizations for modern inference stacks such as vLLM, TensorRT-LLM, SGLang, Triton, ONNX Runtime, and PyTorch-based serving systems.

Partner with applied scientists to productionize new models and inference patterns, including agentic workflows with tool use, structured outputs, and long-context workloads, and evaluate quality-latency-cost tradeoffs in real production scenarios.

Design and improve scheduling and resource management for heterogeneous and multi-tenant inference workloads, including GPU-aware placement, admission control, burst handling, and workload isolation.

Build strong observability and diagnostics for inference services, including bottleneck analysis, performance regression detection, and end-to-end latency and cost measurement.

Qualifications

Bachelor's or Master's degree in Computer Science, Mathematics, Software Engineering, Computer Engineering, or related technical field, and 5+ years of related experience in machine learning systems, distributed systems, inference infrastructure, or software engineering.

OR Doctorate in Computer Science, Mathematics, Software Engineering, Computer Engineering, or related technical field, and 2+ years of related experience.

Strong programming skills in Python, C++, or C#.

Hands-on experience in one or more of the following areas:

  • Large-scale ML/LLM inference serving in production
  • MLSys for model deployment, serving, or runtime optimization

Experience building or optimizing systems for online inference, batch inference, or near-real-time inference.

Strong understanding of inference bottlenecks such as batching, queuing, tail latency, KV-cache pressure, memory bandwidth limits, caching, and heterogeneous resource utilization.

Experience with one or more modern inference stacks or runtimes such as vLLM, TensorRT-LLM, SGLang, Triton, ONNX Runtime, DeepSpeed, or PyTorch inference tooling.

Experience with modern LLM inference and serving techniques, including areas such as KV-cache management, prefix caching, speculative decoding, quantization, prefill/decode disaggregation, or MoE inference optimization.
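To make the prefix-caching idea concrete: a server can reuse precomputed KV state for any cached prefix of an incoming prompt and only prefill the remainder. A toy dict-based lookup (names and data shapes are hypothetical; real systems use block-level hashing or tries) might look like:

```python
def longest_cached_prefix(prompt_tokens, cache):
    """Toy prefix-cache lookup: `cache` maps token tuples to
    precomputed KV state. Return the longest cached prefix and its
    state, so only the remaining suffix needs prefill."""
    for end in range(len(prompt_tokens), 0, -1):
        key = tuple(prompt_tokens[:end])
        if key in cache:
            return key, cache[key]
    return (), None  # no cached prefix; full prefill required
```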

Experience with production-scale model serving platforms and distributed inference systems, including multi-node or multi-tenant deployments, resource-aware scheduling, and optimization across heterogeneous workloads.

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.

Job ID: 144655945
