Overview
CoreAI is at the forefront of Microsoft's mission to redefine how software is built and experienced. We are responsible for building the foundational platforms, services, programming models, and developer experiences that power the next generation of applications using Generative AI. Our work enables developers and enterprises to harness the full potential of AI to create intelligent, adaptive, and transformative software.
The AI Core Infrastructure team, part of the AI Platform team in the CoreAI organization, is responsible for large-scale, highly reliable, and efficient GPU management infrastructure and for the inference and training platforms that power all of Microsoft's AI workloads, such as M365 Copilot, GitHub Copilot, Microsoft Copilot, AI Foundry's Inference and Fine-Tuning offerings of OAI and OSS models, and many more.
As a Principal Software Engineer on the AI Core Infrastructure team, you will work on cutting-edge infrastructure and tools to design, build, and support large-scale training and inference platforms built on top of the latest generation of NVIDIA and AMD GPUs in Azure and Microsoft partner clouds, on some of the world's largest AI supercomputers.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.
#AIPlatform, #AIInfra, #AIInfrastructure, #GPUClusters, #AIJobs, #AzureAI, #AzureCloud
Responsibilities
As a Principal Engineer on the infrastructure fleet management team, your responsibilities include:
- Architect, design, and develop core AI Infrastructure services written in Go, Rust, Python, C++, and C# and deployed on large-scale Kubernetes clusters to support pre-training and post-training of state-of-the-art LLMs, SLMs, multimodal, and code-specific models.
- Design, build, and manage compute, storage, and networking subsystems on large-scale GPU clusters to support LLM training, customization, and inference workloads.
- Enhance systems and applications to deliver high stability, low latency, strong security, and maintainability in large-scale complex training environments in Azure and in partner clouds.
- Provide operational support, technical leadership, and vision while contributing to the deployment, monitoring, and continuous improvement of engineering systems and practices.
- Support development and troubleshooting from the frontline, resolving complex issues impacting large-scale services.
- Collaborate closely with engineers and data scientists within the team, internal Microsoft Research teams, and external enterprises to build better solutions together.
- Provide vision, expertise, and technical leadership to other team members.
- Help grow talent in these areas.
Qualifications
Required Qualifications:
- Bachelor's or master's degree in computer science or a related field.
- 10+ years designing, developing, and shipping high-quality software.
- 4+ years of experience with distributed systems and cloud-based infrastructure.
- 2+ years of experience with DevOps practices (CI/CD, automated testing, deployment, etc.).
- Passionate and self-motivated, with a strong ability to self-learn, enter new domains, and manage through uncertainty in an innovative team environment.
Preferred/Additional Qualifications:
- 10+ years of software development experience in C#, C++, Python, or similar languages.
- 6+ years of experience with containerization tools (e.g., Docker, Kubernetes).
- Knowledge of and hands-on experience with production ML systems, large-scale training infrastructure, NCCL, and CUDA libraries and tools.
Other Requirements:
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.