Overview
In Microsoft's CoreAI division, the Azure SRE Agent Platform team builds and runs AI agents as a service that help Microsoft customers detect, diagnose, and mitigate production issues across their services and workloads running on Microsoft platforms. Think of these agents as virtual SRE teammates that continuously watch your systems, investigate problems, and recommend or perform fixes, with a focus on quality, safety, security, enterprise scale, and real-world impact.
Our work spans the full lifecycle of agentic systems in production. We design and improve the core capabilities that shape agent behavior, including tool design, planning and execution loops, orchestration, evaluation, and safety guardrails. We build the operational foundations that make those systems dependable in practice, including observability, progressive delivery, reliability engineering, and live-site learning. And we build the best user experience so that customers can use these agents seamlessly from any device.
We are looking for a full-stack Software Engineer II to help build this next generation of agentic systems for cloud operations. This role is for engineers who care deeply about product quality, end-to-end ownership, and the details that separate an exciting prototype from a system people trust during critical moments.
Engineers on our team operate with high autonomy in a highly agile environment: short cycles, thin slices, feature flags, progressive delivery, and constant learning. We are looking for teammates with a strong owner's mindset and a strong bias for action: engineers who take ownership of ambiguous problems, adopt modern research and engineering patterns and practices, move quickly, learn from production, and continuously raise the quality bar as they ship.
Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees, we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day, we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
- Take ownership of important areas of the Azure SRE Agent Platform, including agent capabilities, orchestration, evaluation, user experiences across different form factors, and supporting platform services
- Build and iterate on agentic systems, including tools, planning and execution loops, evaluations, and safety mechanisms
- Design and ship reliable capabilities that improve incident detection, diagnosis, mitigation, and operational learning
- Use telemetry, experiments, evaluations, and user feedback to guide iteration and investment
- Contribute to resilient, observable systems that operate safely and effectively in production
- Partner closely with engineers, SREs, and product counterparts to turn ambiguous problems into high-quality shipped solutions
- Participate in debugging, live-site learning, and post-incident hardening to continuously improve system quality
- Contribute to architecture, engineering standards, and development practices across the team
Qualifications
Required Qualifications
- Bachelor's or Master's degree in Computer Science, or equivalent practical experience.
- 4+ years of experience building production software using one or more modern programming languages such as C#, C++, Go, Java or Python.
- Strong understanding of generative AI and software engineering fundamentals, data structures, and problem-solving.
- Ability to learn new technologies quickly and adapt to deliver customer and business impact.
Other Requirements
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.
Preferred Qualifications
- Hands-on experience building and operating LLM-powered agentic systems in production, with direct ownership of quality, reliability, and iteration
- 3+ years of experience building and operating cloud platforms or distributed services, with depth in service architecture, deployment, and observability
- Strong product mindset with a track record of owning ambiguous problem spaces and driving them to high-quality outcomes
- Solid engineering fundamentals, including systems design, performance, and debugging in complex production environments
- Track record of designing, running, and optimizing evaluations for agentic systems, including tools, prompts, and agent loops
- Expertise with Kubernetes, container orchestration, or cloud-native infrastructure is a strong plus
- Experience contributing to or leading open-source projects at scale is a plus
This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.
Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.