
Computer Scientist
Location: Noida
Experience: 6-9 Years
Team: AI Platform Engineering
Role Overview
We are looking for an experienced Infrastructure Developer (6-9 years) to help design, build, and scale the platform that powers our most demanding ML training workloads. This is a hands-on engineering role where you will write production-grade code, drive meaningful technical initiatives, and contribute to the reliability of infrastructure that thousands of GPU hours depend on every day.
You bring strong Kubernetes skills, solid networking fundamentals, a developer's mindset, and the ability to own projects end-to-end with limited supervision. You have operated systems at significant scale and are ready to step up into broader technical leadership.
About the Platform
You will be working on a cutting-edge platform designed to train and serve large-scale machine learning models. The platform supports everything from small-scale experimentation to large distributed training jobs running on GPU clusters with thousands of accelerators. It provides ML engineers and researchers with the tools to onboard, monitor, and scale their workloads - whether a lightweight prototype or a production-grade deep learning model powering real-world applications.
Key platform capabilities:
. Dynamic GPU orchestration: Kubernetes with custom schedulers and resource topology awareness.
. Training & inference workflows: end-to-end pipeline support from data ingestion through model serving.
. Observability & cost tracking: full-stack visibility across compute, network, and storage layers.
. Self-service developer tooling: enabling high-velocity experimentation without platform bottlenecks.
. Multi-cloud infrastructure: primarily AWS, with Azure/GCP expansion underway.
Your contributions will directly influence the reliability, scalability, and efficiency of this platform - and the speed at which AI teams can innovate.
What You'll Do
. Build for scale: Design and improve Kubernetes-native infrastructure that runs distributed GPU training jobs reliably and efficiently. You will own significant components and drive their evolution.
. Lead focused initiatives: Own meaningful projects end-to-end - write design docs, gather input from stakeholders, and deliver under realistic timelines, often collaborating with engineers across time zones.
. Codify infrastructure: Define and ship cloud infrastructure through IaC (Terraform/Pulumi). Apply the same rigor, testing, and review discipline to infra changes as to application code.
. Strengthen observability: Contribute to and extend deep observability stacks - metrics, distributed tracing, log aggregation, SLO/SLI frameworks - that surface problems before they become incidents.
. Write production code: Build automation, internal tooling, operators, and platform services in Go, Python, or Rust. This is not a YAML-only role.
. Own reliability: Participate in incident response, post-mortems, and reliability reviews. Drive systemic fixes, not just workarounds. Be a strong contributor to on-call culture.
. Solve hard networking problems: Debug and resolve complex cluster networking issues - CNI, BGP, service mesh, DNS at scale, east-west traffic, throughput tuning.
. Mentor and grow: Raise the technical bar through code reviews, design feedback, and knowledge sharing with peers and more junior engineers.
What You Bring
Core Requirements
Kubernetes & GPU Infrastructure
. 6-9 years in SRE, platform engineering, or infrastructure roles
. Strong working knowledge of Kubernetes internals: scheduler, kubelet, CRDs, operators, admission controllers
. Hands-on experience running GPU/accelerator training workloads in production
. Familiarity with multi-cluster management and workload placement strategies
. Helm, Kustomize, GitOps (Flux/ArgoCD) - practical experience and good judgment on when to use them
Cloud & Infrastructure as Code
. Solid hands-on AWS experience (VPC, EKS, EC2, S3, IAM; TGW a plus)
. Production experience with Terraform or Pulumi - modular and tested
. CI/CD for infrastructure: drift detection, plan gating, rollback strategies
. Working understanding of cost optimization, reserved capacity, and spot instance management
Observability
. Prometheus, Grafana, AlertManager - production experience, not just lab setups
. Exposure to distributed tracing: OpenTelemetry, Jaeger, or Tempo
. Log aggregation: Loki, Elasticsearch/OpenSearch
. Comfort with SLO/SLI design, error budgets, and multi-tier alerting
Networking Fundamentals
. Strong TCP/IP, DNS, TLS, HTTP/2, gRPC fundamentals
. Practical experience with CNI plugins: Cilium, Calico, or Flannel - and their trade-offs
. Familiarity with service mesh (Istio/Linkerd), ingress controllers, and API gateways
. Ability to debug under load: packet captures, eBPF traces, kernel counters
Coding & System Design
. Production-quality code in Go, Python, or Rust - you ship, not just script
. Solid grasp of distributed systems fundamentals: consistency, availability, failure modes
. Experience writing Kubernetes operators or working with controller-runtime patterns
. Engaged code reviewer - thoughtful, constructive, and consistent
. Clear technical writer: design docs, ADRs, runbooks that others can actually use
Collaboration & Ownership
. Has delivered meaningful, cross-functional projects from design to production
. Comfortable with ambiguity - can break down a problem and make progress without a perfect spec
. Experience working async across distributed teams and time zones
. Strong communicator - can explain infra trade-offs clearly to peers and partner teams
. Self-driven - identifies problems, proposes solutions, and follows through to outcomes
Bonus Points
. Azure / GCP hands-on experience
. Familiarity with ML training pipeline internals
. eBPF-based observability or networking
. Chaos engineering or game day participation
. Open-source infrastructure contributions
. Security, compliance, or audit exposure
Why This Role
. You will write software, not just YAML. This is a coding role as much as it is an operations role.
. You will work on real AI infrastructure challenges - the kind that research papers get written about, not buzzword slide decks.
. You will see your impact across developer productivity, platform scalability, and service reliability.
. You will grow. This role gives you room to step into broader technical leadership over time.
. You will join a team that values code quality, systems thinking, blameless culture, and genuine ownership.
. You will work on systems at a scale most engineers never get to touch - thousands of GPUs, petabytes of data movement, milliseconds of scheduling latency that matter.
If you have built and operated real infrastructure, care about doing it well, and are ready to take on broader scope - we want to talk.
About Adobe
Adobe empowers everyone to create through innovative platforms and tools that unleash creativity, productivity and personalized customer experiences. Adobe's industry-leading offerings including Adobe Acrobat Studio, Adobe Express, Adobe Firefly, Creative Cloud, Adobe Experience Platform, Adobe Experience Manager, and GenStudio enable people and businesses to turn ideas into impact, powered by AI and driven by human ingenuity.
Our 30,000+ employees worldwide are creating the future and raising the bar as we drive the next decade of growth. We're on a mission to hire the very best and believe in creating a company culture where all employees are empowered to make an impact. At Adobe, we believe that great ideas can come from anywhere in the organization. The next big idea could be yours.
Let's Adobe together
At Adobe, we believe in creating a company culture where all employees are empowered to make an impact. Learn more about Adobe life and how you can help us advance our mission of empowering everyone to create.
Adobe is proud to be an equal opportunity employer. We do not discriminate based on gender, race or color, ethnicity or national origin, age, disability, religion, sexual orientation, gender identity or expression, veteran status, or any other protected characteristic.
Adobe aims to make our Careers website and recruiting process accessible to any and all users. If you have a disability or special need that requires accommodation to navigate our website or complete the application process, email us or call +1 408-536-3015.
AI Use Guidelines for Interviews:
Our interviews are designed to reflect your own skills and thinking. The use of AI or recording tools during live interviews is not permitted unless explicitly invited by the interviewer or approved in advance as part of a reasonable accommodation. If these tools are used inappropriately or in a way that misrepresents your work, your application may not move forward in the process.
At Adobe, we empower employees to innovate with AI - and we look for candidates eager to do the same. As part of the hiring experience, we provide clear guidance on where AI is encouraged during the process and where it's restricted during live interviews.
Job ID: 147772983
Skills:
NoSQL, TensorFlow, NumPy, PyTorch, Docker, Flask, Python, AWS, SQL, Jenkins, Pandas, GCP, MLOps, FastAPI, Azure, Kubernetes, Hugging Face, Pinecone, LangGraph, GitHub Actions, CrewAI, LangChain, Generative AI, FAISS, LoRA, Weaviate, QLoRA, LlamaIndex