  • Posted 3 days ago

Job Description

Job Scope:

Build and evolve Kubernetes as a core AI infrastructure platform.

Extend Kubernetes, not just operate it.

Design GPU-aware scheduling, isolation, and lifecycle management.

Build reliable, multi-tenant AI clusters that do not break under extreme load.

Total/Relevant Experience:

6+ years of experience

Key Responsibilities:

1. Kubernetes Platform Architecture

  • Design and evolve Kubernetes clusters optimized for:
      • GPU-heavy workloads
      • multi-node, gang-scheduled training jobs
      • long-running and high-throughput inference
  • Own control-plane architecture:
      • etcd sizing and tuning
      • API server scalability
      • scheduler performance under high churn
  • Define reference cluster architectures for:
      • dedicated training clusters
      • shared multi-tenant clusters

2. GPU-Aware Scheduling & Workload Semantics

  • Build or extend scheduling mechanisms for:
      • GPU topology awareness
      • NUMA and locality sensitivity
      • anti-affinity for noisy neighbors
  • Integrate and deeply understand:
      • NVIDIA GPU Operator
      • device plugins
      • MIG / vGPU strategies (where applicable)
  • Ensure Kubernetes scheduling decisions align with real ML workload behavior, not just resource requests.

3. Platform Extensions & Controllers

  • Develop custom controllers/operators to:
      • manage cluster lifecycle
      • enforce policy and quotas
      • automate remediation (node drain, GPU quarantine, rescheduling)
  • Design internal APIs that abstract:
      • complex GPU and networking configurations
      • cluster upgrades and maintenance workflows

4. Multi-Tenancy, Isolation & Security

  • Design strong tenant isolation using:
      • namespaces, RBAC, admission controllers
      • network policies (CNI-level enforcement)
      • GPU and node-level isolation strategies
  • Work with security engineers to:
      • enforce least privilege
      • support enterprise compliance requirements
      • ensure auditability of platform actions

5. Observability, Reliability & Debuggability

  • Define observability standards for:
      • control-plane health
      • scheduling latency
      • GPU and node lifecycle events
  • Expose clear signals to SRE and operations teams.
  • Ensure every platform action is traceable, debuggable, and auditable.

Must-Have Skills:

  • Deep Kubernetes internals (scheduler, etcd, control plane)
  • Go-based controller development
  • GPU operators and device plugins
  • Distributed systems fundamentals

Good-to-Have Skills:

  • Experience with multi-node GPU environments
  • Hands-on experience with distributed training frameworks
  • Working knowledge of the NVIDIA ecosystem (TensorRT, Triton, NeMo)
  • Experience deploying and operating AI models at scale on Kubernetes clusters
  • Familiarity with Slurm or other workload schedulers

Qualification Criteria:

  • B.E/B.Tech or any relevant degree

Job ID: 145596809
