yotta data services private limited

Devops Engineer- K8

6-8 Years

Posted 3 days ago
Job Description

Job Scope:

Build and evolve Kubernetes as a core AI infrastructure platform. This includes:

Extending Kubernetes, not just operating it

Designing GPU-aware scheduling, isolation, and lifecycle management

Building reliable, multi-tenant AI clusters that do not break under extreme load

Total/Relevant Experience:

6+ years of experience

Key Responsibilities:

1. Kubernetes Platform Architecture

Design and evolve Kubernetes clusters optimized for:
GPU-heavy workloads
multi-node, gang-scheduled training jobs
long-running and high-throughput inference
Own control-plane architecture:
etcd sizing and tuning
API server scalability
scheduler performance under high churn
Define reference cluster architectures for:
dedicated training clusters
shared multi-tenant clusters

2. GPU-Aware Scheduling & Workload Semantics

Build or extend scheduling mechanisms for:
GPU topology awareness
NUMA and locality sensitivity
anti-affinity for noisy neighbors
Integrate and deeply understand:
NVIDIA GPU Operator
device plugins
MIG / vGPU strategies (where applicable)
Ensure Kubernetes scheduling decisions align with real ML workload behavior, not just resource requests.

3. Platform Extensions & Controllers

Develop custom controllers/operators to:
manage cluster lifecycle
enforce policy and quotas
automate remediation (node drain, GPU quarantine, rescheduling)
Design internal APIs that abstract:
complex GPU and networking configurations
cluster upgrades and maintenance workflows

4. Multi-Tenancy, Isolation & Security

Design strong tenant isolation using:
namespaces, RBAC, admission controllers
network policies (CNI-level enforcement)
GPU and node-level isolation strategies
Work with security engineers to:
enforce least privilege
support enterprise compliance requirements
ensure auditability of platform actions

5. Observability, Reliability & Debuggability

Define observability standards for:
control-plane health
scheduling latency
GPU and node lifecycle events
Expose clear signals to SRE and operations teams.
Ensure every platform action is traceable, debuggable, and auditable.

Must-Have Skills:

Deep Kubernetes internals (scheduler, etcd, control plane)
Go-based controller development
GPU operators and device plugins
Distributed systems fundamentals

Good-to-Have Skills:

Experience with multi-node GPU environments
Hands-on experience with distributed training frameworks
Working knowledge of the NVIDIA ecosystem (TensorRT, Triton, NeMo)
Experience deploying and operating AI models at scale on Kubernetes clusters
Familiarity with Slurm or other workload schedulers

Qualification Criteria:

B.E/B.Tech or any relevant degree

More Info

Function: AI Infrastructure

About Company

yotta data services private limited

Job Source: www.linkedin.com

Job ID: 145596809


Last Updated: 10-04-2026 10:26:07 PM
