CG-VAK Software & Exports Ltd.

AI Runtime Lead (LLM DevOps, PyTorch)

  • Posted 27 days ago

Job Description

Role & Responsibilities

As Lead/Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond). This is a hands-on leadership role, well suited to a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.

What You'll Do

Lead Runtime Design & Development

  • Own the core runtime architecture supporting AI training and inference at scale.
  • Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery) within our custom PyTorch stack.
  • Optimize distributed training reliability, orchestration, and job-level fault tolerance.

Drive Performance at Scale

  • Profile and enhance low-level system performance across training and inference pipelines.
  • Improve packaging, deployment, and integration of customer models in production environments.
  • Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.

Build Internal Tooling & Frameworks

  • Design and maintain libraries and services that support the model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.
  • Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.
  • Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.

Collaborate & Mentor

  • Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.
  • Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team's capabilities.

Ideal Candidate

  • 5+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
  • Experience delivering PaaS offerings.
  • Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
  • Strong programming skills in Python and C++ (Go or Rust is a plus).
  • Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
  • Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
  • Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.

Skills: AI runtime, software, LLM, deep learning, infrastructure, PyTorch, stack

More Info

Job ID: 138299793