AI Engineer – Vision Language Models & Agentic Systems

intozi

Gurugram, Gurugram, India

2-4 Years

Job Description

AI Engineer – Vision Language Models & Agentic Systems (2–3 Years of Experience)

About the Role

We're looking for an AI Engineer with 2–3 years of hands-on experience in AI/ML, particularly in deploying and optimizing Vision Language Models (VLMs), Large Language Models (LLMs), or related generative AI systems.

You'll build and deploy vision-language systems that run efficiently across the full hardware spectrum, from constrained edge devices to multi-GPU servers. You'll own the model loading, optimization, and inference stack while building agentic AI systems that turn visual understanding into actionable intelligence.

Our deployments are fully self-hosted, often air-gapped, and run directly on bare metal with no cloud dependencies. This is a hands-on role for someone who enjoys working close to the hardware, optimizing GPU performance, and designing intelligent AI systems that solve real-world problems.

What You'll Do

Deploy and optimize Vision Language Models (VLMs) for production inference across edge accelerators, embedded GPUs, and datacenter-class GPUs.
Own the end-to-end model loading and inference pipeline, including quantization, memory budgeting, KV-cache management, and throughput/latency optimization.
Build AI agents that combine visual perception with tool use, retrieval, and structured reasoning to automate complex workflows.
Design, benchmark, and optimize inference-serving strategies, including batching, process isolation, threading, and independent CUDA contexts.
Deploy and maintain AI systems in fully offline, bare-metal, and air-gapped environments without relying on cloud services.
Port and optimize models across runtimes such as PyTorch/Transformers, vLLM, and ONNX Runtime with quantization-aware deployment strategies.
Profile and troubleshoot GPU-level performance issues, including VRAM utilization, CUDA kernels, precision tradeoffs, and runtime bottlenecks.
Integrate AI outputs with downstream systems such as databases, vector search, analytics platforms, and business workflows.

What We're Looking For

2–3 years of hands-on experience building and deploying AI/ML systems, with practical experience working on VLMs, LLMs, or other generative AI models.
Strong Python programming skills and experience deploying AI models in production.
Solid understanding of GPU inference, including VRAM management, CUDA contexts, memory optimization, and performance tuning.
Hands-on experience with model quantization techniques such as FP8, INT8 (SmoothQuant/W8A8), and 4-bit weight-only quantization methods like AWQ.
Experience deploying models across a range of hardware, from resource-constrained edge devices to enterprise GPU servers.
Experience working in bare-metal, on-premise, or air-gapped environments without cloud-managed infrastructure.
Familiarity with inference frameworks such as Hugging Face Transformers, vLLM, and ONNX Runtime.
Understanding of VLM-specific challenges, including vision encoder activations, image-token KV-cache growth, and mixed-precision inference.
Experience building AI agents with tool calling, orchestration, retrieval, and multi-step reasoning.
Strong debugging and problem-solving skills with a focus on production-quality AI systems.

Nice to Have

Experience deploying AI workloads on edge devices, including INT8 calibration and model compilation pipelines.
Experience with computer vision pipelines such as YOLO, custom pre/post-processing, and NMS.
Familiarity with vector databases and hybrid semantic + structured search.
Strong benchmarking and profiling discipline with a focus on optimizing real production workloads.
Experience working with SQL and analytics-backed systems.
Contributions to open-source AI projects or experience building reusable AI frameworks and tooling.

Mail ID: [Confidential Information]