AI Engineer – Vision Language Models & Agentic Systems (2–3 Years of Experience)
About the Role
We're looking for an AI Engineer with 2–3 years of hands-on experience in AI/ML, particularly in deploying and optimizing Vision Language Models (VLMs), Large Language Models (LLMs), or related generative AI systems.
You'll build and deploy vision-language systems that run efficiently across the full hardware spectrum—from constrained edge devices to multi-GPU servers. You'll own the model loading, optimization, and inference stack while building agentic AI systems that transform visual understanding into actionable intelligence.
Our deployments are fully self-hosted, often air-gapped, and run directly on bare metal—no cloud dependencies. This is a hands-on role for someone who enjoys working close to the hardware, optimizing GPU performance, and designing intelligent AI systems that solve real-world problems.
What You'll Do
- Deploy and optimize Vision Language Models (VLMs) for production inference across edge accelerators, embedded GPUs, and datacenter-class GPUs.
- Own the end-to-end model loading and inference pipeline, including quantization, memory budgeting, KV-cache management, and throughput/latency optimization.
- Build AI agents that combine visual perception with tool use, retrieval, and structured reasoning to automate complex workflows.
- Design, benchmark, and optimize inference-serving strategies, including batching, process isolation, threading, and independent CUDA contexts.
- Deploy and maintain AI systems in fully offline, bare-metal, and air-gapped environments without relying on cloud services.
- Port and optimize models across runtimes such as PyTorch with Hugging Face Transformers, vLLM, and ONNX Runtime, using quantization-aware deployment strategies.
- Profile and troubleshoot GPU-level performance issues, including VRAM utilization, CUDA kernels, precision tradeoffs, and runtime bottlenecks.
- Integrate AI outputs with downstream systems such as databases, vector search, analytics platforms, and business workflows.
What We're Looking For
- 2–3 years of hands-on experience building and deploying AI/ML systems, with practical experience working on VLMs, LLMs, or other generative AI models.
- Strong Python programming skills and experience deploying AI models in production.
- Solid understanding of GPU inference, including VRAM management, CUDA contexts, memory optimization, and performance tuning.
- Hands-on experience with model quantization techniques such as FP8, INT8 (SmoothQuant/W8A8), and 4-bit weight-only quantization methods like AWQ.
- Experience deploying models across a range of hardware, from resource-constrained edge devices to enterprise GPU servers.
- Experience working in bare-metal, on-premises, or air-gapped environments without cloud-managed infrastructure.
- Familiarity with inference frameworks such as Hugging Face Transformers, vLLM, and ONNX Runtime.
- Understanding of VLM-specific challenges, including vision encoder activations, image-token KV-cache growth, and mixed-precision inference.
- Experience building AI agents with tool calling, orchestration, retrieval, and multi-step reasoning.
- Strong debugging and problem-solving skills with a focus on production-quality AI systems.
Nice to Have
- Experience deploying AI workloads on edge devices, including INT8 calibration and model compilation pipelines.
- Experience with computer vision pipelines such as YOLO, custom pre/post-processing, and NMS.
- Familiarity with vector databases and hybrid semantic + structured search.
- Strong benchmarking and profiling discipline with a focus on optimizing real production workloads.
- Experience working with SQL and analytics-backed systems.
- Contributions to open-source AI projects or experience building reusable AI frameworks and tooling.
Mail ID: [Confidential Information]