Computer Vision Research Engineer

Sapien Robotics

India

3-5 Years

Posted a day ago

Job Description

Role Description

We are hiring a Senior Computer Vision Research Engineer to design and deploy scalable, low-latency video analytics systems for large-scale CCTV networks. Core focus: building best-in-class Vision-Language Models (VLMs) optimized for edge deployment, enabling multimodal reasoning (VQA, semantic search, event description) in resource-constrained environments.

Key Responsibilities:

Architect end-to-end pipelines: MOT, Re-ID, action/anomaly detection, scene understanding.
Develop and optimize sub-2B parameter VLMs for edge (e.g., surpassing Moondream2/Qwen2-VL benchmarks) using QAT, PTQ, pruning, distillation, and efficient architectures.
Scale real-time processing of thousands of streams with sub-second latency.
Profile and resolve bottlenecks in video analytics and multimodal systems.
Optimize for edge hardware (Jetson, Coral, Hailo) via TensorRT/OpenVINO/TVM.
Design hybrid cloud-edge architectures and production monitoring.

Qualifications:

At least 3 years of industry experience developing and deploying computer vision systems for video analytics at scale.
Proven track record of production deployments across large-scale camera networks, including the full lifecycle from prototyping to monitoring.
Demonstrated expertise in building and optimizing Vision-Language Models (VLMs) for edge environments, with hands-on experience in architectures like unified embedding, cross-modality attention, or efficient variants (e.g., SmolVLM, LFM2-VL, MobileVLM).
Deep understanding of performance bottlenecks in contemporary video analytics and VLM systems (e.g., GPU/CPU saturation, PCIe bandwidth contention, codec latency, drift due to domain shift, high token counts in multimodal processing, and privacy-preserving inference).
Hands-on expertise in edge model optimization using TensorFlow Lite, ONNX Runtime, PyTorch Mobile, OpenVINO, TensorRT, or TVM, achieving 25x reductions in latency/memory while maintaining accuracy, including techniques for VLM compression like token pruning or multi-scale pooling.
Strong proficiency in Python/C++, with extensive experience in PyTorch/TensorFlow, OpenCV, CUDA, and distributed training/inference frameworks.
Solid foundation in modern CV architectures (Transformers, CNNs, hybrid models), real-time tracking algorithms (DeepSORT, ByteTrack, BoT-SORT), and VLM components (e.g., vision encoders like ViT, multimodal pre-training strategies).

What we offer:

Competitive compensation package with equity.
Comprehensive health benefits and flexible working arrangements.
Access to cutting-edge hardware, cloud credits, and conference attendance support.
Opportunity to shape the future of AI-powered physical security systems.