About the Role
We are seeking an AI Inference Optimization Engineer to design, build, and support high-performance model-serving pipelines for enterprise AI applications. The ideal candidate will work closely with business, data, and engineering teams to deliver secure, scalable, and measurable AI solutions while optimizing inference performance, resource utilization, and deployment efficiency.
Key Responsibilities
- Design and develop high-performance AI inference and model-serving pipelines.
- Optimize large language model inference using vLLM and TensorRT-LLM.
- Improve GPU utilization through batching, caching, and request scheduling techniques.
- Build scalable and reliable AI serving infrastructure for enterprise applications.
- Deploy and manage inference workloads in Kubernetes-based environments.
- Monitor system performance, latency, throughput, and infrastructure utilization.
- Collaborate with AI engineers, data scientists, platform teams, and business stakeholders.
- Implement observability, monitoring, and alerting solutions for AI services.
- Continuously improve inference efficiency, scalability, and cost-effectiveness.
- Ensure security, reliability, and governance standards are followed throughout the AI lifecycle.
Required Skills
- Hands-on experience with vLLM
- Knowledge of TensorRT-LLM
- Strong understanding of GPU-based inference optimization
- Experience with batching and caching techniques
- Proficiency in Kubernetes
- Experience with monitoring and observability tools
- Understanding of scalable AI serving architectures
Experience Requirements
- Up to 5 years of overall experience
- At least 1–2 years of hands-on experience in AI inference, model serving, MLOps, or related technologies