About the Role
We are looking for a Model Compression and Quantization Engineer to design, build, and support optimized AI models for lower-cost, faster, and edge-ready inference. The ideal candidate will collaborate with business, data, and engineering teams to deliver secure, scalable, and measurable AI solutions while improving model efficiency and deployment performance.
Key Responsibilities
- Design and implement model compression strategies to optimize AI models for production environments.
- Apply quantization and pruning techniques to reduce model size and improve inference speed.
- Convert models to ONNX and optimize them with TensorRT for deployment across cloud, edge, and embedded platforms.
- Perform model benchmarking to evaluate latency, throughput, memory usage, and accuracy trade-offs.
- Develop and maintain optimization pipelines using Python and AI frameworks.
- Collaborate with data scientists, ML engineers, and business stakeholders to deliver efficient AI solutions.
- Support deployment of optimized models for cloud, edge, and embedded environments.
- Monitor model performance and recommend improvements for scalability and cost optimization.
- Ensure AI solutions comply with security, reliability, and governance standards.
Required Skills
- Strong understanding of model quantization techniques
- Experience with model pruning and compression methods
- Hands-on experience with ONNX and TensorRT
- Expertise in model benchmarking and performance optimization
- Proficiency in Python
- Understanding of AI model deployment and inference optimization
Experience Requirements
- Up to 5 years of overall experience
- A minimum of 1–2 years of relevant hands-on experience in model optimization, compression, quantization, or related AI technologies