About Salvo Software
Salvo Software is a global firm that provides cost-effective software solutions to guide enterprises and startups through digital transformation. With distributed teams across the US, LATAM, and India, we partner with clients to build high-performance, scalable systems that solve complex technical challenges. Our culture values innovation, ownership, and engineering excellence.
Role Overview
We are seeking a highly skilled AI Developer with a strong backend and machine learning engineering background to design, train, optimize, and deploy LLMs in on-prem and offline environments. This role is deeply technical and hands-on, requiring expertise across Python ML stacks, model optimization, local inference frameworks, RAG (Retrieval-Augmented Generation) architectures, MCP (Model Context Protocol) integrations, and DevOps workflows tailored for offline systems.
You will work closely with our engineering and product teams to build end-to-end LLM pipelines, including data preprocessing, supervised fine-tuning, model quantization, evaluation, RAG pipeline design, and deployment using local or air-gapped infrastructure. If you enjoy working with cutting-edge open-source LLMs, building context-aware AI systems, and designing reliable backend pipelines, this role is for you.
Key Responsibilities
Core LLM Development
- Train and fine-tune LLMs using supervised fine-tuning (SFT)
- Work with open-source models such as LLaMA, Mistral, Qwen, and similar architectures
- Build LoRA / QLoRA pipelines for efficient fine-tuning
- Implement and optimize data preprocessing workflows, including tokenization and long-context handling
- Use and extend Hugging Face Transformers & Datasets for training and inference
- Parse and process structured and semi-structured data, including XML/XSD files
- Implement document parsing solutions for Office formats (python-docx, OpenXML)
RAG & Context-Aware Systems
- Design and implement end-to-end Retrieval-Augmented Generation (RAG) pipelines for document-grounded question answering and knowledge retrieval
- Build and maintain vector stores and embedding pipelines using tools such as FAISS, Chroma, Weaviate, or pgvector
- Optimize retrieval strategies including hybrid search, re-ranking, and chunking approaches tailored for domain-specific corpora
- Develop and maintain MCP (Model Context Protocol) server integrations to enable LLMs to interact dynamically with tools, APIs, and external data sources
- Design agentic workflows that leverage MCP to give models structured access to internal systems and context in a controlled, auditable manner
Offline / On-Prem Model Expertise
- Deploy, run, and maintain models fully offline and in air-gapped environments
- Perform model optimization and quantization (GGUF, GPTQ, AWQ, bitsandbytes)
- Build and maintain inference systems using frameworks like vLLM, TGI, and Ollama
- Optimize GPU usage (CUDA, cuDNN, VRAM-aware batching)
- Maintain local CI/CD pipelines for ML models without cloud dependencies
- Manage local model registries, versioning, and artifacts
- Ensure RAG and MCP components are fully operational in offline and restricted network environments
Backend & DevOps
- Build backend services in Python for ML training and inference workflows
- Work with relational databases (Postgres/MySQL) and vector databases for RAG storage layers
- Use Docker and Git for reliable development and deployment pipelines
- Use Azure DevOps for CI/CD, including self-hosted agents when applicable
Requirements
Technical Skills
- Strong experience in Python for backend and ML development
- Expertise with ML frameworks such as PyTorch or TensorFlow, plus scikit-learn and pandas
- Solid knowledge of Postgres or MySQL for data storage
- Experience with Docker, Git, and DevOps best practices
- Hands-on expertise with LLM training, fine-tuning, and optimization
- Experience with Hugging Face Transformers & Datasets
- Familiarity with XML/XSD and Office document parsing tools
- Experience deploying models with vLLM, TGI, or Ollama
- Understanding of quantization methods and formats (GPTQ, AWQ, GGUF)
- Experience working with GPU optimization and the CUDA stack
- Ability to build solutions for offline, on-prem, and air-gapped environments
- Hands-on experience designing and implementing RAG pipelines, including embedding models, vector stores (FAISS, Chroma, Weaviate, or pgvector), and retrieval optimization strategies
- Experience building or integrating MCP (Model Context Protocol) servers to connect LLMs with external tools, APIs, and structured data sources
Nice to Have
- Experience building agentic systems using MCP in production or near-production environments
- Familiarity with advanced RAG techniques such as HyDE, re-ranking, or multi-hop retrieval
- Experience managing ML model registries in offline environments
- Familiarity with AWS for hybrid deployments
- Experience with secure environments, restricted networks, or enterprise compliance requirements
Soft Skills
- Strong ownership mindset and problem-solving ability
- Ability to work effectively in distributed teams across time zones
- Clear communication when discussing complex technical topics with both technical and non-technical stakeholders