We are seeking a skilled and experienced Platform Engineer/Architect to lead the setup, advancement, and maintenance of a robust on-premises environment for hosting open-source large language models. This role involves designing and implementing scalable, secure, and efficient infrastructure solutions that cater to the specific needs of large-scale AI models.
HOW YOU WILL CONTRIBUTE AND WHAT YOU WILL LEARN
- Design and architect a scalable and secure on-premises hosting environment for large language models.
- Develop and implement infrastructure automation tools for efficient management and deployment.
- Ensure high availability and disaster recovery capabilities.
- Optimize the hosting environment for maximum performance and efficiency.
- Implement monitoring tools to track system performance and resource utilization.
- Regularly update the infrastructure to incorporate the latest technological advancements.
- Establish robust security protocols to protect sensitive data and model integrity.
- Ensure compliance with data protection regulations and industry standards.
- Conduct regular security audits and vulnerability assessments.
- Work closely with AI/ML teams to understand their requirements and provide suitable infrastructure solutions.
- Provide technical guidance and support to internal teams and stakeholders.
- Stay abreast of emerging trends in AI infrastructure and large language model hosting.
- Manage physical and virtual resources to ensure optimal allocation and utilization.
- Forecast resource needs and plan for future expansion and upgrades.
KEY SKILLS AND EXPERIENCE
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field with 7-12 years of experience.
- Proven experience in infrastructure architecture, with exposure to AI/ML environments.
- Experience with inference frameworks such as TGI, TEI, LoRAX, and S-LoRA.
- Experience with training frameworks such as PyTorch and TensorFlow.
- Proven experience hosting open-source models on-premises, such as Llama 3 and Mistral.
- Strong knowledge of networking, storage, and computing technologies.
- Experience working with container orchestration tools (e.g., Kubernetes - Red Hat OS).
- Proficiency in Python programming.
- Familiarity with open-source large language models and their hosting requirements.
- Excellent problem-solving and analytical skills.
- Strong communication and collaboration abilities.