Unified Consultancy Services

Platform Engineer (Site Reliability / DevOps Engineer)

8-10 Years
  • Posted 17 hours ago

Job Description

Responsibilities/What You'll Do:

  • Platform Design and Architecture: Build and operate a highly available, scalable, modular AI platform using technologies such as Qdrant, Anyscale, and Ray to support LLM orchestration, vector search, and multi-agent frameworks.
  • Core Infrastructure Development: Build essential APIs and infrastructure to power conversational applications, AI agents, and analytics tools.
  • LLM Operational Solutions: Implement workflows for large language models, including inference pipelines, fine-tuning, caching, and evaluation for open-weight and hosted models.
  • Deployment & Performance Optimization: Deploy AI services on AWS with Kubernetes (EKS), Lambda, and ECS, ensuring scalability and resilience while optimizing vector databases and model runtimes for cost and performance.
  • Collaboration, Governance, & Mentorship: Partner with engineering and research teams to deliver production-grade, self-healing, performance-optimized services for AI/RAG pipelines; establish governance and security standards; and mentor junior engineers in AI infrastructure best practices and reviews.

What We're Looking For (Minimum Qualifications)

  • 8+ years of experience as a Platform Engineer (Site Reliability / DevOps Engineer), with at least 3 years in AI/ML platform development (MLOps).
  • Deep expertise in Python, with strong design and debugging skills.
  • Ability to work independently and lead complex projects, with excellent problem-solving, analytical, and communication skills.
  • Proficiency with cloud platforms such as AWS, GCP, or Azure; familiarity with MLOps/AI DevOps tools such as MLflow or Kubeflow; and proficiency in CI/CD and infrastructure as code (Terraform / CloudFormation).
  • Hands-on expertise with CI/CD pipelines, model observability, and incident response for AI/ML services.

Preferred Qualifications

  • Experience implementing and optimizing platforms that support large language model (LLM) pipelines with frameworks such as LangChain, LlamaIndex, Hugging Face Transformers, or similar.
  • Hands-on knowledge of setting up and scaling vector DB platforms such as Qdrant (or other vector DBs like Pinecone or Weaviate) for semantic search and embeddings management.
  • Exposure to MLOps tools, Ray.io, Anyscale, or other distributed orchestration and inference frameworks.
  • Experience developing and deploying containerized applications using Docker and Kubernetes, including Helm charts and automated scaling.
  • Understanding of LLMOps patterns: model registry, prompt versioning, and feedback loops.

More Info

Job ID: 148921433