About KnowDis.ai
KnowDis AI is a research-driven startup building cutting-edge solutions in Natural Language Processing (NLP) and Generative AI. Our AI tools are designed to automate and enhance business processes, including content generation, information retrieval, and document understanding. We work with clients across industries such as e-commerce, fashion, banking, and pharma, helping them unlock value through enterprise-ready AI solutions.
Role overview
We are seeking a Lead DevOps / Cloud Engineer to own and scale the production infrastructure that powers high-availability AI workloads and real-time search experiences.
This role will lead the design, reliability, and automation of cloud infrastructure, CI/CD systems, and production operations for a high-traffic platform serving enterprise-grade deployments. The ideal candidate combines strong cloud architecture expertise with deep operational ownership, enabling reliable GPU-backed AI model serving and low-latency distributed systems at scale.
You will work closely with backend, platform, and ML engineering teams to ensure resilient, observable, and cost-efficient infrastructure capable of supporting rapid experimentation as well as mission-critical production workloads.
Key responsibilities
- End-to-end infrastructure ownership: Design, deploy, and operate scalable cloud infrastructure supporting a high-scale multimodal search platform handling production traffic across text, image, and voice workloads.
- Kubernetes platform engineering: Architect and manage Kubernetes-based production environments (EKS/GKE/AKS) with robust autoscaling, failover mechanisms, and zero-downtime deployment practices.
- Infrastructure as Code (IaC): Design and maintain reproducible infrastructure using Terraform or equivalent tools to ensure secure, version-controlled environments across development, staging, and production.
- Reliability & performance engineering: Implement high-availability and low-latency architectures capable of handling traffic spikes through intelligent caching, queuing, rate limiting, load balancing, and graceful degradation strategies.
- CI/CD & release automation: Build and maintain automated CI/CD pipelines enabling safe, repeatable, and rapid deployments with rollback capabilities and environment parity.
- Disaster recovery & incident management: Establish redundancy, backup, and disaster recovery systems aligned with defined SLOs/SLAs; lead incident response practices including runbooks and post-incident reviews.
- Observability & monitoring: Implement comprehensive observability covering metrics, logs, distributed tracing, alerting, and performance monitoring; drive continuous reliability improvements through data-driven insights.
- AI infrastructure & model serving: Support GPU-based inference infrastructure and AI model deployment pipelines (e.g., Triton, vLLM, TGI), collaborating closely with MLOps and ML teams to ensure reliable, scalable model serving under production workloads.
- Security & operational excellence: Enforce cloud security best practices, access controls, secrets management, and infrastructure governance aligned with enterprise deployment standards.
- Cost optimization: Continuously monitor and optimize cloud resource utilization, GPU workloads, and infrastructure spend without compromising performance or reliability.
Required skills & experience
- 6-10 years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Platform Engineering roles within high-scale technology environments.
- Strong hands-on expertise with at least one major cloud platform (AWS, GCP, or Azure), including networking, compute, storage, and managed Kubernetes services.
- Deep experience operating production Kubernetes environments at scale, including autoscaling, cluster upgrades, workload orchestration, and resilience design.
- Proven experience implementing Infrastructure as Code using Terraform (preferred) or equivalent tooling.
- Strong understanding of distributed systems reliability, including load balancing, caching strategies, asynchronous queues, and failure recovery patterns.
- Experience designing and managing CI/CD pipelines using modern tooling (GitHub Actions, GitLab CI, ArgoCD, Jenkins, or equivalent).
- Hands-on experience building observability stacks using tools such as Prometheus, Grafana, ELK/OpenSearch, Datadog, or OpenTelemetry.
- Experience supporting GPU workloads and AI inference systems, including containerized model deployment and performance optimization for production ML systems.
- Familiarity with AI model serving frameworks such as Triton Inference Server, vLLM, TGI, or similar platforms is strongly preferred.
- Strong scripting and automation skills (Python, Bash, or Go preferred).
- Solid understanding of networking, security best practices, secrets management, and cloud cost optimization strategies.
- Experience working in fast-moving startup or scale-up environments with high ownership expectations.
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical discipline.
- Relevant cloud certifications (AWS/GCP/Azure, Kubernetes CKA/CKAD) are preferred but not mandatory.
- Demonstrated experience supporting production systems with defined uptime, latency, and reliability targets.
Why join KnowDis.ai
You will help build foundational infrastructure powering advanced multimodal AI systems used in real-world production environments. This role offers deep technical ownership, exposure to cutting-edge AI workloads, and the opportunity to shape platform reliability from the ground up within a fast-growing AI company.
Selection process
- Candidates must apply through this listing on Jigya. Only applications received through this posting will be evaluated further.
- Shortlisted candidates may be asked to appear for a screening interview administered by Jigya.
- Candidates who clear the Jigya screening rounds will be interviewed by KnowDis.