At Shakudo, we're building the world's first operating system for data and AI. We use the term operating system in the truest sense: just like iOS, Windows, or Linux, Shakudo's end-to-end OS provides ever-evolving, fully automated, best-in-class open-source components tailored to each business's unique needs.
We are seeking a Senior DevOps Engineer to join our Engineering team and take ownership of deploying, configuring, and operating Shakudo in customer environments. This is a hands-on infrastructure role for someone who can work across Kubernetes, Helm charts, and cloud and on-premises environments, and who can act as a trusted technical advisor to customers: diagnosing problems, designing deployment architectures, and ensuring Shakudo runs reliably in production.
In this role, you will own the deployment lifecycle from architecture to operations: assessing customer infrastructure, deploying Shakudo into complex environments, resolving production issues, and turning recurring problems into product improvements. This is not a traditional internal DevOps role; it is a mix of DevOps engineering, Kubernetes platform engineering, and solution architecture, where success is measured by deployment reliability, customer satisfaction, and operational excellence.
Responsibilities
- Own the deployment and operation of Shakudo across customer Kubernetes environments
- Design, develop, customize, and troubleshoot Helm charts for complex production deployments
- Work deeply with Kubernetes primitives including Deployments, StatefulSets, Services, Ingress, StorageClasses, Secrets, ConfigMaps, RBAC, NetworkPolicies, CRDs, and operators
- Debug Kubernetes issues across scheduling, networking, storage, permissions, DNS, ingress, certificates, and workload reliability
- Build repeatable deployment patterns that work across different customer infrastructure environments
- Assess customer infrastructure and recommend the right deployment architecture for Shakudo
- Work with customer platform, DevOps, security, and infrastructure teams to deploy Shakudo into their environments
- Support deployments across AWS, GCP, Azure, hybrid cloud, and on-premises Kubernetes clusters
- Design for enterprise constraints such as private networking, IAM/RBAC, security controls, observability, compliance requirements, and restricted environments
- Help customers make the right trade-offs across reliability, scalability, performance, cost, and operational complexity
- Build and maintain infrastructure-as-code using tools such as Terraform and related cloud-native tooling
- Operate cloud managed services that interface with Shakudo Kubernetes clusters, including databases, storage, networking, secrets, and identity services
- Support GPU infrastructure and specialized compute environments for data and AI workloads
- Improve deployment automation, release processes, upgrade workflows, monitoring, and operational runbooks
- Identify recurring deployment issues and turn them into product improvements, automation, or reusable patterns
- Monitor, debug, and resolve production issues in customer environments
- Lead root-cause analysis for infrastructure, deployment, and platform reliability issues
- Execute product upgrades, maintenance windows, rollouts, and customer-specific configuration changes
- Improve observability, alerting, logging, and operational visibility across deployments
- Ensure customer environments are stable, secure, scalable, and maintainable
- Act as a trusted technical advisor to customers during deployment and production operations
- Explain infrastructure decisions clearly to both technical and non-technical stakeholders
- Collaborate with Solution Engineering, Product Engineering, and Customer Engineering teams to translate customer requirements into robust deployment architectures
- Document deployment designs, customer-specific configurations, best practices, and troubleshooting guides
- Represent the voice of the customer internally and influence product and platform improvements
Qualifications
- 5+ years of experience in DevOps, Platform Engineering, Infrastructure Engineering, SRE, or a related role
- Strong hands-on experience with Kubernetes in production environments
- Strong hands-on experience developing, maintaining, and troubleshooting Helm charts
- Experience deploying and operating software in customer or enterprise environments
- Experience with cloud platforms such as AWS, GCP, or Azure
- Experience with infrastructure-as-code tools such as Terraform
- Strong understanding of Kubernetes networking, storage, ingress, RBAC, secrets management, observability, and cluster operations
- Ability to troubleshoot complex infrastructure issues across application, Kubernetes, cloud, and network layers
- Familiarity with Python, Go, Bash, or TypeScript for automation and tooling
- Strong communication skills and comfort working directly with customer technical teams
- Ability to operate independently, make sound technical decisions, and drive deployments to completion
A Plus
- Experience with data platforms, AI infrastructure, MLOps, or GPU workloads
- Experience with Kubernetes operators, CRDs, GitOps, Argo CD, Flux, or similar deployment tooling
- Experience with enterprise security requirements, private networking, identity providers, SSO, and compliance-driven environments
- Experience deploying software into air-gapped, restricted, or customer-managed infrastructure
- Prior experience in a customer-facing infrastructure, solution engineering, or solution architecture role
- Contributions to open-source Kubernetes, DevOps, or infrastructure projects
Why Shakudo Stands Out
- Work with cutting-edge technologies in machine learning and high-performance computing
- Contribute to a platform that transforms how organizations leverage data and AI
- Join a dynamic team that values innovation, efficiency, and diversity
Shakudo offers a high-impact package: competitive salary, meaningful equity so you share in the upside of transformational technology, and top-tier health benefits that have you fully covered. We provide 16+ weeks of pregnancy leave top-up and a flexible vacation policy—because building transformational technology requires supporting the people who build it. More importantly, you'll work on technology that matters.
This is a work-from-office role based in Bangalore (HSR Layout). Shakudo has offices in Toronto, San Francisco, and Bangalore.