
We are looking for a Cloud Engineer with strong cloud-native and Kubernetes expertise to design, build, and operate large-scale container platforms. The ideal candidate will have deep knowledge of Kubernetes internals, hands-on experience with modern cloud-native toolchains, and strong operational discipline for running production workloads. Experience with multi-cluster environments, Cluster API, and GPU/ML workloads is a strong advantage.
Key Responsibilities
- Design, deploy, and operate Kubernetes clusters across multiple environments, including managed Kubernetes and Cluster API-based provisioning.
- Manage and optimize Kubernetes control plane components, networking, storage layers, and scheduling behavior for high availability and performance.
- Implement and operate cloud-native tooling such as Helm, Operators/CRDs, GitOps workflows, container runtimes, and CI/CD pipelines.
- Build automation for cluster lifecycle management, scaling, upgrades, and multi-cluster governance.
- Collaborate with platform, ML, and application teams to support GPU workloads, model serving pipelines, and training/inference environments.
- Troubleshoot complex distributed system issues related to networking, storage, performance, and container orchestration.
- Develop automation and tooling using Go, Python, or Bash to streamline operations and improve platform reliability.
- Drive incident response and root cause analysis for production issues across Kubernetes and supporting services.
Core Competencies
- Kubernetes Architecture & Internals (control plane, CNI, CSI, scheduling)
- Cloud-Native Tooling (Helm, CRDs/Operators, GitOps, container runtimes)
- Multi-Cluster & Cluster Lifecycle Management (Cluster API, managed K8s)
- Linux & Networking Fundamentals (TCP/IP, namespaces, routing, firewalls)
- Distributed Systems & Reliability Engineering
- Scripting & Automation (Go, Python, Bash)
- GPU Workloads & ML Platform Operations (model serving, training pipelines)
Professional Experience Highlights
- Designed and operated production-grade Kubernetes clusters supporting large-scale microservices, data platforms, or ML workloads.
- Implemented multi-cluster automation frameworks using Cluster API, GitOps, and declarative lifecycle management.
- Built or maintained Kubernetes Operators, Helm charts, or CRDs to automate cloud-native workflows.
- Supported GPU-accelerated workloads, including ML model inference, training jobs, or high-performance compute pipelines.
- Led incident response and deep-dive debugging for complex issues involving Kubernetes networking, storage, or control plane failures.
- Developed internal tools and automation for cluster provisioning, monitoring, upgrades, and compliance.
Technologies & Tools
- Kubernetes
- Helm
- Operators/CRDs
- Cluster API
- GitOps (Argo CD / Flux)
- Container Runtimes (containerd, CRI-O)
- Go
- Python
- Bash
- Linux Networking
- Prometheus/Grafana
- GPU Workloads (NVIDIA stack)
Job ID: 141472153