Ola

Cloud Engineer - II (Kubernetes & Cloud-Native)

  • Posted 7 days ago
  • Over 50 applicants

Job Description

We are looking for a Cloud Engineer with strong cloud-native and Kubernetes expertise to design, build, and operate large-scale container platforms. The ideal candidate will have deep knowledge of Kubernetes internals, hands-on experience with modern cloud-native toolchains, and the operational discipline to run production workloads. Experience with multi-cluster environments, Cluster API, and GPU/ML workloads is a significant advantage.

Key Responsibilities

- Design, deploy, and operate Kubernetes clusters across multiple environments, including managed Kubernetes and Cluster API-based provisioning.

- Manage and optimize Kubernetes control plane components, networking, storage layers, and scheduling behavior for high availability and performance.

- Implement and operate cloud-native tooling such as Helm, Operators/CRDs, GitOps workflows, container runtimes, and CI/CD pipelines.

- Build automation for cluster lifecycle management, scaling, upgrades, and multi-cluster governance.

- Collaborate with platform, ML, and application teams to support GPU workloads, model serving pipelines, and training/inference environments.

- Troubleshoot complex distributed system issues related to networking, storage, performance, and container orchestration.

- Develop automation and tooling using Go, Python, or Bash to streamline operations and improve platform reliability.

- Drive incident response and root cause analysis for production issues across Kubernetes and supporting services.

Core Competencies

- Kubernetes Architecture & Internals (control plane, CNI, CSI, scheduling)

- Cloud-Native Tooling (Helm, CRDs/Operators, GitOps, container runtimes)

- Multi-Cluster & Cluster Lifecycle Management (Cluster API, managed K8s)

- Linux & Networking Fundamentals (TCP/IP, namespaces, routing, firewalls)

- Distributed Systems & Reliability Engineering

- Scripting & Automation (Go, Python, Bash)

- GPU Workloads & ML Platform Operations (model serving, training pipelines)

Professional Experience Highlights

- Designed and operated production-grade Kubernetes clusters supporting large-scale microservices, data platforms, or ML workloads.

- Implemented multi-cluster automation frameworks using Cluster API, GitOps, and declarative lifecycle management.

- Built or maintained Kubernetes Operators, Helm charts, or CRDs to automate cloud-native workflows.

- Supported GPU-accelerated workloads, including ML model inference, training jobs, or high-performance compute pipelines.

- Led incident response and deep-dive debugging for complex issues involving Kubernetes networking, storage, or control plane failures.

- Developed internal tools and automation for cluster provisioning, monitoring, upgrades, and compliance.

Technologies & Tools

- Kubernetes, Helm, Operators/CRDs, Cluster API, GitOps (Argo CD / Flux), Container Runtimes (containerd, CRI-O), Go, Python, Bash, Linux, Networking, Prometheus/Grafana, GPU Workloads (NVIDIA stack)

Job ID: 141472153