Overview
We are seeking a skilled Platform Engineer to join our team and drive the development, deployment, and supportability of our Kubernetes-based microservices platform, which customers deploy on-premises. You will build comprehensive observability, enable log and report extraction for service cases where we have no real-time access to the customer environment, and reduce our over-reliance on Kafka by integrating Redis and batch processing. This role requires expertise in Kubernetes, Azure DevOps, C++ support, deployment sizing, and designing for reliability, availability, and serviceability (RAS).
Responsibilities
- Build Comprehensive Observability: Implement centralized metrics, logging, and tracing (e.g., Prometheus, Fluentd, OpenTelemetry) for .NET, Python, Java, C++, Kafka, and Redis, ensuring supportability in on-premises environments.
- Enable Log/Report Extraction: Design customer-facing tools (e.g., CLI scripts, Helm chart options) that collect and export logs and metrics from on-premises deployments for service cases where real-time access is not available.
- Optimize Kafka Usage: Audit and optimize Kafka configurations (e.g., topics, partitions, compression) to reduce metadata streaming overhead, and monitor the results with Prometheus or Azure Monitor.
- Implement Alternatives: Reduce Kafka dependency by integrating Redis (e.g., Azure Cache for Redis) for metadata caching and pub/sub, and by introducing batch processing (e.g., Azure Data Factory, Kubernetes Jobs) for high-volume data.
- Troubleshoot Customer Environments: Debug issues in on-premises customer deployments for services (C++, .NET, Python, Java), Kafka, and Redis, using exported logs and metrics.
- Enhance Product Supportability: Build Azure DevOps pipelines and installers (e.g., Helm charts) for consistent, supportable deployments, with documentation for customer support.
- Contribute to RAS: Own serviceability by building observability and diagnostic tools; support reliability/availability via Kubernetes optimization, autoscaling, and fault-tolerant designs.
- Enforce Standards: Implement and enforce structured logging (e.g., JSON with correlation IDs) and resource sizing standards via Azure DevOps pipelines.
- Optimize Deployment Sizing: Set Kubernetes resource requests/limits and autoscaling policies (e.g., HPA, VPA) for services, Kafka, Redis, and batch jobs, based on profiling.
- Evaluate Service Meshes: Assess service meshes (e.g., Linkerd) for improving microservice and data platform observability and communication.
- Support C++ Services: Assist developers in containerizing, deploying, and debugging C++ services, ensuring integration with observability, Kafka, Redis, or batch workflows.
- Automate with Azure DevOps: Build CI/CD pipelines in Azure DevOps for automated builds, tests, and deployments, integrating with AKS, Kafka, and Redis.
Qualifications
- Experience: 3-5 years with Kubernetes, Azure DevOps (AKS, pipelines), and Kafka administration.
- Technical Skills:
  - Expert in Kubernetes (CKA/CKAD preferred) and Azure DevOps (YAML pipelines, AKS integration).
  - Proficient in observability tools (e.g., Prometheus, Grafana, Fluentd, OpenTelemetry, Azure Monitor) for metrics, logs, and tracing.
  - Experience with on-premises Kubernetes deployments and log/report extraction for service cases.
  - Proficient in Kafka optimization (e.g., topic management, consumer groups) and monitoring.
  - Knowledge of Redis (e.g., Azure Cache for Redis, pub/sub) and batch processing (e.g., Azure Data Factory, Kubernetes Jobs).
  - Familiarity with C++ build systems (e.g., CMake) and debugging (e.g., gdb) in Kubernetes.
  - Proficiency in Kubernetes resource management and autoscaling (e.g., HPA, VPA).
  - Scripting skills (e.g., Python, Bash) for automation, diagnostics, and log extraction.
- Customer Focus: Proven ability to troubleshoot on-premises customer environments and build supportable deployment and observability tools.
- Standards Enforcement: Experience enforcing logging, sizing, and data platform standards via Azure DevOps pipelines.
- RAS Expertise: Ability to design for serviceability (observability, diagnostics) and contribute to reliability/availability through platform optimization.
Nice-to-Haves
- Experience with service meshes (e.g., Linkerd, Istio) and their integration with Azure.
- Familiarity with .NET, Python, or Java for developer collaboration.
- Knowledge of air-gapped Kubernetes deployments (e.g., kubeadm, K3s).