  • Posted 2 days ago

Job Description

Role Overview

Lead the DevOps and infrastructure team as both a technical leader and hands-on individual contributor, managing the company's growing cloud and on-premises resources with exceptional reliability and performance. You'll be responsible for maintaining 99% uptime for our high-throughput AdTech platform while optimizing costs and building a world-class infrastructure team.

Key Responsibilities

  • Maintain 99% uptime and meet SLAs across all environments while reducing infrastructure costs by 20-30%
  • Design and implement deployment architecture for high-throughput systems (25,000-30,000 QPS, sub-100ms latency)
  • Manage multi-cloud infrastructure (AWS, DigitalOcean, GCP) using Infrastructure as Code
  • Build CI/CD pipelines, monitoring systems, and automation for distributed microservices
  • Troubleshoot production issues, including Kafka consumer lag, RabbitMQ failures, and performance problems in Node.js, Python, and Java applications
  • Lead incident response (including the on-call rotation) and post-mortems, and implement preventive measures
  • Implement security best practices (OAuth, OIDC, SSO) and disaster recovery protocols
  • Build and mentor a team of infrastructure engineers

Required Skills & Experience

Experience: 5+ years in DevOps/Infrastructure roles, including 2+ years with high-throughput systems (10,000+ QPS)

Infrastructure & Cloud (MUST HAVE)

  • Strong production experience with Infrastructure as Code (Terraform, Terragrunt, Ansible)
  • Production Kubernetes and Docker experience with complex microservices architectures
  • Multi-cloud expertise: AWS (VPC, EC2, ECS, Fargate, S3, Glacier, RDS, Route 53, CloudFront, Lambda, API Gateway, CloudWatch), DigitalOcean, Azure, or GCP
  • Advanced Linux system administration (RHEL, Ubuntu, Amazon Linux) and networking concepts

Data Systems (Added Advantage)

  • ClickHouse: Production operations, query optimization, data retention policies for billions of auction records
  • Kafka: Consumer/producer optimization, lag management, performance tuning for high-volume message streams (millions of messages/day)
  • RabbitMQ: Message routing, cluster management, troubleshooting connection failures in K8s environments
  • MySQL: Database administration, replication, backup/recovery
  • Elasticsearch: Bulk indexing optimization, cluster health management

Development & CI/CD

  • CI/CD tools: GitHub Actions, Jenkins, GitLab CI, or similar
  • Programming: Python (required), Shell scripting (required); Rust or Go strongly preferred
  • JVM troubleshooting: Profiling, GC tuning, memory leak detection, and an understanding of Java Spring Boot applications
  • Microservices architectures and API design patterns
  • Software development lifecycle and agile methodologies

Monitoring & Observability

  • Prometheus, Grafana, ELK stack (Elasticsearch, Logstash, Kibana, Filebeat)
  • System performance troubleshooting under load (CPU bottlenecks, memory leaks, network latency)
  • Incident response and production support with a systematic debugging approach
  • Understanding of RED metrics (Rate, Errors, Duration) and USE metrics (Utilization, Saturation, Errors)

Nice to Have (Strong Bonus)

AdTech & Domain Knowledge

  • Experience with programmatic advertising and Real-Time Bidding (RTB) systems
  • Understanding of ad auction mechanics and sub-100ms latency requirements
  • Familiarity with ad fraud prevention and transparency measures
  • Knowledge of supply-side platforms (SSP) and demand-side platforms (DSP)

Blockchain & Distributed Systems

  • Blockchain infrastructure and node operations (Sui ecosystem experience is a major bonus)
  • Experience with decentralized storage systems (Walrus, IPFS, Arweave)
  • Data pipeline integration between blockchain and distributed storage
  • Understanding of consensus mechanisms and distributed ledger technology

Advanced Technical Skills

  • Rust or Go programming experience
  • MLOps practices and tooling
  • Security systems implementation (OAuth 2.0, OIDC, SSO with Okta/Auth0)
  • Data lifecycle management and GDPR/privacy compliance awareness
  • Experience with high-frequency trading or financial systems
  • Experience in start-up or R&D environments with rapid iteration
  • Relevant cloud certifications (AWS Certified DevOps Engineer Professional, CKA, CKAD)

Job ID: 142012331
