About the Role
We are looking for a high-octane Lead Cloud Infrastructure & DevSecOps Engineer to serve as the primary technical authority for our global infrastructure. In this Individual Contributor (IC) role, you will own the architectural integrity, security, and scalability of a multi-platform ecosystem supporting Android, iOS, Web, and Backend APIs, with a deep focus on AI/LLM services.
With 8+ years of experience, you are expected to operate with high autonomy—identifying infrastructure bottlenecks before they occur and automating security protocols into the very fabric of our code. You aren't just running DevOps; you are building a self-service, secure-by-default platform for the entire engineering team.
Key Responsibilities
Technical Ownership & DevSecOps Automation
- End-to-End CI/CD: Architect and maintain hardened, high-speed pipelines for Mobile (Android/iOS), Web, and Backend services.
- Security as Code: Integrate automated security gates—SAST, DAST, and software composition analysis (SCA)—into every build to ensure a Shift-Left security posture.
- Release Integrity: Own the deployment lifecycle, implementing advanced release patterns (Blue-Green, Canary) and automated rollbacks to maintain 99.9% uptime.
Advanced Infrastructure & K8s Governance
- Kubernetes Mastery: Design and manage production-grade K8s clusters, focusing on network policies, resource isolation, and pod security standards.
- Immutable Infrastructure: Lead the implementation of modular Infrastructure as Code (Terraform or Pulumi) to ensure global environment parity.
- Compute Optimization: Manage and optimize high-performance GPU clusters specifically for AI/LLM inference and training workloads.
Defense-in-Depth & Security Engineering
- Zero Trust Architecture: Enforce strict identity-based access (IAM/RBAC) and secure service-to-service communication via service meshes.
- Secrets & Compliance: Architect secure credential lifecycles using HashiCorp Vault or cloud-native KMS; ensure the environment is audit-ready for SOC 2/ISO standards.
- Proactive Hardening: Perform regular vulnerability assessments, dependency audits, and infrastructure threat modeling.
Observability & AI Ops
- Full-Stack Monitoring: Build and maintain high-fidelity observability stacks (Prometheus, Grafana, Datadog) to track application health and GPU metrics.
- AI Infrastructure: Scale GPU inference services, manage model versioning pipelines, and implement intelligent rate-limiting for LLM APIs.
- SRE & Resilience: Lead incident response for critical infrastructure issues and design automated, multi-region disaster recovery systems.
Required Skills
Technical Mastery
- Experience: 8+ years in DevOps, SRE, or Infrastructure Engineering, with a deep focus on security.
- Platforms: Expert-level proficiency in AWS, GCP, or Azure.
- Orchestration: Deep hands-on experience with Kubernetes (K8s) and Docker.
- Mobile DevOps: Specific mastery of Android/iOS release automation (Fastlane, secure code signing, and App Store/Play Store distribution).
- IaC: Expert in Terraform or Pulumi.
- Languages: Strong proficiency in Python, Go, or Rust for building custom automation and infrastructure tooling.
Preferred & Bonus
- AI/LLM Ops: Experience scaling GPU-based workloads and managing model deployment lifecycles.
- Real-time Systems: Familiarity with high-concurrency infrastructure such as WebRTC or LiveKit.
- Networking: Deep knowledge of VPCs, BGP, SSL/TLS, and Global Load Balancing.
Qualifications
- Bachelor's or Master's degree in Computer Science, Cybersecurity, or a related technical field.
- Proven track record as a Lead IC in a high-growth startup environment.
- Relevant certifications: CKS (Certified Kubernetes Security Specialist), AWS Certified Security - Specialty, or equivalent.
- Mindset: You treat manual tasks as technical debt and view security as a foundational feature, not a final check.