
Role: Principal Engineer – DevOps
Location: Bangalore
Department: Platform Engineering / DevOps
Experience: 8–12 Years
About Styli Marketplace
Launched in 2019 by Landmark Group, Styli Marketplace is the group's first e-commerce venture and has quickly become a leading online destination for fashion and lifestyle across the GCC, including Saudi Arabia, the UAE, Kuwait, Bahrain, and beyond. Styli connects global sellers and creators with millions of fashion-forward customers, offering the latest trends, exceptional value, and convenient services like same-day to 48-hour delivery and flexible payment options. Our mission is to make style accessible, aspirational, and exciting for all, backed by a passionate team fostering a culture of creativity and innovation. At Styli, we aim to revolutionize fashion retail and bring unique experiences to our customers.
Role Overview
We are looking for a senior, hands-on Principal Engineer – DevOps to set the technical direction and own the cloud infrastructure, container platform, and delivery pipelines that power Styli's commerce business on GCP. As the most senior individual contributor on the platform engineering team, you will combine architectural ownership with deep hands-on execution — writing Terraform, tuning GKE, debugging incidents, and driving step-changes in reliability, velocity, and cost. You will define the standards and golden paths that engineering teams build on, mentor senior engineers, and partner with leadership to scale Styli reliably through flash sales, regional expansion, and traffic peaks of 10x–20x baseline.
What You'll Do
Cloud Architecture & Infrastructure (GCP & AWS)
• Own end-to-end cloud architecture for Styli's commerce platform on GCP (primary):
compute, networking, storage, data, and managed services across multiple regions and
projects.
• Architect multi-region, highly available environments with strong fault tolerance, disaster
recovery (RTO/RPO), and cost efficiency.
• Lead IAM strategy, organisation policy, VPC design, Shared VPC, VPC Service Controls,
Private Service Connect, Cloud Interconnect, and cross-cloud networking.
• Drive cloud cost optimisation (FinOps) — right-sizing, committed-use and Spot/preemptible
strategies, chargeback/showback, and continuous review of multi-million-dollar cloud spend.
• Manage cloud-native services at scale: GKE, Cloud Run, Cloud SQL, AlloyDB, Spanner,
Pub/Sub, Cloud CDN, Cloud Armor, Apigee, and equivalent AWS services where used.
Kubernetes & Container Orchestration
• Architect and operate production Kubernetes clusters on GKE (Standard and Autopilot) at
scale — multi-cluster, multi-region, with strong tenancy and security boundaries.
• Design workload scheduling strategies: HPA, VPA, KEDA, node pool strategy, topology
spread, PDBs, and graceful shutdown patterns for stateful and stateless services.
• Own GitOps delivery with ArgoCD or Flux, plus Helm and Kustomize patterns that scale across
dozens of services and environments.
• Enforce cluster security: RBAC, network policies, OPA/Gatekeeper or Kyverno admission
control, signed images (Cosign/Sigstore), Workload Identity, and secrets management
(GCP Secret Manager / HashiCorp Vault).
• Run a production service mesh (Anthos Service Mesh / Istio) for mTLS, traffic shaping,
progressive delivery, and L7 observability.
Scaling, Reliability & Performance Engineering
• Design and implement horizontal and vertical scaling strategies for traffic peaks — flash
sales, campaign launches, and seasonal events driving 10x–20x baseline traffic.
• Build and tune auto-scaling pipelines across cloud instances, Kubernetes workloads, and
managed services; validate behaviour end-to-end with load testing and game days.
• Define SLOs, SLIs, and error budgets for critical services; lead capacity planning with defined
headroom and continuous tuning.
• Drive observability across logs, metrics, traces, and events using Prometheus, Grafana,
OpenTelemetry, Cloud Operations Suite, and tools like Datadog or New Relic.
• Lead incident response — on-call structure, runbooks, blameless postmortems — and drive
systemic reliability improvements (chaos engineering, dependency hardening, elimination of
single points of failure).
CI/CD & Developer Experience
• Design and own fast, secure CI/CD pipelines using GitHub Actions, GitLab CI, Jenkins, or
Cloud Build — supporting microservices, containerised builds, and multi-environment
promotion.
• Build the internal developer platform and golden paths so engineering teams can ship safely
and quickly without re-solving infrastructure problems.
• Implement progressive delivery patterns: blue-green, canary releases, and feature flags with
automated rollback driven by SLO breach.
• Embed security gates in CI/CD — SAST, DAST, SCA, IaC scanning, container image
scanning, and secret scanning — with developer-friendly feedback loops.
• Drive build optimisation — caching strategies, parallelisation, and pipeline observability — to
reduce feedback cycle times.
Infrastructure as Code, Automation & Technical Leadership
• Author and maintain infrastructure using Terraform (primary) at scale — module design, state
management, policy-as-code (OPA/Checkov), and CI-integrated plan/apply workflows.
• Eliminate operational toil through automation in Python, Bash, or Go; build self-healing
systems and reproducible environments across dev, staging, and production.
• Set technical standards, reference architectures, and golden paths; lead architecture reviews
for new services and major platform changes.
• Mentor senior engineers and influence the broader engineering organisation on cloud,
container, and delivery practices — without becoming a bottleneck.
What We're Looking For
Required Skills:
• 8–12 years of hands-on DevOps / Platform / Infrastructure Engineering experience, with 3+
years at a senior, staff, or principal IC level.
• Deep hands-on GCP expertise — designing and operating large-scale, multi-region production
environments using GKE, Cloud SQL/Spanner, Pub/Sub, Cloud CDN, Cloud Armor, Apigee,
Anthos Service Mesh, and GCP networking (Shared VPC, PSC, VPC-SC, Interconnect).
• Working knowledge of AWS for multi-cloud workloads — EKS, RDS, networking, and IAM.
• Expert-level Kubernetes operations: multi-cluster GKE, workload scheduling, autoscaling,
RBAC, networking (CNI, Ingress, Service Mesh), and admission control.
• Expert with Terraform for multi-cloud IaC — module design, state management, policy-as-code,
and CI-integrated workflows.
• Strong experience with CI/CD tooling (GitHub Actions, GitLab CI, Jenkins, Cloud Build) and
progressive delivery patterns.
• Strong scripting skills in Python and/or Go; comfortable reading application code in
mainstream languages.
• Solid understanding of Linux internals, distributed systems, networking (DNS, TLS, mTLS,
HTTP/2, gRPC), and modern data stores.
• Proven experience designing for high availability at e-commerce scale: load
balancing, stateless service design, multi-tier caching, and graceful degradation.
• Excellent communication and influence — able to drive alignment across engineering, product,
and leadership without relying on positional authority.
Good to Have
• Experience running infrastructure for high-traffic e-commerce events (flash sales, product
launches, or festive-season scaling) at GCC, EMEA, or APAC scale.
• SRE foundations — SLOs, error budgets, capacity planning, and incident command.
• FinOps leadership — managing multi-million-dollar cloud spend with chargeback/showback
models.
• Service mesh (Anthos Service Mesh / Istio) and GitOps (ArgoCD / Flux) in production.
• Platform-level security hardening — container security (Falco), signed images, admission
control.
• Experience with data platform infrastructure: Kafka, Spark, Airflow, dbt, or BigQuery at
production scale.
• Chaos engineering and continuous resilience testing experience.
• Relevant certifications: Google Professional Cloud Architect / DevOps Engineer, AWS
Solutions Architect / DevOps Professional, CKA / CKS.
The Scale You'll Work At
• Millions of active users across multiple countries in the Middle East, including Saudi Arabia,
the UAE, Kuwait, Qatar, Bahrain, and Oman.
• High-concurrency commerce events — flash sales and campaign launches driving 10x–20x
baseline traffic.
• Microservices deployed across multi-region Kubernetes clusters with strong tenancy and
security boundaries.
• Cloud architecture hosted on GCP (primary) with additional workloads on AWS, with a focus
on regional proximity to end users across the GCC.
Job ID: 147871881
Skills:
AWS, Kubernetes, DevSecOps principles, Infrastructure-as-Code, observability platforms, cloud-native architectures, CI CD platforms