Principal Engineer - SRE (Site Reliability Engineering)

Styli

Bengaluru, India

6-10 Years

Job Description

Role: Principal Engineer – SRE (Site Reliability Engineering)

Location: Bangalore

Department: Platform Engineering / SRE

Experience: 6 to 10 Years

About Styli Marketplace

Launched in 2019 by Landmark Group, Styli Marketplace is the group's first e-commerce venture and has quickly become a leading online destination for fashion and lifestyle across the GCC, including Saudi Arabia, the UAE, Kuwait, Bahrain, and beyond.

We connect global sellers and creators with millions of fashion-forward customers, offering the latest trends, exceptional value, and convenient services like same-day to 48-hour delivery and flexible payment options. Our mission is to make style accessible, aspirational, and exciting for all, backed by a passionate team fostering a culture of creativity and innovation. At Styli, we aim to revolutionize fashion retail and bring unique experiences to our customers.

What Are We Looking For

We are looking for a Principal Site Reliability Engineer who combines deep engineering instincts with an operational mindset. You don't just respond to incidents - you design systems that make incidents rare. You don't just monitor dashboards - you build the observability that makes dashboards meaningful.

People who grow fastest at Styli are those who find problems before the business feels them, then engineer their way out of them permanently.

Job Responsibilities

Reliability & Scale

Own and improve service-level objectives (SLOs), error budgets, and SLIs for critical customer-facing and internal services.
Design and implement high-availability, fault-tolerant architectures capable of sustaining peak traffic events (flash sales, regional campaigns) without degradation.
Lead blameless post-incident reviews and drive root-cause elimination — not just mitigation.
Partner with product engineering teams during design and pre-production phases to proactively identify reliability and scalability risks.

Infrastructure & Cloud (GCP)

Manage and scale production infrastructure on Google Cloud Platform (GCP), including GKE, Cloud Run, Cloud SQL, Pub/Sub, and GCS.
Design multi-region, disaster-resilient infrastructure that meets aggressive RTO/RPO targets.
Optimize cloud costs without compromising reliability through rightsizing, committed use discounts, and workload scheduling.
Implement and maintain robust networking, VPC design, load balancing, and CDN configuration on GCP.

Kubernetes & Container Orchestration

Operate, scale, and harden Google Kubernetes Engine (GKE) clusters across environments (dev, staging, production).
Define and enforce standards for container image management, resource quotas, HPA/VPA policies, and pod disruption budgets.
Implement advanced Kubernetes patterns — blue/green deployments, canary releases, and progressive delivery using tools such as Argo Rollouts or Flagger.
Ensure cluster-level and workload-level security through RBAC, network policies, and admission controllers (OPA/Gatekeeper).

Automation & Toil Reduction

Identify and systematically eliminate toil through automation — if you do something twice manually, the third time it should run itself.
Develop self-healing runbooks, automated remediation scripts, and on-call tooling to reduce mean time to recovery (MTTR).

Observability & Incident Management

Build and maintain a best-in-class observability stack: metrics (Prometheus/Cloud Monitoring), logs (Cloud Logging/ELK), and traces (Cloud Trace/Jaeger/OpenTelemetry).
Define alerting standards that prioritize signal over noise — actionable alerts, not alert fatigue.
Own and continuously improve the incident management lifecycle, including on-call rotations, escalation paths, and post-mortem culture.

Collaboration & Engineering Culture

Act as a reliability advocate embedded within cross-functional engineering squads.
Conduct reliability reviews and load/chaos testing exercises prior to major launches.
Mentor junior engineers and champion SRE best practices across the organization.

Qualifications

Must-Have

6 to 10 years of hands-on SRE experience in a high-traffic, consumer-facing product environment.
Strong proficiency with GCP services (GKE, Cloud Run, Pub/Sub, Cloud SQL, Cloud Armor, GCS, IAM).
Deep operational expertise in Kubernetes — from cluster administration to workload tuning and troubleshooting.
Experience with Terraform and Helm for infrastructure and application lifecycle management.
Proficiency in at least one scripting/programming language — Python, Go, or Bash — for automation and tooling.
Hands-on experience with CI/CD platforms and GitOps practices.
Strong understanding of distributed systems, networking fundamentals (DNS, TLS, TCP/IP, HTTP/2), and database reliability patterns.

Good to Have

Experience with chaos engineering tools (Chaos Mesh, LitmusChaos, Gremlin).
Familiarity with service mesh technologies (Istio, Linkerd).
Exposure to FinOps practices and cloud cost governance.
Experience in e-commerce, fintech, or other high-availability consumer platforms at scale.
Certifications: Google Cloud Professional Cloud DevOps Engineer, Google Cloud Professional Cloud Architect, Certified Kubernetes Administrator (CKA), or Certified Kubernetes Application Developer (CKAD).