Senior Site Reliability Engineer (SRE) – DBaaS Platform (Automation)

Tessell

Bengaluru, India

8-10 Years

Posted 3 days ago

Job Description

Job Title: Senior Site Reliability Engineer (SRE) - DBaaS Platform (Automation)
Location: Bangalore
Department: Customer Success
Reports To: VP Customer Success
Role Overview
We are seeking a highly skilled Senior SRE to lead reliability engineering for our
cloud-native Database-as-a-Service (DBaaS) platform. This role will drive automation-first
operations, SRE agent architecture, AI-enabled incident acceleration, and SLO-driven
reliability governance across AWS, Azure, and GCP environments.
You will operate at the intersection of platform engineering, cloud infrastructure,
database reliability, and automation, building self-healing, scalable, and cost-efficient
systems.

Key Responsibilities
1. SRE Agent Architecture & Technical Ownership
• Design and own SRE automation agents for proactive monitoring, remediation, and performance optimization.
• Build event-driven reliability frameworks integrated with observability platforms.
• Define extensible architectures for auto-detection, auto-healing, and intelligent alert reduction.
2. Automation Roadmap Leadership
• Own the automation strategy across the DBaaS lifecycle (provisioning, scaling, patching, backup, DR).
• Drive infrastructure and operational automation maturity.
• Eliminate toil through scripting, tooling, and CI/CD integration.
3. Engineering-Driven Reliability & SLO Governance
• Define and manage SLIs, SLOs, and error budgets.
• Implement reliability scorecards and availability governance.
• Partner with Product and Engineering to embed SRE practices into platform design.
4. AI-Enabled Operational Acceleration
• Integrate AI/ML-based anomaly detection and predictive scaling.
• Enable automated RCA enrichment using log analytics and telemetry intelligence.
• Drive AI-assisted runbooks and decision frameworks.
5. Strong Programming Expertise
• Develop automation frameworks using Python and/or Go.
• Build scalable microservices for reliability orchestration.
• Contribute to platform APIs and reliability tooling.
6. Infrastructure as Code (IaC) Mastery
• Architect and manage infrastructure using Terraform.
• Implement policy-as-code and compliance automation.
• Ensure consistent multi-cloud deployments.
7. Multi-Cloud Expertise
• Deep hands-on experience with AWS, Azure, and GCP.
• Design high-availability, multi-region architectures.
• Implement secure, scalable network and storage solutions across clouds.
8. Containerization & Orchestration
• Strong hands-on experience with Docker and Kubernetes.
• Build and manage stateful workloads in Kubernetes.
• Implement scaling, failover, and resilience patterns.
9. Cloud Networking & Security
• Strong understanding of VPC/VNet, peering, routing, firewalls, IAM, and encryption.
• Implement Zero-Trust and least-privilege access models.
• Embed security into reliability workflows.
10. Database Reliability & High Availability
• Experience managing HA architectures for relational and NoSQL databases.
• Strong knowledge of replication, failover, backup, DR, and PITR.
• Performance tuning and capacity planning expertise.
11. Incident Leadership & RCA Excellence
• Lead critical incident response (P1/P2).
• Conduct structured RCA and preventive action planning.
• Build post-incident automation improvements.
12. Cost Optimization & Operational Efficiency
• Implement FinOps practices for DBaaS workloads.
• Optimize compute, storage, and licensing costs.
• Drive performance-per-dollar improvements.
13. Cross-Team Technical Leadership
• Mentor junior SREs and platform engineers.
• Collaborate with Product, DBA, Security, and Dev teams.
• Influence architecture decisions with a reliability-first mindset.

Required Qualifications
• 8+ years in SRE / DevOps / Platform Engineering roles.
• 3+ years in multi-cloud production environments.
• Strong programming expertise in Python and/or Go.
• Deep experience with Terraform and infrastructure automation.
• Hands-on Kubernetes production experience.
• Experience managing large-scale database platforms.
• Strong understanding of observability (metrics, logs, traces).

Preferred Qualifications
• Experience in DBaaS or SaaS platform companies.
• Experience with AI-driven monitoring/operations.
• Knowledge of distributed systems internals.
• Experience implementing SRE best practices at scale.

Key Competencies
• Systems thinking
• Automation-first mindset
• Bias for engineering over manual ops
• Data-driven decision making
• Strong ownership and accountability
• Executive-level communication during incidents