Site Reliability Engineer

Tookitaki

Bengaluru, India

3-6 Years

Posted 21 hours ago

Job Description

Position Overview

Job Title: Site Reliability Engineer (SRE)

Department: Technology

Location: Bangalore

Reporting To: Head of Infra

Tookitaki is looking for a Site Reliability Engineer (SRE) with 3–6 years of experience to help maintain and scale the infrastructure that powers our flagship products—FinCense and the AFC Ecosystem. As an SRE, you will work at the intersection of software engineering and infrastructure, ensuring high availability, performance, and scalability of our platforms. You will collaborate with engineering, DevOps, and client success teams to operationalize deployments across on-premise, VPC, and Compliance as a Service (CaaS) environments while improving monitoring, automation, and incident response.

Position Purpose

The SRE role is responsible for ensuring the reliability and efficiency of Tookitaki's production systems and environments. This includes building monitoring systems, improving deployment pipelines, automating routine operations, and responding to production incidents. You'll help build a resilient infrastructure that supports our mission to provide AI-driven solutions that prevent financial crime.

Key Responsibilities

System Monitoring & Incident Management

Build and maintain monitoring, alerting, and logging systems using tools like Prometheus, Grafana, and ELK.
Respond to incidents and outages, conduct post-mortems, and implement corrective actions.

Infrastructure & Deployment Automation

Automate infrastructure provisioning and application deployment using Terraform, Ansible, or Helm.
Contribute to CI/CD pipelines (GitLab CI, Jenkins, etc.) to improve the reliability and speed of software delivery.

Container & Orchestration Management

Manage and troubleshoot Docker containers and Kubernetes clusters, ensuring workload scaling, resource management, and health.
Support application updates, rollbacks, and blue-green or canary deployments.

Cloud & Platform Operations

Operate within AWS (preferred; EC2, S3, VPC, IAM) or GCP environments.
Monitor system availability and resource usage across environments.

Security & Reliability Enhancements

Implement and monitor TLS/SSL, RBAC, SSO, and secure API practices.
Support compliance and security audit activities by maintaining logs, access controls, and operational hygiene.

Collaboration & Documentation

Work closely with developers, infra engineers, and support teams to ensure production readiness.
Maintain playbooks, runbooks, and system documentation for reliability engineering activities.

Qualifications and Skills

Education

Bachelor's degree in Computer Science, Engineering, or related technical field.

Experience

3–6 years in Site Reliability Engineering, DevOps, or a related role.
Experience with production environments and live system debugging.

Technical Skills

Kubernetes, Docker, Helm – experience deploying and scaling services.
Linux administration and command-line debugging.
Hands-on experience with AWS (preferred) or GCP cloud platforms.
Scripting in Bash and Python for automation and monitoring tasks.
Experience with monitoring and alerting tools like Prometheus, Grafana, ELK, or Datadog.
Familiarity with databases (e.g., MariaDB, ScyllaDB) and SQL/CQL querying.

Soft Skills

Strong problem-solving and debugging skills.
Ability to work in on-call rotations and high-pressure production environments.
Excellent communication and documentation abilities.

Key Competencies

Operational Reliability: Ensures system uptime and performance through proactive monitoring and maintenance.
Automation Mindset: Reduces manual effort through scripting and tooling.
Incident Response: Identifies and resolves issues quickly to minimize downtime.
Cross-Functional Collaboration: Works effectively with engineering, support, and infra teams.
Security Awareness: Applies best practices in infrastructure and platform security.

Success Metrics

Maintain 99.9%+ uptime across production environments.
Reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for critical incidents.
Increase automation coverage and reduce manual deployment steps.
High internal satisfaction from developers on CI/CD and platform reliability.
Compliance readiness and security log availability for audits.

Benefits

Competitive compensation.
Work on a globally recognized RegTech platform transforming financial crime prevention.
Exposure to cutting-edge AI and big data infrastructure (Spark, Kafka, ScyllaDB, Flink).

About Company

Tookitaki

Job Source: www.linkedin.com

Job ID: 147194219

3-5 yrs

Bengaluru, India

Skills:

CloudFormation, Prometheus, Grafana, Pulumi, Datadog, Jenkins, Linux, Docker, Terraform, Ansible, AWS IAM, Puppet, Kubernetes, Python, AWS, Chef, Go, EKS, GitLab CI, GitHub Actions
