Kale Logistics - Senior Site Reliability Engineer

Kale Logistics Solutions

Pune, India

10-12 Years

Posted a month ago

Job Description

Join Kale Logistics Solutions

Incorporated in 2010, Kale Logistics Solutions is a trusted global cloud-based technology provider for several Fortune 500 companies, offering a comprehensive suite of solutions for the logistics industry. Drawing on in-depth domain knowledge and technical expertise, Kale has built enterprise systems and Cargo Community Platforms that offer a single electronic window capable of supporting operational flows, distributing data to stakeholders, and facilitating the paperless exchange of trade-related information among them.

Kale's community and enterprise solutions cater to a wide network of Logistics Service Providers (LSPs) and help strengthen their operational and business capabilities. With offices in India, the UAE, Kenya, the Netherlands, and North America, and 5,500+ clients across 40 countries, Kale Logistics Solutions is a major player in the industry.

About The Role

We are looking for a highly skilled Senior Site Reliability Engineer (SRE) to join our engineering organization. As a senior member of the team, you will play a key role in designing, building, and operating highly scalable, reliable, and secure systems across cloud and on-prem environments. You will partner closely with product engineering, DevOps, security, and platform teams to drive reliability, improve developer velocity, and advance operational excellence.

This role requires hands-on experience with large-scale distributed systems, deep expertise in automation and infrastructure engineering, and a passion for reducing toil through code.

What You'll Do

Reliability & Performance

Ensure availability, resilience, scalability, and performance of production systems
Define, implement, and enforce SLIs, SLOs, and error budgets
Conduct capacity planning, load testing, and performance tuning

Automation & Operations Engineering

Automate manual operational tasks via tooling, scripts, and platform services
Develop infrastructure as code (IaC) for cloud and on-premise environments
Implement CI/CD improvements and production-safe rollout strategies (blue/green, canary, feature toggles)

Observability & Monitoring

Build, manage, and improve logging, metrics, tracing, and alerting
Implement proactive monitoring strategies to detect issues before they impact customers
Own incident management processes including postmortems and runbooks

Security & Compliance

Integrate security controls into pipelines and runtime environments
Enforce least-privilege access, secret management, and vulnerability remediation
Partner with SecOps to ensure compliance in regulated environments

Collaboration & Coaching

Work daily with engineering and DevOps teams to improve system reliability
Mentor junior team members on design, reliability, cloud systems, and operational excellence
Advocate SRE principles across engineering teams

Incident Response & Continuous Improvement

Lead incident triage and recovery
Drive blameless post-incident reviews and systemic fixes
Reduce MTTR through tooling, automation, and resilient architectures

Who You Are

10+ years of experience in SRE/Systems Engineering roles
Expertise in Linux-based systems and distributed architectures
Proficiency in one or more programming/scripting languages: Python, Go, Bash, Java, or similar
Hands-on experience with:
Kubernetes (managed or self-hosted on-prem)
Docker and container ecosystems
Infrastructure automation tools: Terraform, Helm, etc.
CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, Azure DevOps, etc.)
Cloud experience with at least one major provider (AWS / Azure / GCP)
Strong understanding of:
Networking concepts (DNS, load balancers, VPC, firewalls, NAT, routing)
Observability stacks (Prometheus/Grafana, ELK, Splunk, OpenTelemetry, New Relic, Datadog)
Experience running production systems at scale

Preferred

Experience with on-prem infrastructure, VMware, or hybrid-cloud environments
Database reliability knowledge (PostgreSQL, MySQL, NoSQL stores such as MongoDB, caching systems)
Experience with:
Distributed messaging (Kafka, RabbitMQ, SNS/SQS, etc.)
Zero-downtime deployments
Background in:
FinOps optimization
Resiliency patterns (circuit breakers, retries, autoscaling)
Certification(s) in cloud platforms or Kubernetes

Why Join Us

Empowerment and Growth: We provide opportunities for continuous learning and development to help you perform at your best.
Inclusive Culture: We celebrate diversity and create an inclusive environment where everyone feels valued and respected.
Innovation: Be part of a team that is driving innovation in the logistics industry with cutting-edge technology solutions.
Global Impact: Work on projects that have a significant impact on global trade and logistics, contributing to the efficiency and sustainability of the industry.

(ref:hirist.tech)