PARTNER CONSULTANT - Reliability Analysis

Happiest Minds Technologies

Bengaluru, India

2-4 Years

Posted 4 hours ago

Job Description

Site Reliability Engineer
What will you do:
You'll be a key part of our Infrastructure Platform team, focusing on the critical infrastructure that powers . Beyond core infrastructure work, you'll also collaborate closely with a product development team, offering your expertise to coach and guide them on infrastructure and architectural decisions.

In your day-to-day, you will:

Build and maintain our production infrastructure to ensure scalability and high availability, while maximizing development team efficiency.
Troubleshoot and debug issues related to both product and infrastructure.
Automate everything! If something's worth doing, it's definitely worth automating.
Improve and extend our Kubernetes platform, which leverages EKS.
Provide crucial insights into scalability for our developers.
Participate in an on-call rotation to support our production systems.
Who you are:
You're someone who loves ownership: you design it, you build it, you own it! You're a self-motivated individual and a strong team player within the Infrastructure Platform team.
You have at least 2 years of experience working as a DevOps Engineer (or a similar role like Software Engineer or Cloud Engineer).
You have proven experience in architecting systems based on both functional and non-functional requirements.
Your qualifications
You should be proficient in, or have solid knowledge of:
Observability & Reliability
SLO/SLI Management: Experience defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to measure service health.
Modern Observability: Proficiency with high-cardinality observability platforms; Honeycomb experience is a major plus, but experience with similar tools (e.g., New Relic, Datadog) is welcome.
Proactive Monitoring: Proven ability to move beyond basic threshold alerts toward trend-based, proactive alerting and distributed tracing.
Incident Response: Experience with blameless post-mortems and a focus on reducing toil through automation.
Infrastructure & Orchestration
Containerization: Proficiency in container orchestration and related technologies such as Kubernetes and Docker.
Service Mesh: Experience with Istio for traffic management, security, and microservices observability.
Public Cloud: Strong hands-on experience with AWS.
Linux: Deep knowledge of Linux-based systems.
Automation & Data
CI/CD: Experience with Jenkins or GitHub Actions.
Cloud Orchestration: Proficiency in Terraform and Ansible for automation and service configuration.
Data Engines: Familiarity with SQL, NoSQL, OpenSearch, and AWS S3.
Programming: Proficiency in at least one of our core languages: Python, TypeScript, or Java.