Site Reliability Engineer

HDFC Securities

Mumbai, India

5-7 Years

Posted a day ago

Job Description

As a Site Reliability Engineer - Application Support, you will:

● Ensure System Reliability & Availability: Monitor, troubleshoot, and maintain critical backend applications and infrastructure to meet SLA/SLO targets and ensure high availability of trading platforms

● Implement SRE Best Practices: Design and implement monitoring, alerting, and observability solutions using tools like Grafana, Dynatrace, and Elasticsearch to proactively identify and resolve issues

● Automate Operations: Develop automation scripts and tools using Linux shell scripting and Python to reduce manual intervention, improve system efficiency, and eliminate toil

● Manage Cloud Infrastructure: Work with AWS services and Terraform to provision, manage, and optimize cloud infrastructure while ensuring cost efficiency and security

● Container Orchestration: Manage and troubleshoot Kubernetes clusters and deployments, ensuring optimal performance and resource utilization

● Incident Response & Management: Participate in on-call rotations, lead incident response efforts, perform root cause analysis, and implement preventive measures to reduce recurrence

● Performance Optimization: Conduct performance testing, load testing, and capacity planning to ensure systems can handle peak trading hours and scale effectively

● CI/CD Pipeline Understanding: Work with CI/CD tools like GitLab Runner and Argo CD to ensure smooth and reliable deployment processes

● Database Support: Troubleshoot and optimize Redis caching layers and Oracle databases, including writing and debugging PL/SQL queries for performance tuning

● Collaboration & Documentation: Work closely with development teams to improve application reliability, create runbooks and SOPs, and maintain comprehensive technical documentation

● Continuous Improvement: Analyze system metrics, identify bottlenecks, and propose architectural improvements to enhance reliability and performance

We are looking for someone with:

● 5-7 years of hands-on experience in SRE, DevOps, or Application Support roles, preferably in high-availability production environments

● Linux Administration: Strong experience with Linux systems and proficiency in shell scripting for automation, system monitoring, and troubleshooting

● Kubernetes: Hands-on experience managing Kubernetes clusters, troubleshooting pod issues, analyzing logs, configuring deployments, and understanding networking concepts

● AWS Cloud Services: Working knowledge of AWS services (EC2, S3, RDS, Lambda, CloudWatch, ECS, etc.) with experience in troubleshooting and optimizing cloud infrastructure

● Infrastructure as Code: Experience with Terraform or similar tools for provisioning and managing cloud resources

● Monitoring & Observability: Practical experience with APM tools (Dynatrace or similar), Grafana for dashboard creation, and log analysis using Elasticsearch/Kibana

● Database Management: Experience with Redis for caching solutions and Oracle databases, including basic PL/SQL querying and performance troubleshooting

● CI/CD Tools: Familiarity with GitLab, Jenkins, Argo CD, or similar CI/CD platforms for deployment automation

● Scripting & Programming: Proficiency in shell scripting; knowledge of Python or other scripting languages is a plus

● Incident Management: Experience with ServiceNow or similar ITSM tools, and an understanding of the ITIL framework for incident, problem, and change management

● SRE Principles: Understanding of SLIs, SLOs, SLAs, error budgets, and capacity planning concepts

● Problem-Solving Skills: Strong analytical and troubleshooting abilities with attention to detail

● Communication Skills: Ability to collaborate effectively with cross-functional teams and document technical processes clearly

● Education: Bachelor's degree in Computer Science, Information Technology, or equivalent practical experience

The following would be a plus:

● Prior experience in FinTech, Banking, or Financial Services industries with an understanding of regulatory compliance requirements

● Experience with containerization technologies (Docker, Podman) and container security best practices

● Knowledge of API Gateway technologies (Kong, AWS API Gateway, etc.) for managing microservices communication

● Familiarity with chaos engineering and failure injection practices

● Experience with configuration management tools (Ansible, Chef, Puppet)

● Understanding of networking concepts, load balancers, and CDN technologies

● ITIL Foundation certification or strong working knowledge of ITIL processes

● Experience with security scanning tools and implementing security best practices in DevOps pipelines

● Contributions to open-source projects or active participation in technical communities

● Experience with disaster recovery planning and business continuity processes

More Info

Job Type: Permanent Job

Industry: Other

Function: Site Reliability Engineering

Employment Type: Full time

About Company

HDFC Securities

Job Source: www.linkedin.com

Job ID: 149363725

Last Updated: 21-06-2026 05:53:14 PM
