HDFC Securities

Site Reliability Engineer

  • Posted a day ago

Job Description

As a Site Reliability Engineer - Application Support, you will:

  • Ensure System Reliability & Availability: Monitor, troubleshoot, and maintain critical backend applications and infrastructure to meet SLA/SLO targets and ensure high availability of trading platforms
  • Implement SRE Best Practices: Design and implement monitoring, alerting, and observability solutions using tools like Grafana, Dynatrace, and Elasticsearch to proactively identify and resolve issues
  • Automate Operations: Develop automation scripts and tools using Linux shell scripting and Python to reduce manual intervention, improve system efficiency, and eliminate toil
  • Manage Cloud Infrastructure: Work with AWS services and Terraform to provision, manage, and optimize cloud infrastructure while ensuring cost efficiency and security
  • Container Orchestration: Manage and troubleshoot Kubernetes clusters and deployments, ensuring optimal performance and resource utilization
  • Incident Response & Management: Participate in on-call rotations, lead incident response efforts, perform root cause analysis, and implement preventive measures to reduce recurrence
  • Performance Optimization: Conduct performance testing, capacity planning, and load testing to ensure systems can handle peak trading hours and scale effectively
  • CI/CD Pipeline Understanding: Work with CI/CD tools like GitLab Runner and Argo CD to ensure smooth and reliable deployment processes
  • Database Support: Troubleshoot and optimize Redis caching layers and Oracle databases, including writing and debugging PL/SQL queries for performance tuning
  • Collaboration & Documentation: Work closely with development teams to improve application reliability, create runbooks and SOPs, and maintain comprehensive technical documentation
  • Continuous Improvement: Analyze system metrics, identify bottlenecks, and propose architectural improvements to enhance reliability and performance

We are looking for someone with:

● 5-7 years of hands-on experience in SRE, DevOps, or Application Support roles, preferably in high-availability production environments

● Linux Administration: Strong experience with Linux systems, proficiency in shell scripting for automation, system monitoring, and troubleshooting

● Kubernetes: Hands-on experience managing Kubernetes clusters, troubleshooting pod issues, analyzing logs, configuring deployments, and understanding networking concepts

● AWS Cloud Services: Working knowledge of AWS services (EC2, S3, RDS, Lambda, CloudWatch, ECS, etc.) with experience in troubleshooting and optimizing cloud infrastructure

● Infrastructure as Code: Experience with Terraform or similar tools for provisioning and managing cloud resources

● Monitoring & Observability: Practical experience with APM tools (Dynatrace or similar), Grafana for dashboard creation, and log analysis using Elasticsearch/Kibana

● Database Management: Experience with Redis for caching solutions and Oracle databases, including basic PL/SQL querying and performance troubleshooting

● CI/CD Tools: Familiarity with GitLab, Jenkins, Argo CD, or similar CI/CD platforms for deployment automation

● Scripting & Programming: Proficiency in shell scripting; knowledge of Python or other scripting languages is a plus

● Incident Management: Experience with ServiceNow or similar ITSM tools, understanding of ITIL framework for incident, problem, and change management

● SRE Principles: Understanding of SLIs, SLOs, SLAs, error budgets, and capacity planning concepts

● Problem-Solving Skills: Strong analytical and troubleshooting abilities with attention to detail

● Communication Skills: Ability to collaborate effectively with cross-functional teams and document technical processes clearly

● Education: Bachelor's degree in Computer Science, Information Technology, or equivalent practical experience

The following would be a plus:

  • Prior experience in FinTech, Banking, or Financial Services industries with understanding of regulatory compliance requirements
  • Experience with containerization technologies (Docker, Podman) and container security best practices
  • Knowledge of API Gateway technologies (Kong, AWS API Gateway, etc.) for managing microservices communication
  • Familiarity with chaos engineering and failure injection practices
  • Experience with configuration management tools (Ansible, Chef, Puppet)
  • Understanding of networking concepts, load balancers, and CDN technologies
  • ITIL Foundation certification or strong working knowledge of ITIL processes
  • Experience with security scanning tools and implementing security best practices in DevOps pipelines
  • Contributions to open-source projects or active participation in technical communities
  • Experience with disaster recovery planning and business continuity processes

Job ID: 149363725
