Airtel Payments Bank

Site Reliability Engineer

  • Posted 2 hours ago

Job Description

System Engineer / Site Reliability Engineer (SRE)

Key Responsibilities

  • Middleware Management: Deploy, configure, and optimize mission-critical middleware, specifically Apache Kafka clusters and Aerospike NoSQL databases, along with Grafana and Prometheus.
  • Observability & Logging: Manage and scale the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, proactive monitoring, and performance analysis.
  • Secrets Management: Administer HashiCorp Vault to ensure secure storage and dynamic injection of secrets, certificates, and encryption keys.
  • System Patching: Perform routine patching (Kafka, ELK, and Java), along with reboot activities, log rotation, and backup verification.
  • Incident Response & RCA: Troubleshoot complex distributed systems, participate in on-call rotations, and conduct Root Cause Analysis (RCA) to prevent recurrence.
  • Performance Tuning: Monitor system health and performance metrics to proactively identify bottlenecks in middleware and cloud infrastructure.
  • Security & Compliance: Implement security best practices, including system hardening, vulnerability remediation, and identity management.

Required Skills & Qualifications

  • Experience: 3–5 years in Linux System Administration, Middleware Engineering, or SRE roles.
  • Middleware Expertise: Hands-on experience managing Kafka (brokers, topics, replication) and Aerospike (clustering, storage engines).
  • Security Tools: Strong understanding of HashiCorp Vault administration (policies, secret engines, and auth methods).
  • ELK Stack: Proficiency in managing Elasticsearch clusters, Logstash pipelines, and Kibana visualizations.
  • Scripting: Advanced Python or Bash scripting for automation and system integrations.
  • Networking: Deep knowledge of TCP/IP, DNS, firewalls, load balancers, and AWS networking.
  • High Availability: Solid understanding of clustering, disaster recovery, and failover strategies for middleware tools.

Good to Have (Preferred Skills)

  • Certifications: Red Hat RHCSA/RHCE (RHEL administration) or HashiCorp Certified: Vault Associate.
  • Databases: Basic administration of relational databases like MySQL or PostgreSQL.
  • Ticketing Systems: Experience with ServiceNow, JIRA, or Remedy for incident and change management.
  • Production Support: Proven track record of managing large-scale production environments with strict SLAs.

Job ID: 146431969
