Site Reliability Engineer

3-5 Years

Job Description

System Engineer / Site Reliability Engineer (SRE)

Key Responsibilities

Middleware Management: Deploy, configure, and optimize mission-critical middleware, specifically Apache Kafka clusters and Aerospike NoSQL databases, along with the Grafana and Prometheus monitoring stack.
Observability & Logging: Manage and scale the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, proactive monitoring, and performance analysis.
Secrets Management: Administer HashiCorp Vault to ensure secure storage and dynamic injection of secrets, certificates, and encryption keys.
System Patching: Perform routine patching of Kafka, ELK, and Java, including the associated reboots, along with log rotation and backup verification.
Incident Response & RCA: Troubleshoot complex distributed systems, participate in on-call rotations, and conduct Root Cause Analysis (RCA) to prevent recurrence.
Performance Tuning: Monitor system health and performance metrics to proactively identify bottlenecks in middleware and cloud infrastructure.
Security & Compliance: Implement security best practices, including system hardening, vulnerability remediation, and identity management.

Required Skills & Qualifications

Experience: 3–5 years in Linux System Administration, Middleware Engineering, or SRE roles.
Middleware Expertise: Hands-on experience managing Kafka (brokers, topics, replication) and Aerospike (clustering, storage engines).
Security Tools: Strong understanding of HashiCorp Vault administration (policies, secrets engines, and auth methods).
ELK Stack: Proficiency in managing Elasticsearch clusters, Logstash pipelines, and Kibana visualizations.
Scripting: Advanced Python or Bash scripting for automation and system integrations.
Networking: Deep knowledge of TCP/IP, DNS, Firewalls, Load Balancers, and AWS networking.
High Availability: Solid understanding of clustering, disaster recovery, and failover strategies for middleware tools.

Good to Have (Preferred Skills)

Certifications: Red Hat RHCSA/RHCE or HashiCorp Certified: Vault Associate.
Databases: Basic administration of relational databases like MySQL or PostgreSQL.
Ticketing Systems: Experience with ServiceNow, Jira, or Remedy for incident and change management.
Production Support: Proven track record of managing large-scale production environments with strict SLAs.