System Engineer / Site Reliability Engineer (SRE)
Key Responsibilities
- Middleware Management: Deploy, configure, and optimize mission-critical middleware, specifically Apache Kafka clusters and Aerospike NoSQL databases, along with Grafana and Prometheus.
- Observability & Logging: Manage and scale the ELK Stack (Elasticsearch, Logstash, Kibana) for centralized logging, proactive monitoring, and performance analysis.
- Secrets Management: Administer HashiCorp Vault to ensure secure storage and dynamic injection of secrets, certificates, and encryption keys.
- System Patching: Perform routine patching of Kafka, ELK, and Java, along with scheduled reboots, log rotation, and backup verification.
- Incident Response & RCA: Troubleshoot complex distributed systems, participate in on-call rotations, and conduct Root Cause Analysis (RCA) to prevent recurrence.
- Performance Tuning: Monitor system health and performance metrics to proactively identify bottlenecks in middleware and cloud infrastructure.
- Security & Compliance: Implement security best practices, including system hardening, vulnerability remediation, and identity management.
Required Skills & Qualifications
- Experience: 3–5 years in Linux System Administration, Middleware Engineering, or SRE roles.
- Middleware Expertise: Hands-on experience managing Kafka (brokers, topics, replication) and Aerospike (clustering, storage engines).
- Security Tools: Strong understanding of HashiCorp Vault administration (policies, secrets engines, and auth methods).
- ELK Stack: Proficiency in managing Elasticsearch clusters, Logstash pipelines, and Kibana visualizations.
- Scripting: Advanced Python or Bash scripting for automation and system integrations.
- Networking: Deep knowledge of TCP/IP, DNS, firewalls, load balancers, and AWS networking.
- High Availability: Solid understanding of clustering, disaster recovery, and failover strategies for middleware tools.
Good to Have (Preferred Skills)
- Certifications: RHCSA or RHCE (Red Hat Enterprise Linux administration), or HashiCorp Certified: Vault Associate.
- Databases: Basic administration of relational databases like MySQL or PostgreSQL.
- Ticketing Systems: Experience with ServiceNow, Jira, or Remedy for incident and change management.
- Production Support: Proven track record of managing large-scale production environments with strict SLAs.