We are seeking a seasoned Site Reliability Engineer (SRE) with a solid background in payment systems and high-availability architectures. The ideal candidate will have hands-on experience managing large-scale, distributed systems in production, with a deep understanding of reliability, scalability, and performance tuning in the financial services or payments industry.
Key Responsibilities
- Design, build, and maintain scalable, resilient, and secure infrastructure for high-volume payment platforms.
- Ensure system uptime, reliability, and performance through effective monitoring, alerting, and incident response strategies.
- Collaborate with software engineering and DevOps teams to implement CI/CD pipelines and improve deployment efficiency.
- Automate infrastructure management tasks using Infrastructure-as-Code (IaC) tools (Terraform, Ansible, etc.).
- Proactively identify and mitigate system bottlenecks and potential points of failure.
- Manage disaster recovery strategies, failover planning, and performance testing for critical payment services.
- Work with development teams to ensure services are designed for reliability, scalability, and observability from the ground up.
- Participate in root cause analysis and post-incident reviews to prevent future outages.
Required Skills & Experience
- 8+ years of overall experience in infrastructure engineering or SRE roles, including at least 3 years in the payments/fintech domain.
- Strong understanding of payment protocols (UPI, IMPS, RTGS, NEFT, SWIFT, etc.) and transaction processing systems.
- Proven expertise in Linux systems administration, cloud platforms (AWS, GCP, or Azure), and container orchestration (Kubernetes).
- Solid experience with monitoring and logging tools such as Prometheus, Grafana, the ELK Stack, and Splunk.
- Proficiency in one or more scripting languages (Python, Shell, Go, etc.) for automation.
- Experience with incident management, SLAs, and system troubleshooting in high-pressure environments.
- Familiarity with security and compliance practices in the financial sector (e.g., PCI-DSS, ISO 27001).
Preferred Qualifications
- Previous experience supporting mission-critical applications in banking or financial services.
- Exposure to Kafka, Redis, or other real-time streaming and caching technologies.
- Experience with Site Reliability Engineering principles and implementing SLOs/SLIs.
- Understanding of the error budget concept and how it ties into availability and release decisions.
- Experience with a performance testing tool such as k6, JMeter, or LoadRunner.
- Familiarity with mocking tools such as Mockito, WireMock, or Microcks.