Job Summary:
We are seeking a skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of critical systems and applications. The ideal candidate will have strong expertise in monitoring tools (Splunk, New Relic) and hands-on experience managing and optimizing AWS API Gateway and other cloud-native services.
Key Responsibilities:
- Ensure system reliability, uptime, and scalability through proactive monitoring and incident management.
- Develop and maintain observability dashboards and alerts using Splunk and New Relic.
- Manage and optimize AWS API Gateway configurations for secure and efficient traffic handling.
- Collaborate with development and DevOps teams to automate deployments and implement SRE best practices.
- Conduct root cause analysis (RCA) for incidents and drive post-incident improvements.
- Tune system performance and implement fault-tolerant, high-availability solutions.
- Maintain infrastructure as code and support continuous integration and delivery (CI/CD) pipelines.
Required Skills & Experience:
Primary Skills:
- Site Reliability Engineering (SRE)
- Splunk (Monitoring & Log Analysis)
- New Relic (Application Performance Monitoring)
- AWS API Gateway
Secondary Skills:
- AWS Cloud Infrastructure
- CI/CD Automation
- Incident & Problem Management