There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems.
As a Site Reliability Engineer III at JPMorgan Chase within the Infrastructure Platforms team, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job responsibilities
- Own L1.5/L2 production support, participate in on-call rotations, and drive rapid triage, containment, and recovery for incidents.
- Lead post-incident reviews and implement preventative actions to eliminate repeat issues and reduce operational risk.
- Define and maintain SLIs/SLOs and error budgets for critical user journeys, integrating them with change guardrails to balance velocity and reliability.
- Conduct capacity and performance analysis; design and validate resilience patterns such as high availability, failover, and disaster recovery tests.
- Implement and standardize metrics, logs, and traces; build actionable dashboards and alerts that improve signal-to-noise.
- Tune alert policies to reduce noise and improve MTTD/MTTR, leveraging APM/AIOps to accelerate root-cause analysis.
- Build and maintain CI/CD pipelines (e.g., Jenkins, GitHub Actions, GitLab CI), manage artifact/versioning, and orchestrate environment promotions.
- Enable pre/post-deploy checks, canary/blue-green strategies where feasible, and automated rollback to reduce change failure rate.
- Develop Python-based automation for self-healing, runbook execution, health checks, and operational workflows with tests and code quality gates.
- Support and harden platform/data components including Redis, RDBMS, and Kafka by managing topic lifecycle, capacity, retention, replication, and failover.
- Adhere to governance by executing change management, patching, and vulnerability SLAs; maintain environment configuration integrity across dev/QA/UAT/prod.
Required qualifications, capabilities, and skills
Preferred qualifications, capabilities, and skills
- Operate within highly regulated or large-scale environments, ideally including financial services.
- Implement Infrastructure as Code with Terraform/Ansible and run containers with Docker/Kubernetes.
- Adopt progressive delivery with feature flags and canaries, integrate automated testing frameworks, and enforce policy-as-code.
- Execute performance engineering, load testing, and capacity modeling, and optimize costs with unit-economics dashboards.