
At American Express, our culture is built on a 175-year history of innovation, shared values and Leadership Behaviors, and an unwavering commitment to back our customers, communities, and colleagues. From delivering differentiated products to providing world-class customer service, we operate with a strong risk mindset, ensuring we continue to uphold our brand promise of trust, security, and service.
Here, your voice and ideas matter, your work makes an impact, and together, you will help us define the future of American Express.
Role Purpose
The Tech Analyst – Environment & Monitoring is a critical role within the Production Assurance Team, responsible for ensuring 24x7 system availability, stability, and operational readiness across infrastructure, application, and network environments. The role acts as the first line of defense for environment health checks and monitoring operations, and ensures seamless readiness across UAT and Production platforms. The Customer Journey PME team is a cross-functional, collaborative and innovative team responsible for partnering with engineering and product partners to ensure alignment between the organizations and to contribute to key strategic efforts. This strategic role ensures uninterrupted service delivery, rapid incident response, and continuous improvement of operational processes, partnering closely with business and technology teams.
Key Responsibilities
Environment Readiness & Management
Perform environment readiness checks ahead of UAT and Production cutovers, ensuring all systems are validated and deployment-ready
Conduct regular batch job health verification and system configuration audits to ensure compliance with operational standards
Support DR (Disaster Recovery) readiness activities and facilitate quick failover response as required
Maintain system configuration documentation and ensure audit trails are up to date
Ensure an SOP for Environment Readiness is created and executed, with evidence of its execution retained
Monitoring & Incident Support
Lead 24x7 incident management, including proactive monitoring, triage, escalation, and resolution, to preserve service availability and minimize downtime
Execute 24x7 proactive monitoring of infrastructure, applications, batch jobs, SFTP, network, DB, firewall/security, and end-user computing metrics
Direct comprehensive root cause analysis and problem management, instituting robust remediation and prevention strategies for recurring operational incidents
Implement software development practices to build observability, alerting, tracing, automation and self-healing capabilities to maintain the highest levels of platform availability.
Participate in incident detection, triage, and escalation, coordinating with Shift Incharge and L1/L2/L3 support teams for timely resolution
Oversee implementation and execution of Disaster Recovery (DR) and business continuity plans, orchestrating readiness drills and post-event reviews with all relevant stakeholders
Assist in major incident management and crisis response under the guidance of the PAT Lead and Shift Incharge
Coordinate with vendor support teams for handshake and resolution of platform issues
Establish strong collaboration with business units, IT operations, compliance and vendor teams to synchronize issue resolution and foster shared ownership of production health
Utilize service analytics, key performance indicators (KPIs), and post-fix insights to drive ongoing process optimization, operational efficiency, and measurable improvement in service reliability
Lead regular operational governance forums, providing transparent reporting on incident trends, recovery status, and change outcomes for executive leadership
Prepare incident-related regulatory documentation and communications at the appropriate frequency
Be part of a global operations team that supports a 24/7 model, with a willingness to work holidays and weekends
Access, SFTP & Credential Management
Manage access provisioning, SFTP configuration, and credential management in accordance with security and compliance policies
Support periodic maintenance windows and coordinate planned downtime activities with stakeholders
Ticket Management & Reporting
Manage, track, and report on tickets through the IT ticketing system, ensuring SLA adherence and timely resolution
Contribute to operational analytics and incident trend reporting to identify recurring issues and root causes
Maintain shift logs, knowledge base articles, and handover documentation for continuity
Automation & Continuous Improvement
Identify opportunities for process automation and operational efficiency improvements within the monitoring and environment management space
Implement and manage productivity enhancement tools to reduce manual intervention
Contribute to continuous improvement initiatives under the Optimization, Productivity, and Automation workstream of PAT
Communication & Stakeholder Management
Communicate environment status, outages, and recovery updates to relevant stakeholders in a timely and accurate manner
Support regulatory and vulnerability-related communications
Additional Responsibilities
Make hands-on contributions to enterprise solutions, tooling, and initiatives, leveraging your technical experience.
Implement shift-left automated testing to prevent defects from reaching production.
Ensure all new critical subsystems, microservices, databases and external calls meet the five-nines (99.999%) availability requirement.
Conduct technical code reviews and drive innovation across the organization to adopt industry best practices.
Review all significant functionality changes and peer review critical production hotfixes.
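For context on the availability requirement listed above, five nines means 99.999% uptime. The sketch below works through the arithmetic under the assumption of a 365-day year. The function names are illustrative only and do not refer to any American Express tooling.

```python
# Downtime budget implied by an availability target (illustrative sketch).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime, in minutes, allowed over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


def serial_availability_pct(*component_pcts: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100
    return result * 100


if __name__ == "__main__":
    # Five nines leaves roughly 5.26 minutes of downtime per year.
    print(round(downtime_budget_minutes(99.999), 2))
    # Two five-nines dependencies in series fall slightly below five nines.
    print(round(serial_availability_pct(99.999, 99.999), 5))
```

The second function shows why the requirement is stated per subsystem and per external call: availabilities multiply along a call chain, so each dependency needs to meet the target for the end-to-end service to stay close to it.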
Expected Impact
Achieve highest levels of production stability and service resilience
Deliver efficient incident and problem management, reducing repeat issues and enhancing end-user confidence in technology services
Maintain disciplined control of production changes, safeguarding business operations from risk and regulatory exposure
Enable robust DR preparedness and execution
Provide actionable insights, intelligence, and decision support for executive stakeholders, enabling data-driven investment in production assurance
Required Skills & Experience
10-12 years in IT operations, environment management, or infrastructure monitoring roles
Proficiency in monitoring tools (e.g., Dynatrace, AppDynamics); experience with SFTP, batch job monitoring, and network/DB health checks
Hands-on experience with ticketing platforms (ServiceNow, JIRA, or equivalent)
Working knowledge of cloud infrastructure (AWS/Azure), server environments, firewalls, and network security fundamentals
Exposure to scripting (Shell, Python, or PowerShell) for operational automation is preferred
Strong communication, analytical thinking, and ability to work in a 24x7 shift-based environment
We back you with benefits that support your holistic well-being so you can be and deliver your best. This means caring for you and your loved ones' physical, financial, and mental health, as well as providing the flexibility you need to thrive personally and professionally:
Competitive base salaries
Bonus incentives
Support for financial well-being and retirement
Comprehensive medical, dental, vision, life insurance, and disability benefits (depending on location)
Flexible working model with hybrid, onsite or virtual arrangements depending on role and business need
Generous paid parental leave policies (depending on your location)
Free access to global on-site wellness centers staffed with nurses and doctors (depending on location)
Free and confidential counseling support through our Healthy Minds program
Career development and training opportunities
American Express is an equal opportunity employer and makes employment decisions without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, disability status, age, or any other status protected by law.
Offer of employment with American Express is conditioned upon the successful completion of a background verification check, subject to applicable laws and regulations.
Job ID: 145515049