Senior Systems Operations Engineer - Production Support, SRE, ITIL

The Wells Fargo Foundation

Hyderabad, India

Fresher

Posted 19 hours ago

Job Description

About this role:

Wells Fargo is seeking a Senior Systems Operations Engineer.

In this role, you will:

Lead or participate in managing all installed systems and infrastructure within the Systems Operations functional area
Contribute to increasing system efficiency and reducing the human intervention time on related tasks
Review and analyze moderately complex operational support systems, application software, and system management tools to ensure the highest levels of systems and infrastructure availability
Work with vendors and other technical personnel for problem resolution
Lead the team to meet technical deliverables while leveraging a solid understanding of technical process controls or standards
Collaborate with vendors and other technical personnel to resolve technical issues and achieve highest levels of systems and infrastructure availability

Required Qualifications:

4+ years of Systems Engineering, Technology Architecture experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education

Desired Qualifications:

4+ years in Production Support / SRE / DevOps / Platform Operations for business-critical applications.
Proven track record supporting 24x7 platforms with strict SLAs and high availability requirements.
Experience working in ITIL-aligned environments (Incident, Problem, Change).
Strong troubleshooting skills across Linux/Unix, system processes, CPU/memory, threads, disk, network basics.
Working knowledge of application architectures: microservices, distributed systems, batch + online workloads.
Proficiency in log analysis and observability tools (e.g., Splunk/ELK, Grafana, Prometheus, AppDynamics, Dynatrace, or any equivalent).
Solid understanding of HTTP, TLS, DNS, load balancing, reverse proxy, and typical failure patterns (timeouts, 503/504, connection pool saturation).
Hands-on with databases (Oracle / Postgres / SQL Server etc.): query basics, locks, slow queries, connection pooling, indexing concepts.
Familiarity with messaging/streaming systems (Kafka/RabbitMQ) and troubleshooting lag/offset/consumer issues (good-to-have).
Ability to write scripts for automation in Python / Shell / PowerShell.
Comfortable with runbooks, automation tools, CI/CD basics, and reducing manual toil. Understanding of SLO/SLI, monitoring, alert tuning, and reliability best practices.
Strong incident handling skills: triage, mitigation, communication, and structured follow-through.
Knowledge of RCA techniques (5 Whys, fishbone, timeline-based analysis) and converting findings into preventive actions.
Experience with change management and release support; able to assess risk and enforce operational readiness.
Excellent written and verbal communication for stakeholder updates (technical + business-friendly). Ability to collaborate across Dev, QA, DBAs, Network, Cloud/Infra teams.
Calm under pressure, structured thinker, strong ownership. Bias for root-cause and prevention over repeated firefighting. High attention to detail and commitment to operational excellence.

Job Expectations:

Production Support & Incident Management - Provide L2 support for critical applications/services, including triage, diagnosis, mitigation, and recovery. Lead or co-lead troubleshooting of major incidents (P1/P2) and coordinate with relevant teams until service is restored. Maintain clear, timely incident communications (status updates, ETAs, impact, workarounds). Ensure incidents are properly documented with timeline, actions taken, and next steps.
Monitoring, Alerting & Observability - Own service monitoring hygiene: reduce noise, tune alerts, and improve signal quality. Use metrics/logs/traces to quickly isolate failure domains (application, infra, DB, network, dependencies). Build/maintain dashboards for service health, SLIs (latency, error rate, throughput), and batch completion tracking.
Problem Management & RCA - Drive root cause analysis for recurring issues and high-severity incidents. Convert RCAs into measurable outcomes: bug tickets, automation, monitoring improvements, capacity fixes, and operational controls. Track corrective actions to closure and measure reduction in repeat incidents.
Release, Change & Operational Readiness - Support production releases: validation, smoke checks, rollback readiness, and post-release monitoring. Review changes for operational risk and ensure runbooks, alarms, dashboards, and rollback plans are in place. Participate in CAB/change reviews where applicable.
Automation & Reliability Engineering - Identify repetitive manual tasks and deliver automation to reduce toil (e.g., health checks, remediation scripts, self-healing steps). Improve MTTR via better runbooks, automation, and faster diagnostics. Contribute to reliability engineering initiatives: capacity planning inputs, performance tuning, resilience testing (where applicable).
Knowledge Management & Documentation - Create and maintain operational documentation: runbooks / SOPs / troubleshooting guides, service dependency maps, and known error database entries. Ensure documentation is usable during incidents (step-based, verified, and current).
On-call / Shift Support - Participate in on-call rotation and/or shift-based coverage as per business requirements. Handle escalations from L1 and provide coaching to improve L1 resolution rates.

Posting End Date:

14 Apr 2026

We Value Equal Opportunity

Wells Fargo is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, status as a protected veteran, or any other legally protected characteristic.

Employees support our focus on building strong customer relationships balanced with a strong risk mitigating and compliance-driven culture which firmly establishes those disciplines as critical to the success of our customers and company. They are accountable for execution of all applicable risk programs (Credit, Market, Financial Crimes, Operational, Regulatory Compliance), which includes effectively following and adhering to applicable Wells Fargo policies and procedures, appropriately fulfilling risk and compliance obligations, timely and effective escalation and remediation of issues, and making sound risk decisions. There is emphasis on proactive monitoring, governance, risk identification and escalation, as well as making sound risk decisions commensurate with the business unit's risk appetite and all risk and compliance program requirements.

Candidates applying to job openings posted in Canada: Applications for employment are encouraged from all qualified candidates, including women, persons with disabilities, aboriginal peoples and visible minorities. Accommodation for applicants with disabilities is available upon request in connection with the recruitment process.

Applicants with Disabilities

To request a medical accommodation during the application or interview process, visit.

Drug and Alcohol Policy

Wells Fargo maintains a drug free workplace. Please see our Drug and Alcohol Policy to learn more.

Wells Fargo Recruitment and Hiring Requirements:

a. Third-Party recordings are prohibited unless authorized by Wells Fargo.

b. Wells Fargo requires you to directly represent your own experiences during the recruiting and hiring process.

More Info

Job Type:

Permanent Job

Industry:

IT / Computers - Software

Employment Type:

Full time

About Company

The Wells Fargo Foundation

Job Source: wd1.myworkdaysite.com

"Wells Fargo is committed to building an inclusive, sustainable recovery for all through a focus on opening pathways to economic advancement, championing quality, affordable homes, empowering small businesses to thrive, and enabling a just, low-carbon economy."

Job ID: 145673831
