About This Role
Wells Fargo is seeking a Principal Engineer - Generative AI (Gen AI) GPU Infrastructure Capabilities.
In This Role, You Will
- Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
- Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
- Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
- Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
- Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
- Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Required Qualifications
- 7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Desired Qualifications
- Design GPU cluster topologies (H100/H200, NVLink/NVSwitch), networking, and storage paths for high-throughput inferencing; document sizing and performance baselines.
- Implement Run:ai constructs (Collections/Departments/Projects/workloads) for MDEV/MDEP/UCEP/MRM; codify quota, priority, and fair-share policies.
- POC and benchmark disaggregated inferencing (prefill/decode) with vLLM/TensorRT-LLM; publish guidance for H100/H200 tuning (FP8/INT8/AWQ) and KV-transfer behavior over NVLink (see the benchmarking sketch after this list).
- Operationalize OpenShift AI parity for GPU scheduling, time-slicing/MIG profiles, and preemption; validate upgrade paths and Helm/Kustomize packaging.
- Integrate Triton Inference Server for multi-model serving; standardize model repository structure, batching, dynamic shapes, and telemetry hooks (see the client sketch after this list).
- Harden NGDC environments with AVI/GSLB patterns (Prod1/Prod2) and BCP; execute DR failover runbooks and steady-state capacity planning.
- Publish steady-state runbooks (deploy → certify → promote): DEV → UAT → MDEP-Beta → MDEP-GA / UCEP; define promotion criteria and risk exceptions.
- Own endpoint productionization via Apigee (AI Gateway): authN/Z, rate limiting, API SLAs, versioning/deprecation, and SDK generation for internal consumers (see the gateway client sketch after this list).
- Embed observability/evaluations with Overwatch + Arize: prompt/agent/tool tracing, SLO dashboards, alerting, and data-retention/export workflows (see the tracing sketch after this list).
- Automate CI/CD for infra and model artifacts: image scanning (JFrog remote repo), chart releases, canaries, and rollback plans across OCP/GKE.
- Tune CUDA kernels/graph execution paths; profile NCCL collectives; resolve performance bottlenecks (HBM bandwidth, kernel fusion, P2P comms); see the NCCL profiling sketch after this list.
- Qualify LLM/SLM runtimes (Gemma, Llama, GPT-OSS, etc.) with Run:ai scheduling; publish per-model recipes for throughput, latency, cost, and stability.
- Define GPU estate hygiene: image provenance, secrets handling, namespace/network policy baselines, and change controls for upgrades (e.g., Run:ai v2.21+).
- Partner with product/TPM/PO to align the backlog to platform milestones (OpenShift AI go-forward, SuperPOD activation waves, endpoint rollouts).
- Mentor engineers; lead deep-dive reviews and present in exec/tech forums (CIO/ARB/offsites) with architecture readouts, performance data, and risk mitigations.
- NVIDIA & CUDA: CUDA/cuDNN usage, NVLink/NVSwitch understanding, MIG setup, NCCL tuning, GPU profiling, H100/H200 optimization. Optimize kernels and collectives, choose MIG profiles, and validate interconnect bandwidth and NUMA/PCIe topology for LLM/SLM workloads (see the topology-inspection sketch after this list).
- LLM/SLM Runtimes: Work with vLLM, TensorRT-LLM, Triton; apply FP8/INT4 quantization; tune KV-cache strategies. Build POCs for disaggregated prefill/decode, standardize Triton repos, and optimize batching.
- Orchestration: Use Run:ai structures (Collections/Departments/Projects) and manage OCP/GKE environments. Implement GPU allocation patterns; enforce quotas, preemption, and fair-share scheduling.
- OpenShift AI: Configure RHOAI GPU scheduling and time-slicing, use Helm/Kustomize, and validate upgrades. Achieve platform parity, certify charts and policies, and ensure admission controls function reliably.
- API & Gateway: Apply Apigee authN/Z; manage quotas, rate limits, OpenAPI specs, SDK generation, and SLA operations. Productionize model endpoints, manage versioning and deprecation, and enforce gateway-level SLAs.
- Observability & Evaluation: Use Overwatch + Arize for tracing and evals, define SLOs, alerts, retention/export processes. Trace prompts/tools/agents, enforce data retention, publish standardized dashboards.
- CI/CD & Artifacts: Manage JFrog repos, image scanning, helm releases, canary/rollback workflows. Standardize artifact flow, automate safe promotions, ensure compliant releases.
- Performance Engineering: Model throughput/latency; optimize tokens/sec, batch shaping, and cache policies. Produce per-model performance recipes and tune cost/performance trade-offs for LLM/SLM (see the worked cost sketch after this list).
- Controls & SDLC: Apply JAD lite practices; manage change controls, secrets hygiene, and namespace/network policies. Maintain compliance across the GPU estate, ensuring full auditability and proper access boundaries.
- Communication: Create executive-friendly narratives, write architectures and runbooks, and present in forums. Deliver content in offsites/CIO forums and publish clear decision memos.
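Illustrative Sketches
The sketches below expand on several of the qualifications above. Each is a minimal, hedged example of the kind of work described, not a documented Wells Fargo implementation.
First, a throughput probe for the vLLM benchmarking work: a single-node run against an FP8-quantized checkpoint on a Hopper-class GPU. The model name is illustrative; FP8 support depends on the checkpoint and toolchain in use.
```python
# Minimal vLLM throughput probe; model name and FP8 support are assumptions.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

prompts = ["Summarize the key risks of GPU oversubscription."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to estimate decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```
A real benchmark would sweep batch sizes and sequence lengths and record the results as the sizing/performance baselines called out above.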
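For the Triton integration work, a minimal HTTP client call. The server URL, model name ("llm_ensemble"), and tensor names ("input_ids", "logits") are hypothetical placeholders; batching behavior itself lives server-side in each model's config.pbtxt.
```python
# Minimal Triton HTTP inference call; model and tensor names are illustrative.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", input_ids.shape, "INT64")
infer_input.set_data_from_numpy(input_ids)

# Dynamic batching and max batch size are configured in the model's
# config.pbtxt; the client simply submits individual requests.
result = client.infer(model_name="llm_ensemble", inputs=[infer_input])
print(result.as_numpy("logits").shape)
```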
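For the Apigee endpoint productionization work, a sketch of an internal consumer that honors gateway rate limits. The URL, header name, and payload schema are hypothetical placeholders, not Wells Fargo specifics; the pattern shown is standard 429 handling with Retry-After and exponential backoff.
```python
# Gateway client sketch; endpoint, auth header, and schema are assumptions.
import time

import requests

ENDPOINT = "https://api.example.internal/ai-gateway/v1/generate"  # placeholder
HEADERS = {"x-api-key": "REDACTED"}  # header name is an assumption

def call_with_backoff(payload: dict, retries: int = 5) -> dict:
    """POST to the gateway, honoring Retry-After on 429s with backoff."""
    delay = 1.0
    for _ in range(retries):
        resp = requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer the server-suggested wait; otherwise back off exponentially.
        time.sleep(float(resp.headers.get("Retry-After", delay)))
        delay *= 2
    raise RuntimeError("rate-limited after all retries")

print(call_with_backoff({"prompt": "hello", "max_tokens": 32}))
```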
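For the prompt/agent/tool tracing work, a generic OpenTelemetry sketch. The Overwatch/Arize integration details are not in the posting, so the ConsoleSpanExporter stands in for whatever exporter that ingestion path would use; span and attribute names are illustrative.
```python
# Generic prompt/tool tracing sketch with OpenTelemetry; exporter and
# attribute names are placeholders for the real Overwatch/Arize wiring.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.serving")

with tracer.start_as_current_span("prompt") as span:
    span.set_attribute("llm.model", "example-model")  # illustrative key
    with tracer.start_as_current_span("tool_call") as tool:
        tool.set_attribute("tool.name", "retriever")
        # ... invoke the tool and record latency/outcome on the span ...
```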
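For the NCCL profiling work, a rough all-reduce bandwidth probe using PyTorch's NCCL backend, launched with torchrun (e.g., `torchrun --nproc_per_node=8 nccl_probe.py`). The message size and iteration counts are illustrative; the bus-bandwidth formula follows the nccl-tests convention.
```python
# All-reduce bandwidth probe over NCCL; sizes and iteration counts are
# illustrative, launched via torchrun which sets LOCAL_RANK.
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

tensor = torch.ones(64 * 1024 * 1024, dtype=torch.float32, device="cuda")  # 256 MiB

for _ in range(5):  # warm up before timing
    dist.all_reduce(tensor)

# Time with CUDA events to capture device-side latency, not host overhead.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(20):
    dist.all_reduce(tensor)
end.record()
torch.cuda.synchronize()

sec = start.elapsed_time(end) / 1000 / 20  # elapsed_time returns milliseconds
bytes_moved = tensor.numel() * 4
busbw = (bytes_moved / sec) * 2 * (world - 1) / world / 1e9  # nccl-tests formula
if rank == 0:
    print(f"all_reduce 256 MiB: {sec * 1000:.2f} ms, ~{busbw:.0f} GB/s bus bandwidth")

dist.destroy_process_group()
```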
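For the MIG and interconnect validation work, a topology-inspection sketch using NVML (pip install nvidia-ml-py). It reports per-GPU memory, MIG mode, and active NVLink links; exact field availability varies by driver and GPU, so treat it as a starting point rather than a certified validator.
```python
# GPU estate inventory via NVML: memory, MIG mode, and NVLink link state.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    try:
        mig_current, _pending = pynvml.nvmlDeviceGetMigMode(h)
    except pynvml.NVMLError:
        mig_current = "unsupported"  # pre-Ampere or MIG-incapable parts
    active_links = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(h, link) == pynvml.NVML_FEATURE_ENABLED:
                active_links += 1
        except pynvml.NVMLError:
            break  # no more links on this device
    print(f"GPU{i} {name}: {mem.total / 2**30:.0f} GiB, "
          f"MIG={mig_current}, NVLink links={active_links}")
pynvml.nvmlShutdown()
```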
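Finally, for the cost/performance trade-off work, a worked cost sketch tying measured throughput to $/1M tokens. The GPU-hour rate and throughput figures are placeholders to illustrate the arithmetic, not measured H100/H200 results.
```python
# Back-of-envelope serving cost model; all input numbers are placeholders.
def cost_per_million_tokens(tokens_per_sec: float, gpu_hour_usd: float,
                            num_gpus: int) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return (gpu_hour_usd * num_gpus) / tokens_per_hour * 1_000_000

# e.g., 8 GPUs at $4/GPU-hour sustaining 20k tok/s -> ~$0.44 per 1M tokens
print(f"${cost_per_million_tokens(20_000, 4.0, 8):.2f} per 1M tokens")
```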
Reference Number
R-516638