About This Role
Wells Fargo is seeking a Principal Engineer - Generative AI (Gen AI) GPU Infrastructure Capabilities.
In This Role, You Will
- Act as an advisor to leadership to develop or influence applications, network, information security, database, operating systems, or web technologies for highly complex business and technical needs across multiple groups
- Lead the strategy and resolution of highly complex and unique challenges requiring in-depth evaluation across multiple areas or the enterprise, delivering solutions that are long-term, large-scale and require vision, creativity, innovation, advanced analytical and inductive thinking
- Translate advanced technology experience, an in-depth knowledge of the organization's tactical and strategic business objectives, the enterprise technological environment, the organization structure, and strategic technological opportunities and requirements into technical engineering solutions
- Provide vision, direction and expertise to leadership on implementing innovative and significant business solutions
- Maintain knowledge of industry best practices and new technologies and recommend innovations that enhance operations or provide a competitive advantage to the organization
- Strategically engage with all levels of professionals and managers across the enterprise and serve as an expert advisor to leadership
Required Qualifications
- 7+ years of Engineering experience, or equivalent demonstrated through one or a combination of the following: work experience, training, military experience, education
Desired Qualifications
- Design GPU cluster topologies (H100/H200, NVLink/NVSwitch), networking, and storage paths for high-throughput inferencing; document sizing and performance baselines.
- Implement Run:ai constructs (Collections/Departments/Projects/workloads) for MDEV/MDEP/UCEP/MRM; codify quota, priority, and fair-share policies.
- POC and benchmark disaggregated inferencing (prefill/decode) with vLLM/TensorRT-LLM; publish guidance for H100/H200 tuning (FP8/INT8/AWQ) and KV-transfer behavior over NVLink (see the benchmarking sketch after this list).
- Operationalize OpenShift AI parity for GPU scheduling, time-slicing/MIG profiles, and preemption; validate upgrade paths and Helm/Kustomize packaging.
- Integrate Triton Inference Server for multi-model serving; standardize model repository structure, batching, dynamic shapes, and telemetry hooks (see the client sketch after this list).
- Harden NGDC environments with AVI/GSLB patterns (Prod1/Prod2) and BCP; execute DR failover runbooks and steady-state capacity planning.
- Publish steady-state runbooks (deploy → certify → promote): DEV → UAT → MDEP-Beta → MDEP-GA / UCEP; define promotion criteria and risk exceptions.
- Own endpoint productionization via Apigee (AI Gateway): authN/Z, rate limiting, API SLAs, versioning/deprecation, and SDK generation for internal consumers (see the gateway client sketch after this list).
- Embed observability/evaluations with Overwatch + Arize: prompt/agent/tool tracing, SLO dashboards, alerting, and data-retention/export workflows (see the tracing sketch after this list).
- Automate CI/CD for infra and model artifacts: image scanning (JFrog remote repo), chart releases, canaries, and rollback plans across OCP/GKE.
- Tune CUDA kernels/graph execution paths; profile NCCL collectives; resolve performance bottlenecks (HBM bandwidth, kernel fusion, P2P comms); see the NCCL profiling sketch after this list.
- Qualify LLM/SLM runtimes (Gemma, Llama, GPT-OSS, etc.) with Run:ai scheduling; publish per-model recipes for throughput, latency, cost, and stability.
- Define GPU estate hygiene: image provenance, secrets handling, namespace/network policy baselines, and change controls for upgrades (e.g., Run:ai v2.21+).
- Partner with product/TPM/PO to align the backlog to platform milestones (OpenShift AI go-forward, SuperPOD activation waves, endpoint rollouts).
- Mentor engineers; lead deep-dive reviews and present in exec/tech forums (CIO/ARB/offsites) with architecture readouts, performance data, and risk mitigations.
- NVIDIA & CUDA: CUDA/cuDNN usage, NVLink/NVSwitch understanding, MIG setup, NCCL tuning, GPU profiling, H100/H200 optimization. Optimize kernels and collectives, choose MIG profiles, and validate interconnect bandwidth and NUMA/PCIe topology for LLM/SLM workloads (see the topology-inspection sketch after this list).
- LLM/SLM Runtimes: Work with vLLM, TensorRT-LLM, Triton; apply FP8/INT4 quantization; tune KV-cache strategies. Build POCs for disaggregated prefill/decode, standardize Triton repos, and optimize batching.
- Orchestration: Use Run:ai structures (Collections/Departments/Projects) and manage OCP/GKE environments. Implement GPU allocation patterns; enforce quotas, preemption, and fair-share scheduling.
- OpenShift AI: Configure RHOAI GPU scheduling and time-slicing, use Helm/Kustomize, and validate upgrades. Achieve platform parity, certify charts and policies, and ensure admission controls function reliably.
- API & Gateway: Apply Apigee authN/Z; manage quotas, rate limits, OpenAPI specs, SDK generation, and SLA operations. Productionize model endpoints, manage versioning and deprecation, and enforce gateway-level SLAs.
- Observability & Evaluation: Use Overwatch + Arize for tracing and evals, define SLOs, alerts, retention/export processes. Trace prompts/tools/agents, enforce data retention, publish standardized dashboards.
- CI/CD & Artifacts: Manage JFrog repos, image scanning, helm releases, canary/rollback workflows. Standardize artifact flow, automate safe promotions, ensure compliant releases.
- Performance Engineering: Model throughput/latency; optimize tokens/sec, batch shaping, and cache policies. Produce per-model performance recipes and tune cost/performance trade-offs for LLM/SLM (see the worked cost sketch after this list).
- Controls & SDLC: Apply JAD lite practices; manage change controls, secrets hygiene, and namespace/network policies. Maintain compliance across the GPU estate, ensuring full auditability and proper access boundaries.
- Communication: Create executive-friendly narratives, write architectures and runbooks, and present in forums. Deliver content in offsites/CIO forums and publish clear decision memos.
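Illustrative Sketches
The sketches below expand on several of the qualifications above. Each is a minimal, hedged example of the kind of work described, not a documented Wells Fargo implementation.
First, a throughput probe for the vLLM benchmarking work: a single-node run against an FP8-quantized checkpoint on a Hopper-class GPU. The model name is illustrative; FP8 support depends on the checkpoint and toolchain in use.
```python
# Minimal vLLM throughput probe; model name and FP8 support are assumptions.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", quantization="fp8")

prompts = ["Summarize the key risks of GPU oversubscription."] * 64
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests to estimate decode throughput.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.0f} tok/s")
```
A real benchmark would sweep batch sizes and sequence lengths and record the results as the sizing/performance baselines called out above.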
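For the Triton integration work, a minimal HTTP client call. The server URL, model name ("llm_ensemble"), and tensor names ("input_ids", "logits") are hypothetical placeholders; batching behavior itself lives server-side in each model's config.pbtxt.
```python
# Minimal Triton HTTP inference call; model and tensor names are illustrative.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
infer_input = httpclient.InferInput("input_ids", input_ids.shape, "INT64")
infer_input.set_data_from_numpy(input_ids)

# Dynamic batching and max batch size are configured in the model's
# config.pbtxt; the client simply submits individual requests.
result = client.infer(model_name="llm_ensemble", inputs=[infer_input])
print(result.as_numpy("logits").shape)
```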
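For the Apigee endpoint productionization work, a sketch of an internal consumer that honors gateway rate limits. The URL, header name, and payload schema are hypothetical placeholders, not Wells Fargo specifics; the pattern shown is standard 429 handling with Retry-After and exponential backoff.
```python
# Gateway client sketch; endpoint, auth header, and schema are assumptions.
import time

import requests

ENDPOINT = "https://api.example.internal/ai-gateway/v1/generate"  # placeholder
HEADERS = {"x-api-key": "REDACTED"}  # header name is an assumption

def call_with_backoff(payload: dict, retries: int = 5) -> dict:
    """POST to the gateway, honoring Retry-After on 429s with backoff."""
    delay = 1.0
    for _ in range(retries):
        resp = requests.post(ENDPOINT, json=payload, headers=HEADERS, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Prefer the server-suggested wait; otherwise back off exponentially.
        time.sleep(float(resp.headers.get("Retry-After", delay)))
        delay *= 2
    raise RuntimeError("rate-limited after all retries")

print(call_with_backoff({"prompt": "hello", "max_tokens": 32}))
```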
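For the prompt/agent/tool tracing work, a generic OpenTelemetry sketch. The Overwatch/Arize integration details are not in the posting, so the ConsoleSpanExporter stands in for whatever exporter that ingestion path would use; span and attribute names are illustrative.
```python
# Generic prompt/tool tracing sketch with OpenTelemetry; exporter and
# attribute names are placeholders for the real Overwatch/Arize wiring.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.serving")

with tracer.start_as_current_span("prompt") as span:
    span.set_attribute("llm.model", "example-model")  # illustrative key
    with tracer.start_as_current_span("tool_call") as tool:
        tool.set_attribute("tool.name", "retriever")
        # ... invoke the tool and record latency/outcome on the span ...
```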
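For the NCCL profiling work, a rough all-reduce bandwidth probe using PyTorch's NCCL backend, launched with torchrun (e.g., `torchrun --nproc_per_node=8 nccl_probe.py`). The message size and iteration counts are illustrative; the bus-bandwidth formula follows the nccl-tests convention.
```python
# All-reduce bandwidth probe over NCCL; sizes and iteration counts are
# illustrative, launched via torchrun which sets LOCAL_RANK.
import os

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

tensor = torch.ones(64 * 1024 * 1024, dtype=torch.float32, device="cuda")  # 256 MiB

for _ in range(5):  # warm up before timing
    dist.all_reduce(tensor)

# Time with CUDA events to capture device-side latency, not host overhead.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(20):
    dist.all_reduce(tensor)
end.record()
torch.cuda.synchronize()

sec = start.elapsed_time(end) / 1000 / 20  # elapsed_time returns milliseconds
bytes_moved = tensor.numel() * 4
busbw = (bytes_moved / sec) * 2 * (world - 1) / world / 1e9  # nccl-tests formula
if rank == 0:
    print(f"all_reduce 256 MiB: {sec * 1000:.2f} ms, ~{busbw:.0f} GB/s bus bandwidth")

dist.destroy_process_group()
```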
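For the MIG and interconnect validation work, a topology-inspection sketch using NVML (pip install nvidia-ml-py). It reports per-GPU memory, MIG mode, and active NVLink links; exact field availability varies by driver and GPU, so treat it as a starting point rather than a certified validator.
```python
# GPU estate inventory via NVML: memory, MIG mode, and NVLink link state.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    mem = pynvml.nvmlDeviceGetMemoryInfo(h)
    try:
        mig_current, _pending = pynvml.nvmlDeviceGetMigMode(h)
    except pynvml.NVMLError:
        mig_current = "unsupported"  # pre-Ampere or MIG-incapable parts
    active_links = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(h, link) == pynvml.NVML_FEATURE_ENABLED:
                active_links += 1
        except pynvml.NVMLError:
            break  # no more links on this device
    print(f"GPU{i} {name}: {mem.total / 2**30:.0f} GiB, "
          f"MIG={mig_current}, NVLink links={active_links}")
pynvml.nvmlShutdown()
```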
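Finally, for the cost/performance trade-off work, a worked cost sketch tying measured throughput to $/1M tokens. The GPU-hour rate and throughput figures are placeholders to illustrate the arithmetic, not measured H100/H200 results.
```python
# Back-of-envelope serving cost model; all input numbers are placeholders.
def cost_per_million_tokens(tokens_per_sec: float, gpu_hour_usd: float,
                            num_gpus: int) -> float:
    tokens_per_hour = tokens_per_sec * 3600
    return (gpu_hour_usd * num_gpus) / tokens_per_hour * 1_000_000

# e.g., 8 GPUs at $4/GPU-hour sustaining 20k tok/s -> ~$0.44 per 1M tokens
print(f"${cost_per_million_tokens(20_000, 4.0, 8):.2f} per 1M tokens")
```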
Reference Number
R-516638