Hewlett Packard Enterprise

PCAI CoE Engineer

Job Description

This role has been designed as Hybrid with an expectation that you will work on average 2-3 days per week from an HPE office.

Who We Are:

Hewlett Packard Enterprise is the global edge-to-cloud company advancing the way people live and work. We help companies connect, protect, analyze, and act on their data and applications wherever they live, from edge to cloud, so they can turn insights into outcomes at the speed required to thrive in today's complex world. Our culture thrives on finding new and better ways to accelerate what's next. We know diverse backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good. If you are looking to stretch and grow your career, our culture will embrace you. Open up opportunities with HPE.

Job Description:

HPE Operations is our innovative IT services organization. It provides the expertise to advise, integrate, and accelerate our customers' outcomes from their digital transformation. Our teams collaborate to transform insight into innovation. In today's fast-paced, hybrid IT world, being at business speed means overcoming IT complexity to match the speed of actions to the speed of opportunities. Deploy the right technology to respond quickly to market possibilities. Join us and redefine what's next for you.

What you will do

The HPE GSR Private Cloud AI Center of Excellence (PCAI CoE) Engineer is responsible for providing exceptional customer support for our enterprise and mission-critical customers of HPE PCAI Solutions. The PCAI CoE Engineer has deep knowledge of NVIDIA GPUs, CUDA programming, and NVIDIA's AI software stack, as well as a good understanding of container platforms and the underlying infrastructure. The engineer works closely with data scientists, software engineers, and product teams to implement AI workflows and high-performance computing tasks efficiently. The PCAI CoE Engineer assumes technical ownership of each issue and works with the various HPE Engineering teams involved in building the solution while resolving complex customer issues.

Key responsibilities of a PCAI CoE Engineer may include:

  • Resolution of complex problems for enterprise and mission-critical customers of PCAI
  • Assess and understand technical issues and their impact on the customer's business, and manage technical communication with customers and other stakeholders.
  • Providing leadership in complex technical problem management, working closely with end customers and HPE remote and field support staff
  • Identifying and resolving customer issues, particularly with NVIDIA GPUs and related infrastructure components critical to AI processing.
  • Troubleshooting and resolving issues with AI workflows leveraging CUDA, cuDNN, TensorRT, and NVIDIA Deep Learning frameworks.
  • Provide technical expertise in deep learning model deployment using NVIDIA GPUs in cloud and on-premises environments.
  • Work with stakeholders to identify opportunities for leveraging AI technologies to drive business solutions.
  • Troubleshoot and resolve performance bottlenecks in AI workflows involving NVIDIA GPU acceleration.
  • Technical support of Kubernetes (K8s) environments, with working knowledge of AI tooling such as Apache Airflow for data orchestration and MLflow for machine learning lifecycle management.
  • Troubleshoot and optimize AI applications and infrastructure for maximum efficiency and minimal downtime.
  • Fault isolation, problem reproduction, interaction with the engineering, QA, and development teams, and escalation and elevation management
  • Development of knowledge content and runbooks

Knowledge & Skills Required

AI/ML Skills

Good understanding of AI/ML and Analytics applications such as

  • Kubeflow and MLflow, or any other MLOps tool
  • Apache Spark and Superset
  • Ray, Feast, EzPresto data source
  • Data Lake
  • MLOps frameworks
  • MLDE (Determined AI) - optional
  • NVIDIA AI Enterprise NIM Microservices, Models, LLM, CUDA
  • NVIDIA Neural Modules (NeMo) - optional

Excellent knowledge of the following platform components

  • Linux operating system (RHEL 8/Rocky/Ubuntu/CentOS/SUSE)
  • Kubernetes, container runtimes, container networking, and creating Docker images
  • Troubleshooting K8s cluster issues
  • Ezmeral-specific Kubernetes: ezkube, ezfab etc.
  • Single Sign-on and IAM
  • Postgres database - optional
  • Helm, Istio (service mesh), and SPIRE
  • Storage and CSI, CNI, operators (File Storage/ Block Storage/Object Storage)
  • Troubleshooting experience on CSI & CNI
  • Container-based storage access protocols

NVIDIA GPU, NVIDIA AI, and related software

  • Good knowledge of GPU technologies, NVIDIA GPU Operator, and NVIDIA vGPU technology
  • Strong GPU understanding and troubleshooting skills at the HW, OS, SW, and application layers.
  • Experience with NVIDIA SDKs (e.g., DeepStream, Jetson, etc.) and GPU performance tuning.
  • Experience with NVIDIA Jetson for edge AI development.
  • Knowledge of MLOps and experience with AI model deployment pipelines.
  • Familiarity with containerization and deployment using Docker and Kubernetes on GPU-powered systems.
  • Familiarity with NVIDIA's AI software stack, including Triton Inference Server, NVIDIA Clara, and NVIDIA Isaac, and with scalable AI workflows leveraging CUDA, cuDNN, TensorRT, and NVIDIA Deep Learning frameworks.
  • Experience with cloud platforms such as AWS, Azure, or Google Cloud for NVIDIA GPU-based AI model deployment.
  • Performance profiling, tuning, and optimization of AI applications on NVIDIA GPUs.

OS, Networking & Virtualization

Excellent understanding of

  • VMware vCenter and ESXi 7 and 8, switching, Layer 2 networking, and cluster management (HA, DRS)
  • Storage access protocols for VMware
  • Content libraries and OVA/Template management & deployment
  • Qumless OS
  • VMware VMFS datastore management
  • Knowledge of VMware standard vSwitches, VMkernel interfaces, and VDS would be a bonus
  • Experience with NFS storage configuration and troubleshooting is desired

Other skills

  • Good knowledge of and hands-on experience with at least two Linux distributions, such as RHEL, SLES, Ubuntu, or Debian.
  • Knowledge and experience with Linux System Administration, package management, scheduling, boot procedures/troubleshooting, performance optimization, and networking concepts.
  • Windows AD administration (user management for EZ authentication integration)
  • IPv6 and SLAAC

Common skills and qualifications

  • Education: A bachelor's or master's degree in computer science, information technology, or a related field is preferred.
  • Problem-Solving Skills: Excellent problem-solving skills and the ability to diagnose and resolve complex technical issues.
  • Communication Skills: Effective communication skills to collaborate with other teams, including development, security, and compliance teams.
  • Collaboration Skills: The ability to work effectively in a team environment and to coordinate efforts with other teams to resolve issues and implement new solutions.
  • IT Service Management Experience: Familiarity with IT service management (ITSM) frameworks, such as ITIL, and experience with incident, problem, and change management processes.

What We Can Offer You:

Health & Wellbeing

We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.

Personal & Professional Development

We also invest in your career because the better you are, the better we all are. We have specific programs catered to helping you reach any career goals you have - whether you want to become a knowledge expert in your field or apply your skills to another division.

Diversity, Inclusion & Belonging

We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know diverse backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.

Let's Stay Connected:

Follow on Instagram to see the latest on people, culture and tech at HPE.

#india

#operations

Job:

Services

Job Level:

TCP_04

HPE is an Equal Employment Opportunity/Veterans/Disabled/LGBT and Affirmative Action employer. We are committed to diversity and building a team that represents a variety of backgrounds, perspectives, and skills. We do not discriminate and all decisions we make are made on the basis of qualifications, merit, and business need. Our goal is to be one global diverse team that is representative of our customers, in an inclusive environment where we can continue to innovate and grow together.

Hewlett Packard Enterprise is EEO F/M/Protected Veteran/Individual with Disabilities.

HPE will comply with all applicable laws related to employer use of arrest and conviction records, including laws requiring employers to consider for employment qualified applicants with criminal histories.

More Info

Function: Technology

Job Type: Permanent Job

Date Posted: 10/10/2024

Job ID: 95751157

Last Updated: 08-11-2024 06:05:23 AM