C++ Developer || AI || HPC || Bangalore

ksa inc

Bengaluru, India

3-7 Years

Job Description

We are seeking an experienced C++ AI Inference Engineer to design, optimize, and deploy high-performance AI inference engines using modern C++ and processor-specific optimizations. You will collaborate with research teams to productionize cutting-edge AI model architectures for CPU-based inference.

Key Responsibilities

- Collaborate with research teams to understand AI model architectures and requirements
- Design and implement AI model inference pipelines using C++17/20 and SIMD intrinsics (AVX2/AVX-512)
- Optimize cache hierarchy usage, NUMA-aware memory allocation, and matrix multiplication (GEMM) kernels
- Develop operator fusion techniques and CPU inference engines for production workloads
- Write production-grade, thread-safe C++ code with comprehensive unit testing
- Profile and debug performance using Linux tools (perf, VTune, flame graphs)
- Conduct code reviews and ensure compliance with coding standards
- Stay current with HPC, OpenMP, and modern C++ best practices

Required Technical Skills

Core Requirements:

- Modern C++ (C++17/20) with smart pointers, coroutines, and concepts
- SIMD intrinsics: AVX2 required, AVX-512 strongly preferred
- Cache optimization: L1/L2/L3 prefetching and locality awareness
- NUMA-aware programming for multi-socket systems
- GEMM/blocked matrix multiplication kernel implementation
- OpenMP 5.0+ for parallel computing
- Linux performance profiling and analysis (perf, Valgrind, sanitizers)

Strongly Desired

- High-performance AI inference engine development
- Operator fusion and kernel fusion techniques
- HPC (High-Performance Computing) experience
- Memory management and allocation optimization

Qualifications

- Bachelor's/Master's in Computer Science, Electrical Engineering, or a related field
- 3-7+ years of proven C++ development experience
- Linux/Unix expertise with strong debugging skills
- Familiarity with linear algebra, numerical methods, and performance analysis
- Experience with multi-threading, concurrency, and memory management
- Strong problem-solving and analytical abilities

Preferred Qualifications

- Knowledge of PyTorch/TensorFlow C++ backends
- Real-time systems or embedded systems background
- ARM SVE, RISC-V vector extensions, or Intel ISPC experience

What You Will Work On

- Production-grade AI inference libraries powering LLMs and vision models
- CPU-optimized inference pipelines for sub-millisecond latency
- Cross-platform deployment across Intel Xeon, AMD EPYC, and ARM architectures
- Performance optimizations reducing inference costs by 3-5x

Skills: multithreading, SIMD, C++, high-performance computing (HPC)