GPU Kernel Optimization

Parent: ai-inference-engines
Related: inferact, radixark

Overview

GPU kernel optimization is the discipline of improving the efficiency of low-level compute kernels that run on GPU hardware during model inference. It's the foundational layer beneath inference engines like vLLM and SGLang.

How GPUs Run Models

When model.generate() is called, the GPU executes hundreds to thousands of kernels: small parallel functions such as the following (see the profiler sketch after this list):

  • GEMM (General Matrix Multiplication) — the heavy lifter
  • FlashAttention — efficient attention computation
  • LayerNorm — normalization
  • Activation functions (ReLU, GELU, SwiGLU)
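
A rough way to see these kernels is to profile a small forward pass. The sketch below assumes PyTorch with a CUDA device; the three-layer toy model is only a stand-in for a real transformer block:

    import torch
    from torch.profiler import profile, ProfilerActivity

    # Toy stand-in for one transformer sub-block: GEMM, activation, normalization.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),   # GEMM kernel
        torch.nn.GELU(),               # activation kernel
        torch.nn.LayerNorm(4096),      # normalization kernel
    ).cuda().half()
    x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)

    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        model(x)

    # Each row of the table corresponds to a distinct GPU kernel launch.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))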

Bottleneck insight: during autoregressive decoding, GPUs are rarely compute-bound; they are usually memory-bandwidth-bound. The core optimization challenge is minimizing data movement between HBM (High Bandwidth Memory) and the compute units.
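
A back-of-the-envelope arithmetic-intensity check makes this concrete. The numbers below are illustrative (a single decode-step matrix-vector product against a 4096x4096 FP16 weight matrix, on A100-class hardware):

    # One decode token multiplied against a 4096x4096 FP16 weight matrix.
    d = 4096
    flops = 2 * d * d                # multiply-accumulates: ~33.5 MFLOPs
    bytes_moved = d * d * 2          # the FP16 weights alone: ~33.5 MB read from HBM
    intensity = flops / bytes_moved  # ~1 FLOP per byte

    # An A100 offers roughly 312 TFLOP/s FP16 against ~2 TB/s of HBM bandwidth,
    # so it needs on the order of 150 FLOPs per byte to stay compute-bound.
    # At ~1 FLOP/byte, the kernel spends nearly all its time waiting on memory.
    print(f"arithmetic intensity: {intensity:.1f} FLOPs/byte")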

Key Optimization Techniques

1. Memory Access Optimization

  • Kernel fusion: Combine multiple operations into a single kernel to reduce HBM reads/writes (see the fusion sketch after this list)
  • Triton compiler: Write fused kernels in Python-like syntax and compile them to optimized GPU machine code
  • FlashAttention: IO-aware attention that reduces HBM traffic by ~10x
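
A minimal fusion sketch, assuming Triton and a CUDA device are available: a bias-add and a ReLU are fused into one kernel, so the intermediate tensor stays in registers instead of making a round trip through HBM. The kernel and helper names are illustrative, and x and b are assumed to be same-shape, contiguous CUDA tensors:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def fused_bias_relu_kernel(x_ptr, b_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        # Each program instance handles one BLOCK-sized chunk of elements.
        pid = tl.program_id(axis=0)
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements
        x = tl.load(x_ptr + offsets, mask=mask)
        b = tl.load(b_ptr + offsets, mask=mask)
        # Bias-add and ReLU happen back to back in registers: one HBM read per
        # input and one HBM write for the result, with no intermediate buffer.
        y = tl.maximum(x + b, 0.0)
        tl.store(out_ptr + offsets, y, mask=mask)

    def fused_bias_relu(x: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        n = x.numel()
        grid = (triton.cdiv(n, 1024),)
        fused_bias_relu_kernel[grid](x, b, out, n, BLOCK=1024)
        return out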

2. Quantization Kernels

Format       | Precision   | Speedup vs FP16 | Quality Impact
INT8         | 8-bit       | ~2x             | Minimal
INT4         | 4-bit       | ~4x             | Moderate
FP8 (Hopper) | 8-bit float | ~2-3x           | Very low
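
The arithmetic a quantization kernel implements is simple; the hard part is fusing the dequantize step into the GEMM itself. A minimal host-side sketch of per-tensor symmetric INT8 quantization (illustrative only, not a fused GPU kernel):

    import torch

    def quantize_int8(w: torch.Tensor):
        # Map [-max|w|, +max|w|] onto the signed 8-bit range [-127, 127].
        scale = w.abs().max() / 127.0
        q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
        return q, scale

    def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        return q.to(torch.float32) * scale

    w = torch.randn(4096, 4096)
    q, scale = quantize_int8(w)
    w_hat = dequantize_int8(q, scale)
    # INT8 storage halves memory traffic vs FP16; for well-behaved weight
    # distributions the reconstruction error stays small.
    print((w - w_hat).abs().max())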

3. Tensor Parallelism Kernels

  • Split model weights across multiple GPUs (see the sketch after this list)
  • NVLink/NVSwitch for high-bandwidth inter-GPU communication
  • Critical for 70B+ models that don't fit on single GPU
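
A minimal sketch of the underlying math, simulated in a single process (a real implementation shards across devices and replaces the final sum with an NCCL all-reduce over NVLink; the 4-way split below is illustrative):

    import torch

    tp = 4                       # hypothetical tensor-parallel degree
    x = torch.randn(8, 1024)     # activations: batch x hidden
    w = torch.randn(1024, 4096)  # weight matrix of a linear layer

    # Row-parallel split: each "GPU" holds a slice of W's rows plus the matching
    # slice of the activations and computes a partial product.
    x_shards = x.chunk(tp, dim=1)
    w_shards = w.chunk(tp, dim=0)
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

    # Summing the partials stands in for the all-reduce across GPUs.
    y = torch.stack(partials).sum(dim=0)
    assert torch.allclose(y, x @ w, atol=1e-3)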

4. FlashDecoding++ (无问芯穹 / Infinigence AI)

Novel kernel-level optimization for the decode phase:

  • vs Hugging Face: 4.86x speedup (A100)
  • vs FlashDecoding: 1.37x speedup
  • vs vLLM: 1.24x speedup in decode stage

Inference Engine Comparison (Kernel-Level)

Engine       | Quantization           | Fusion    | Multi-GPU  | Custom Kernels
vLLM         | AWQ, GPTQ, GGML        | Yes       | TP, PP     | PagedAttention
SGLang       | AWQ, GPTQ              | Yes       | TP, PP, EP | RadixAttention
LMDeploy     | INT4, INT8 (TurboMind) | Yes (C++) | TP         | TurboMind engine
TensorRT-LLM | FP8, INT8              | Yes       | TP, PP     | Highly optimized
Inferact     | (uses vLLM)            | Yes       | Yes        | vLLM kernel stack
RadixArk     | (uses SGLang)          | Yes       | Yes        | SGLang kernel stack
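
At the user level, these kernel-level choices surface as engine configuration. A short sketch using vLLM's offline API (the model name and values are placeholders; parameter names may vary slightly across vLLM versions):

    from vllm import LLM, SamplingParams

    # quantization selects the AWQ INT4 kernels; tensor_parallel_size shards the
    # weights across two GPUs, engaging the tensor-parallel GEMM + all-reduce path.
    llm = LLM(
        model="TheBloke/Llama-2-13B-AWQ",   # placeholder AWQ checkpoint
        quantization="awq",
        tensor_parallel_size=2,
    )
    outputs = llm.generate(
        ["GPU kernels are"],
        SamplingParams(max_tokens=32),
    )
    print(outputs[0].outputs[0].text)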

Why It Matters

  • H100 cluster cost: ~$3-5/hr per GPU
  • Kernel optimization translates directly into lower inference cost per token
  • For deployments serving 1B+ tokens/day, a 2x kernel speedup can save millions of dollars annually (see the back-of-the-envelope calculation below)
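
A back-of-the-envelope sketch with hypothetical numbers (a 256-GPU H100 deployment at $4/GPU-hour; real figures depend on utilization, batch sizes, and workload mix):

    gpus = 256
    price_per_gpu_hour = 4.0                              # USD, mid-range of ~$3-5/hr
    baseline_daily_cost = gpus * price_per_gpu_hour * 24  # ~$24.6K per day
    speedup = 2.0                                         # 2x faster kernels -> half the GPU-hours per token
    optimized_daily_cost = baseline_daily_cost / speedup
    annual_savings = (baseline_daily_cost - optimized_daily_cost) * 365
    print(f"${annual_savings:,.0f} saved per year")       # ~$4.5M at this cluster size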

Sources

  • vLLM paper (PagedAttention)
  • FlashAttention paper (Dao et al.)
  • FlashDecoding++ technical disclosures (无问芯穹 / Infinigence AI)