GPU Kernel Optimization
Parent: ai-inference-engines
Related: inferact, radixark
Overview
GPU kernel optimization is the discipline of improving the efficiency of low-level compute kernels that run on GPU hardware during model inference. It's the foundational layer beneath inference engines like vLLM and SGLang.
How GPUs Run Models
When model.generate() is called, the GPU executes hundreds to thousands of kernels — small parallel functions such as:
- GEMM (General Matrix Multiplication) — the heavy lifter
- FlashAttention — efficient attention computation
- LayerNorm — normalization
- Activation functions (ReLU, GELU, SwiGLU)
Bottleneck insight: GPUs are rarely compute-bound — they're usually memory bandwidth-bound. The core optimization challenge is minimizing data movement between HBM (High Bandwidth Memory) and compute units.
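The memory-bound vs compute-bound distinction above can be made concrete with a rough roofline check: compare an op's arithmetic intensity (FLOPs per byte of HBM traffic) against the machine balance of the GPU. The peak-FLOPs and bandwidth figures below are approximate A100-class numbers used for illustration, and the 8-FLOP cost per GELU element is an assumption:

```python
# Rough roofline check: is an op compute-bound or memory-bound?
# Approximate A100-class figures (illustrative, not exact vendor specs):
PEAK_FLOPS = 312e12          # FP16 tensor-core peak, FLOP/s
HBM_BW = 2.0e12              # HBM bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / HBM_BW  # FLOPs per byte needed to stay busy

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of HBM traffic."""
    return flops / bytes_moved

n = 4096 * 4096
# Elementwise GELU on an FP16 tensor: assume ~8 FLOPs/element,
# one 2-byte read plus one 2-byte write per element.
ew = arithmetic_intensity(flops=8 * n, bytes_moved=4 * n)
# GEMM C = A @ B on 4096x4096 FP16 matrices: 2*N^3 FLOPs,
# three N^2 matrices moved at 2 bytes/element.
gemm = arithmetic_intensity(flops=2 * 4096**3, bytes_moved=3 * n * 2)

print(f"machine balance: {MACHINE_BALANCE:.0f} FLOP/byte")
print(f"elementwise GELU: {ew:.1f} FLOP/byte (memory-bound)")
print(f"GEMM: {gemm:.0f} FLOP/byte (compute-bound)")
```

Elementwise ops sit orders of magnitude below the machine balance, which is why fusing them into adjacent kernels (rather than buying more FLOPs) is the win.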
Key Optimization Techniques
1. Memory Access Optimization
- Kernel fusion: Combine multiple operations into a single kernel to reduce HBM reads/writes
- Triton compiler: Write fused kernels in Python-like syntax; Triton compiles them down to optimized GPU machine code
- FlashAttention: IO-aware attention that reduces HBM traffic by ~10x
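FlashAttention's core idea, streaming over key/value tiles with an online softmax so the full attention matrix never touches HBM, can be sketched in NumPy for a single query row. This is an illustrative reduction of the real kernel (which also tiles queries and keeps working tiles in on-chip SRAM); all function names here are made up for the sketch:

```python
import numpy as np

def attention_row_tiled(q, K, V, block=16):
    """softmax(K @ q) @ V for one query vector, streaming over key
    blocks with running max/sum statistics (the online-softmax trick),
    so the full score vector is never materialized."""
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64)  # running weighted sum
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q                     # scores for this key block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale old stats to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

# Reference: materialize all scores, then softmax.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((128, 64))
V = rng.standard_normal((128, 64))
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(attention_row_tiled(q, K, V), ref)
```

The tiled result matches the naive softmax exactly; the saving is purely in memory traffic, which is why the technique is described as IO-aware rather than as an approximation.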
2. Quantization Kernels
| Format | Precision | Speedup vs FP16 | Quality Impact |
|---|---|---|---|
| INT8 | 8-bit | ~2x | Minimal |
| INT4 | 4-bit | ~4x | Moderate |
| FP8 (Hopper) | 8-bit float | ~2-3x | Very low |
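A minimal sketch of what an INT8 path does before the kernel runs, using symmetric per-tensor quantization (real engines typically use per-channel or per-group scales, and fused dequantize-in-GEMM kernels; this NumPy version only shows the numerics):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# Round-off error per weight is bounded by half a quantization step.
err = np.abs(dequantize(q, scale) - w).max()
assert err <= scale / 2 + 1e-6
```

The speedups in the table come from halving (or quartering) the bytes streamed from HBM per weight, which matters precisely because decode is memory-bandwidth-bound.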
3. Tensor Parallelism Kernels
- Split model weights across multiple GPUs
- NVLink/NVSwitch for high-bandwidth inter-GPU communication
- Critical for 70B+ models that don't fit on single GPU
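The weight-splitting above can be sketched with a column-parallel linear layer: each "GPU" holds a column shard of the weight matrix, computes its slice locally, and the slices are gathered (over NVLink in practice). NumPy arrays stand in for per-device shards here:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 512))       # activations, replicated on every GPU
W = rng.standard_normal((512, 1024))    # full weight, kept only for reference

# 4-way tensor parallelism: shard W by columns across 4 devices.
shards = np.split(W, 4, axis=1)
partials = [x @ w for w in shards]           # each device's local matmul
y_tp = np.concatenate(partials, axis=1)      # all-gather of output slices

assert np.allclose(y_tp, x @ W)
```

The complementary row-parallel layout shards W by rows and combines partial outputs with an all-reduce sum instead of a concatenation; Megatron-style transformers alternate the two so only one communication step is needed per MLP or attention block.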
4. FlashDecoding++ (Infinigence AI, 无问芯穹)
Novel kernel-level optimization for the decode phase:
- vs Hugging Face: 4.86x speedup (A100)
- vs FlashDecoding: 1.37x speedup
- vs vLLM: 1.24x speedup in decode stage
Inference Engine Comparison (Kernel-Level)
| Engine | Quantization | Fusion | Multi-GPU | Custom Kernels |
|---|---|---|---|---|
| vLLM | AWQ, GPTQ, GGUF | Yes | TP, PP | PagedAttention |
| SGLang | AWQ, GPTQ | Yes | TP, PP, EP | RadixAttention |
| LMDeploy | INT4, INT8 (TurboMind) | Yes (C++) | TP | TurboMind engine |
| TensorRT-LLM | FP8, INT8 | Yes | TP, PP | Highly optimized |
| Inferact | (uses vLLM) | Yes | Yes | vLLM kernel stack |
| RadixArk | (uses SGLang) | Yes | Yes | SGLang kernel stack |
Why It Matters
- H100 cluster cost: ~$3-5/hr per GPU
- Kernel optimization translates directly into lower inference cost per token
- For deployments serving 1B+ tokens/day, a 2x kernel speedup can save millions of dollars annually
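The cost claim above can be checked with back-of-envelope arithmetic. All figures below are illustrative assumptions (fleet size is invented; the hourly rate is taken from the range stated above):

```python
# Back-of-envelope annual savings from a 2x kernel speedup.
gpus = 64                    # assumed H100 fleet serving the workload
cost_per_gpu_hr = 4.0        # mid-range of the ~$3-5/hr figure above
hours_per_year = 24 * 365

baseline = gpus * cost_per_gpu_hr * hours_per_year
after_2x = baseline / 2      # 2x speedup -> half the GPU-hours for the same tokens

print(f"annual GPU spend: ${baseline:,.0f} -> ${after_2x:,.0f}")
```

At this modest 64-GPU scale the speedup already frees roughly $1.1M/yr; the saving scales linearly with fleet size, which is how billion-token deployments reach the "millions annually" figure.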
Sources
- vLLM paper (PagedAttention)
- FlashAttention paper (Dao et al.)
- Infinigence AI (无问芯穹) FlashDecoding++ technical disclosures