GPU Kernel Optimization
Parent: ai-inference-engines
Related: inferact, radixark
Overview
GPU kernel optimization is the discipline of improving the efficiency of low-level compute kernels that run on GPU hardware during model inference. It's the foundational layer beneath inference engines like vLLM and SGLang.
How GPUs Run Models
When model.generate() is called, the GPU executes hundreds to thousands of kernels — small parallel functions such as:
- GEMM (General Matrix Multiplication) — the heavy lifter
- FlashAttention — efficient attention computation
- LayerNorm — normalization
- Activation functions (ReLU, GELU, SwiGLU)
Bottleneck insight: GPUs are rarely compute-bound — they're usually memory bandwidth-bound. The core optimization challenge is minimizing data movement between HBM (High Bandwidth Memory) and compute units.
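The memory-bound vs compute-bound distinction above can be made concrete with a rough roofline check: compare an op's arithmetic intensity (FLOPs per byte of HBM traffic) against the machine balance of the GPU. The peak-FLOPs and bandwidth figures below are approximate A100-class numbers used for illustration, and the 8-FLOP cost per GELU element is an assumption:

```python
# Rough roofline check: is an op compute-bound or memory-bound?
# Approximate A100-class figures (illustrative, not exact vendor specs):
PEAK_FLOPS = 312e12          # FP16 tensor-core peak, FLOP/s
HBM_BW = 2.0e12              # HBM bandwidth, bytes/s
MACHINE_BALANCE = PEAK_FLOPS / HBM_BW  # FLOPs per byte needed to stay busy

def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of HBM traffic."""
    return flops / bytes_moved

n = 4096 * 4096
# Elementwise GELU on an FP16 tensor: assume ~8 FLOPs/element,
# one 2-byte read plus one 2-byte write per element.
ew = arithmetic_intensity(flops=8 * n, bytes_moved=4 * n)
# GEMM C = A @ B on 4096x4096 FP16 matrices: 2*N^3 FLOPs,
# three N^2 matrices moved at 2 bytes/element.
gemm = arithmetic_intensity(flops=2 * 4096**3, bytes_moved=3 * n * 2)

print(f"machine balance: {MACHINE_BALANCE:.0f} FLOP/byte")
print(f"elementwise GELU: {ew:.1f} FLOP/byte (memory-bound)")
print(f"GEMM: {gemm:.0f} FLOP/byte (compute-bound)")
```

Elementwise ops sit orders of magnitude below the machine balance, which is why fusing them into adjacent kernels (rather than buying more FLOPs) is the win.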
Key Optimization Techniques
1. Memory Access Optimization
- Kernel fusion: Combine multiple operations into a single kernel to reduce HBM reads/writes
- Triton compiler: Write fused kernels in Python-like syntax; Triton compiles them down to optimized GPU machine code
- FlashAttention: IO-aware attention that reduces HBM traffic by ~10x
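FlashAttention's core idea, streaming over key/value tiles with an online softmax so the full attention matrix never touches HBM, can be sketched in NumPy for a single query row. This is an illustrative reduction of the real kernel (which also tiles queries and keeps working tiles in on-chip SRAM); all function names here are made up for the sketch:

```python
import numpy as np

def attention_row_tiled(q, K, V, block=16):
    """softmax(K @ q) @ V for one query vector, streaming over key
    blocks with running max/sum statistics (the online-softmax trick),
    so the full score vector is never materialized."""
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64)  # running weighted sum
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q                     # scores for this key block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale old stats to the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / l

# Reference: materialize all scores, then softmax.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K = rng.standard_normal((128, 64))
V = rng.standard_normal((128, 64))
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(attention_row_tiled(q, K, V), ref)
```

The tiled result matches the naive softmax exactly; the saving is purely in memory traffic, which is why the technique is described as IO-aware rather than as an approximation.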
2. Quantization Kernels
| Format | Precision | Speedup vs FP16 | Quality Impact |
|---|---|---|---|
| INT8 | 8-bit | ~2x | Minimal |
| INT4 | 4-bit | ~4x | Moderate |
| FP8 (Hopper) | 8-bit float | ~2-3x | Very low |
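A minimal sketch of what an INT8 path does before the kernel runs, using symmetric per-tensor quantization (real engines typically use per-channel or per-group scales, and fused dequantize-in-GEMM kernels; this NumPy version only shows the numerics):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ~ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scale = quantize_int8(w)

# Round-off error per weight is bounded by half a quantization step.
err = np.abs(dequantize(q, scale) - w).max()
assert err <= scale / 2 + 1e-6
```

The speedups in the table come from halving (or quartering) the bytes streamed from HBM per weight, which matters precisely because decode is memory-bandwidth-bound.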
3. Tensor Parallelism Kernels
- Split model weights across multiple GPUs
- NVLink/NVSwitch for high-bandwidth inter-GPU communication
- Critical for 70B+ models that don't fit on single GPU
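The weight-splitting above can be sketched with a column-parallel linear layer: each "GPU" holds a column shard of the weight matrix, computes its slice locally, and the slices are gathered (over NVLink in practice). NumPy arrays stand in for per-device shards here:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 512))       # activations, replicated on every GPU
W = rng.standard_normal((512, 1024))    # full weight, kept only for reference

# 4-way tensor parallelism: shard W by columns across 4 devices.
shards = np.split(W, 4, axis=1)
partials = [x @ w for w in shards]           # each device's local matmul
y_tp = np.concatenate(partials, axis=1)      # all-gather of output slices

assert np.allclose(y_tp, x @ W)
```

The complementary row-parallel layout shards W by rows and combines partial outputs with an all-reduce sum instead of a concatenation; Megatron-style transformers alternate the two so only one communication step is needed per MLP or attention block.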
4. FlashDecoding++ (Infinigence AI, 无问芯穹)
Novel kernel-level optimization for the decode phase:
- vs Hugging Face: 4.86x speedup (A100)
- vs FlashDecoding: 1.37x speedup
- vs vLLM: 1.24x speedup in decode stage
Inference Engine Comparison (Kernel-Level)
| Engine | Quantization | Fusion | Multi-GPU | Custom Kernels |
|---|---|---|---|---|
| vLLM | AWQ, GPTQ, GGUF | Yes | TP, PP | PagedAttention |
| SGLang | AWQ, GPTQ | Yes | TP, PP, EP | RadixAttention |
| LMDeploy | INT4, INT8 (TurboMind) | Yes (C++) | TP | TurboMind engine |
| TensorRT-LLM | FP8, INT8 | Yes | TP, PP | Highly optimized |
| Inferact | (uses vLLM) | Yes | Yes | vLLM kernel stack |
| RadixArk | (uses SGLang) | Yes | Yes | SGLang kernel stack |
Why It Matters
- H100 cluster cost: ~$3-5/hr per GPU
- Kernel optimization translates directly into lower inference cost per token
- For deployments serving 1B+ tokens/day, a 2x kernel speedup can save millions of dollars annually
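The cost claim above can be checked with back-of-envelope arithmetic. All figures below are illustrative assumptions (fleet size is invented; the hourly rate is taken from the range stated above):

```python
# Back-of-envelope annual savings from a 2x kernel speedup.
gpus = 64                    # assumed H100 fleet serving the workload
cost_per_gpu_hr = 4.0        # mid-range of the ~$3-5/hr figure above
hours_per_year = 24 * 365

baseline = gpus * cost_per_gpu_hr * hours_per_year
after_2x = baseline / 2      # 2x speedup -> half the GPU-hours for the same tokens

print(f"annual GPU spend: ${baseline:,.0f} -> ${after_2x:,.0f}")
```

At this modest 64-GPU scale the speedup already frees roughly $1.1M/yr; the saving scales linearly with fleet size, which is how billion-token deployments reach the "millions annually" figure.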
Sources
- vLLM paper (PagedAttention)
- FlashAttention paper (Dao et al.)
- Infinigence AI (无问芯穹) FlashDecoding++ technical disclosures