AI Inference Engine Landscape
Related: inferact, radixark, gpu-kernel-optimization
Overview
AI inference engines are the infrastructure layer that serves trained large language models (LLMs) to users at scale. The market is rapidly commercializing around two dominant open-source projects: vLLM and SGLang.
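Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so from a client's perspective serving looks the same regardless of which engine is behind it. A minimal sketch, assuming an engine is already running on localhost:8000 (the default for `vllm serve`); the model name is a hypothetical placeholder:

```python
import requests

# Query a locally running inference engine (vLLM or SGLang) through its
# OpenAI-compatible chat completions endpoint. The URL, port, and model
# name are assumptions; adjust them to your deployment.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model

response = requests.post(
    f"{BASE_URL}/chat/completions",
    json={
        "model": MODEL,
        "messages": [{"role": "user", "content": "What is an inference engine?"}],
        "max_tokens": 128,
        "temperature": 0.7,
    },
    timeout=60,
)
response.raise_for_status()

print(response.json()["choices"][0]["message"]["content"])
```

Because the wire format is shared, switching engines is usually just a change of base URL, which is part of why the two projects compete so directly.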
Market Structure (2026)
Open-Source Leaders
| Project | Stars | Commercial Entity | Valuation | Lead Investor |
|---|---|---|---|---|
| vLLM | ~65K | Inferact | $800M | a16z + Lightspeed |
| SGLang | ~16K | RadixArk | $400M | Accel |
Other Players
| Engine | Developer | Notes |
|---|---|---|
| TensorRT-LLM | NVIDIA | Most heavily optimized for NVIDIA hardware; partially closed stack |
| LMDeploy | Shanghai AI Lab (InternLM) | Strong INT4, TurboMind C++ engine |
| Xinference | Xorbits (Alibaba-affiliated) | Focused on the Chinese market; distributed inference |
| Fireworks AI | Fireworks Inc. | $10B+ valuation; proprietary in-house engine |
DeepSeek's Strategic Choice
DeepSeek (V3, R1, V3-0324) chose to contribute optimizations back to vLLM rather than build its own inference engine.
The logic:
- DeepSeek is a model company, not an infrastructure company; maintaining an engine would tie up a dedicated team.
- vLLM has the largest deployment base, so contributing to vLLM puts DeepSeek models in front of more users.
- vLLM is hardware-agnostic, so DeepSeek benefits regardless of which hardware its users run.
AI Lab Official Recommendations
| Lab | Models | Recommended Engines |
|---|---|---|
| DeepSeek | V3, R1, V3-0324 | SGLang (Day-0) + vLLM |
| Meta | Llama 4 | vLLM + SGLang + TensorRT-LLM |
| Google | Gemma 3/4 | vLLM |
| Mistral | Mistral Large 3 | vLLM + SGLang |
| Moonshot | Kimi K2, K2.5 | vLLM + SGLang |
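For the engines in the table above, a day-0 recommendation typically means the checkpoint loads without patches. A minimal sketch of loading one of the listed model families through vLLM's offline Python API; the checkpoint and GPU count are illustrative (a full DeepSeek-V3/R1 checkpoint needs a multi-GPU node, while the distilled variant below fits on a single GPU):

```python
from vllm import LLM, SamplingParams

# Offline (non-server) inference through vLLM's Python API.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # example checkpoint
    tensor_parallel_size=1,   # raise to shard weights across GPUs
    trust_remote_code=True,   # some model repos ship custom modeling code
)

params = SamplingParams(temperature=0.6, max_tokens=256)
outputs = llm.generate(["Explain KV-cache paging in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```

The same `LLM(...)` constructor arguments map onto `vllm serve` flags (for example `--tensor-parallel-size`) when running the engine as a server instead.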
Key Metrics
| Metric | SGLang | vLLM |
|---|---|---|
| H100 throughput | ~16,200 tok/s | ~12,500 tok/s |
| Multi-GPU scaling | TP + PP + EP (tensor/pipeline/expert parallelism) | TP + PP |
| MoE support | Yes (DeepSeek V3/R1) | Yes |
| FP8 support | Partial | Yes (Hopper) |
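Throughput figures like those above come from saturated, batched serving benchmarks, so a single request will not reproduce them. A rough per-request decode rate is still easy to measure against either engine's OpenAI-compatible endpoint; in this sketch the URL, port, and model name are assumptions:

```python
import time
import requests

# Rough single-request decode throughput against an OpenAI-compatible
# endpoint (vLLM or SGLang). This measures one request end to end; the
# table's numbers come from many concurrent requests saturating the
# server, so expect far lower figures here.
BASE_URL = "http://localhost:8000/v1"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # hypothetical example model

start = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/completions",
    json={"model": MODEL, "prompt": "List ten prime numbers.", "max_tokens": 512},
    timeout=120,
)
elapsed = time.perf_counter() - start
resp.raise_for_status()

usage = resp.json()["usage"]  # OpenAI-compatible servers report token counts
tps = usage["completion_tokens"] / elapsed
print(f"{usage['completion_tokens']} tokens in {elapsed:.2f}s -> {tps:.1f} tok/s")
```

Comparing engines fairly requires matching model, hardware, batch size, and request mix, which is why published numbers for the same GPU vary widely.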
Business Model
All commercial players follow the same model: a free open-source engine plus paid enterprise managed services.
Paid offerings include SLA guarantees, dedicated GPU clusters, commercial support, and hardware co-development.
Sources
- Inferact $150M seed round coverage (Fintool, Pulse2, a16z)
- SGLang GitHub: sgl-project/sglang
- DeepSeek official model cards
- H100 benchmark figures compiled from various public inference tests