AI Inference Engine Landscape
Related: inferact, radixark, gpu-kernel-optimization, ollama
Overview
AI inference engines are the infrastructure layer that serves trained LLMs to users at scale. The market is rapidly commercializing around three dominant engines: the two open-source leaders, vLLM and SGLang, plus NVIDIA's TensorRT-LLM.
A structural shift is now visible across the layer. The raw feature gap between engines has largely closed (all three implement continuous batching, paged KV cache, and FP8), so the competition has moved up the stack in two directions at once. Downstream, the authors of both open-source engines now sit behind VC-backed companies selling managed services, threatening the pure-platform middlemen. Above the engines, NVIDIA has introduced an orchestration layer (Dynamo) that schedules all three engines as interchangeable backends. Underneath it all, the unit of value is migrating from "renting GPUs" toward "selling tokens": Microsoft has indicated that roughly half of its GPU customers now consume capacity through AI APIs, while only ~20–25% reserve bare metal, and inference is projected to absorb about two-thirds of 2026 AI compute [https://siliconangle.com/ (2026-06-28)]. That migration pulls economic value toward the engine and serving layer, which is exactly where this module's players operate.
Market Structure (2026)
Open-Source Leaders
| Project | Stars | Commercial Entity | Valuation | Lead Investor |
|---|---|---|---|---|
| vLLM | ~65K | Inferact | $800M | a16z + Lightspeed |
| SGLang | ~16K | RadixArk | $400M | Accel |
Other Players
| Engine | Developer | Notes |
|---|---|---|
| TensorRT-LLM | NVIDIA | Most optimized for NVIDIA hardware, NVIDIA-only |
| LMDeploy | Shanghai AI Lab (InternLM) | Strong INT4, TurboMind C++ engine |
| Xinference | Xorbits (Alibaba-affiliated) | Chinese market, distributed inference |
| Fireworks AI | Fireworks Inc. | $10B+ valuation, own engine (fireworks-ai) |
The Engine Authors Go Commercial (2026)
The most important 2026 development is that the core authors of all three major engines now sit behind companies — and they are moving downstream to sell managed services, not just steward open-source projects. This is the "taxing upstream" pattern: whoever writes the engine has the deepest performance knowledge, and is best positioned to operate it as a hosted product. That directly threatens the pure-platform players (fireworks-ai, baseten) whose moat was operational expertise around an engine they did not author.
| Engine | Commercial vehicle | Round | Valuation | Lead investors |
|---|---|---|---|---|
| vLLM | inferact | $150M seed (Jan 2026) | $800M | a16z + Lightspeed [https://techcrunch.com/ (2026-06-28)] |
| SGLang | radixark | $100M seed (May 2026) | $400M | Accel (w/ NVIDIA, AMD, MediaTek + angels) [https://lmsys.org/blog/ (2026-06-28)] |
| TensorRT-LLM | NVIDIA (in-house) | — | — | — |
inferact was founded by Simon Mo and Woosuk Kwon out of UC Berkeley's Ion Stoica lab — the original vLLM authors — and plans a paid serverless vLLM offering. Separately, Red Hat had already acquired Neural Magic (Jan 2025), the lead commercial vLLM contributor, to anchor enterprise private-cloud deployment — so the vLLM ecosystem now has both a founder-led startup and a Red Hat enterprise track [https://techcrunch.com/ (2026-06-28)]. radixark was incubated at LMSYS and carries strategic money from three silicon vendors (NVIDIA, AMD, MediaTek), signaling that the hardware side wants the SGLang scheduler tuned across their chips [https://lmsys.org/blog/ (2026-06-28)]. TensorRT-LLM has no separate company — it stays in-house at NVIDIA as the on-ramp that sells more GPUs.
The Orchestration Layer: NVIDIA Dynamo
In March 2026 (GA at GTC), NVIDIA shipped Dynamo, a framework-agnostic "inference operating system" that schedules SGLang, vLLM, and TensorRT-LLM together rather than competing with them. Its headline capabilities are disaggregated prefill/decode (splitting the two phases across different GPUs), smart KV-cache-aware request routing, and multi-tier KV-cache offload — delivering up to ~7× throughput on Blackwell-class hardware [https://nvidianews.nvidia.com/ (2026-06-28)]. It has been adopted by baseten, fireworks-ai, and Deep Infra [https://www.baseten.co/blog/ (2026-06-28)].
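To make the KV-cache-aware routing idea concrete, here is a minimal Python sketch. It is not Dynamo's API: the `Worker` class, the `route` function, and the `load_penalty` weight are all hypothetical names introduced for illustration. The only assumption is that a router prefers the worker already holding the longest cached prefix of the incoming request, discounted by how busy that worker is.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical worker: tracks cached token prefixes and in-flight load."""
    name: str
    active_requests: int = 0
    cached_prefixes: list = field(default_factory=list)

    def cached_overlap(self, tokens):
        """Length of the longest cached prefix shared with `tokens`."""
        best = 0
        for prefix in self.cached_prefixes:
            n = 0
            for a, b in zip(prefix, tokens):
                if a != b:
                    break
                n += 1
            best = max(best, n)
        return best


def route(tokens, workers, load_penalty=4):
    """Pick the worker with the best (reusable tokens - load cost) score.

    `load_penalty` is an illustrative knob: how many reusable tokens one
    in-flight request is "worth". Ties go to the earlier worker in the list.
    """
    return max(
        workers,
        key=lambda w: w.cached_overlap(tokens) - load_penalty * w.active_requests,
    )


system_prompt = (1, 2, 3, 4, 5, 6, 7, 8)
workers = [
    Worker("gpu-a", active_requests=1, cached_prefixes=[system_prompt]),
    Worker("gpu-b", active_requests=0),
]
request = system_prompt + (42, 43)
print(route(request, workers).name)  # gpu-a: 8 reusable tokens outweigh its load
```

In this toy scoring, a worker that holds the shared prefix wins until its queue grows long enough that an idle worker with a cold cache becomes the better choice. A production router would also weigh cache tiers and prefill/decode placement, which this sketch leaves out.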
The strategic logic is consistent with TensorRT-LLM's: NVIDIA gives away the orchestration software because engine-neutral, open infrastructure lowers the barrier to running large fleets, and more deployed fleets means more GPUs sold. Dynamo deliberately does not pick a winner among the engines — it commoditizes them into interchangeable backends underneath a scheduler NVIDIA controls.
Local / Single-User Engines
A distinct category from production serving engines — these are designed for running models on consumer hardware:
| Engine | Developer | Backend | Notes |
|---|---|---|---|
| ollama | Ollama Inc. | llama.cpp | One-command local serving, model registry, $20M funded 2026-04 |
| llama.cpp | Community (ggerganov) | Pure C++/CUDA/Metal | Maximum HW compatibility, GGUF quantization (2-8 bit), no built-in registry |
| MLX | Apple | Apple-native Metal | Best perf/watt on M-series Macs, SWA-native |
| LM Studio | LM Studio Inc. | llama.cpp | GUI + model browser, macOS/Windows |
Architectural relationship: Ollama and LM Studio are UX layers on top of llama.cpp; llama.cpp provides the C++ inference backend with GGUF quantization. MLX is a separate Apple-native stack that bypasses llama.cpp entirely on Apple Silicon. For production multi-user serving, vLLM/SGLang are the standard; local engines are for prototyping and single-user use cases.
Sliding Window Attention (SWA) Optimization
Multiple model families now use Sliding Window Attention to reduce KV cache memory pressure during long-text inference:
- Mimo-v2.5 (minimax): 60-layer SWA attending over only 128-token windows; long-text prefill compute is equivalent to that of a traditional 10-layer global-GQA model [local: 2026-05-30-summary.md].
- Gemma3 (Google): SWA auto-activates in supported engines, transparent to users.
- Qwen3 (qwen): Hybrid SWA architecture, user-transparent.
KV cache memory formula: 2 × L × H_kv × D_h × T × B × bytes, where the leading 2 counts keys and values, L = layers, H_kv = KV attention heads, D_h = head dimension, T = sequence length, B = batch size, and bytes = bytes per cached element (2 for FP16/BF16, 1 for FP8). Critical for deployment sizing of 1T+ MoE models like Kimi K2 (kimi).
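As a worked instance of the formula, the following Python sketch plugs in a hypothetical model shape. The layer count, head count, and other values are chosen only to make the arithmetic easy to follow and are not any real model's configuration. It assumes standard full-attention caching; architectures that compress or window the cache need an adjusted formula.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x L x H_kv x D_h x T x B x bytes per element."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, for illustration only.
size = kv_cache_bytes(
    layers=60, kv_heads=8, head_dim=128,
    seq_len=32_768, batch=4,
    bytes_per_elem=2,  # 2 bytes per element = FP16/BF16
)
print(f"{size / 2**30:.1f} GiB")  # 30.0 GiB
```

Because every factor is multiplicative, doubling the sequence length or the batch size doubles the cache, which is why long-context serving is often limited by memory before it is limited by compute.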
Engine support: vLLM, SGLang, llama.cpp/Ollama, and MLX all support SWA models — the optimization is architectural (model-level), not engine-specific. When a model uses SWA, the engine automatically applies the sliding window, requiring no user configuration.
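One way to picture why SWA relieves KV cache pressure is a rolling buffer: a sliding-window layer only ever needs the most recent W positions, while a global layer keeps them all. The sketch below is conceptual and uses a hypothetical helper, `cached_positions`; real engines handle this inside their own cache allocators rather than with a Python `deque`. The 128-token window matches the figure quoted in the list above.

```python
from collections import deque


def cached_positions(num_tokens, window=None):
    """Return the token positions a layer still holds in its KV cache
    after processing `num_tokens` tokens.

    window=None models a global-attention layer (keeps everything);
    an integer models a sliding-window layer (keeps the last `window`).
    """
    cache = deque(maxlen=window)
    for pos in range(num_tokens):
        cache.append(pos)  # oldest entry is evicted once the window is full
    return list(cache)


global_layer = cached_positions(1000)
swa_layer = cached_positions(1000, window=128)
print(len(global_layer), len(swa_layer))  # 1000 128
print(swa_layer[0], swa_layer[-1])        # 872 999
```

For a sliding-window layer, the T term in the memory formula is effectively capped at the window size, so the cache stops growing once the sequence exceeds the window.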
DeepSeek's Strategic Choice
DeepSeek (models V3, R1, V3-0324) chose to contribute optimizations back to vLLM rather than building their own inference engine.
Logic:
- DeepSeek is a model company, not an infrastructure company; maintaining its own engine would cost it a dedicated team
- vLLM has the largest deployment base — contributing to vLLM = DeepSeek models reach more users
- vLLM is hardware-agnostic — DeepSeek benefits regardless of what hardware users have
AI Lab Official Recommendations
| Lab | Models | Recommended Engines |
|---|---|---|
| DeepSeek | V3, R1, V3-0324 | SGLang (Day-0) + vLLM |
| Meta | Llama 4 | vLLM + SGLang + TensorRT-LLM |
| Google | Gemma 3/4 | vLLM |
| Mistral | Mistral Large 3 | vLLM + SGLang |
| Moonshot | Kimi K2, K2.5 | vLLM + SGLang |
Key Metrics
| Metric | SGLang | vLLM |
|---|---|---|
| H100 throughput | ~16,200 tok/s | ~12,500 tok/s |
| Multi-GPU scaling | TP + PP + EP | TP + PP |
| MoE support | Yes (DeepSeek V3/R1) | Yes |
| FP8 support | Partial | Yes (Hopper) |
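To connect throughput to token economics, the short Python sketch below converts tokens per second into a serving cost per million tokens. The throughput figures are the ones in the table; the hourly GPU price is an assumed, illustrative number rather than a quote, and the calculation assumes the GPU is fully utilized.

```python
def cost_per_million_tokens(gpu_price_per_hour, tokens_per_second):
    """Dollars to generate one million tokens on a fully utilized GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_price_per_hour / tokens_per_hour * 1_000_000


GPU_PRICE_PER_HOUR = 3.00  # assumed H100 rental price, illustrative only

for engine, tok_s in [("SGLang", 16_200), ("vLLM", 12_500)]:
    cost = cost_per_million_tokens(GPU_PRICE_PER_HOUR, tok_s)
    print(f"{engine}: ${cost:.4f} per 1M tokens")
```

Under these assumptions the result is roughly $0.051 versus $0.067 per million tokens. The absolute numbers depend entirely on the assumed price and utilization, but the ratio shows how a throughput gap translates directly into margin when the product sold is tokens rather than GPU hours.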
Engine Differentiation (2026)
With the core feature set (continuous batching, paged KV cache, FP8) now common to all three, differentiation has narrowed to a few distinct edges:
| Engine | Edge | Trade-off |
|---|---|---|
| vLLM | Broadest hardware/model coverage; predictable, stable behavior | Not always the raw-throughput leader on any single chip |
| SGLang | RadixAttention prefix caching + multi-call scheduling; wins on high-concurrency and MoE workloads; aggressive day-0 support for new hardware | Newer, faster-moving surface |
| TensorRT-LLM | Compiled-engine path → highest raw throughput on NVIDIA hardware | NVIDIA-only; compile step adds operational friction |
The practical read: pick vLLM for portability and breadth, SGLang for high-concurrency / MoE / new-silicon bring-up, and TensorRT-LLM when you are NVIDIA-locked and chasing peak tokens/sec.
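The prefix-caching idea behind RadixAttention, reusing cached KV state when requests share a prompt prefix, can be illustrated with a toy token trie. This is a simplified sketch and not SGLang's implementation: the `PrefixCache` class and its methods are hypothetical, it uses a plain per-token trie rather than a compressed radix tree, and it tracks only which prefixes were seen, with no KV tensors and no eviction.

```python
class PrefixCache:
    """Toy per-token trie recording which token prefixes have been seen."""

    def __init__(self):
        self.root = {}

    def match(self, tokens):
        """Number of leading tokens whose KV state could be reused."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record `tokens` so later requests can reuse the shared prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})


cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]   # shared instructions, as token ids
first = system_prompt + [55, 56]
second = system_prompt + [77, 78, 79]

print(cache.match(first))    # 0: cold cache, full prefill needed
cache.insert(first)
print(cache.match(second))   # 5: only the 3 new tokens need prefill
```

Workloads where many requests share a long system prompt or conversation history benefit most, since the matched prefix is work that does not have to be recomputed.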
Open vs Closed Among Platform Engines
The platform players that built their own engines split on how open they are:
- Fireworks (fireworks-ai) — FireAttention is fully proprietary and closed. The moat is performance alone; nothing is contributed back [https://fireworks.ai/blog (2026-06-28)].
- together-ai — its serving engine is closed, but it rests on FlashAttention (Tri Dao), which is open-source and now an industry-standard primitive. This is the "open foundation + closed monetization" posture: give the field the building block, keep the assembled product proprietary.
This contrast matters for durability. A pure performance moat (FireAttention) erodes as the open engines close the gap; an "open foundation" posture (FlashAttention under Together) earns ecosystem goodwill and standard-setting leverage while still monetizing the integrated stack.
Business Model
All commercial players follow the same model: the open-source engine is free, and enterprise managed services are paid.
Paid services include SLA guarantees, dedicated GPU clusters, commercial support, and hardware co-development.
The 2026 twist is who gets to charge. As the engine authors move downstream (inferact, radixark) and NVIDIA commoditizes engine choice from above (Dynamo), the squeezed position is the pure platform that neither authored an engine nor owns silicon — it must compete on operations against the people who wrote the code it runs. The value migration from renting GPUs to selling tokens reinforces this: the margin accrues to whoever owns the serving/engine layer where tokens are actually produced.
Sources
- Inferact $150M seed round coverage (Fintool, Pulse2, a16z)
- SGLang GitHub: lmsys-org/sglang
- DeepSeek official model cards
- H100 benchmark data from various inference tests
- local: 2026-05-30-summary.md — SWA optimization, KV cache formula, Ollama/vLLM/llama.cpp/MLX landscape
- local: 2026-05-31-ai-infrastructure.md — raw research notes
- LMSYS blog — SGLang / RadixArk $100M seed, engine differentiation (https://lmsys.org/blog/, 2026-06-28)
- TechCrunch — Inferact $150M seed @ $800M, Red Hat / Neural Magic (https://techcrunch.com/, 2026-06-28)
- SiliconANGLE — compute-vs-token economics, Microsoft GPU customer mix (https://siliconangle.com/, 2026-06-28)
- NVIDIA Newsroom — Dynamo 1.0 GA, disaggregated serving, KV-cache routing (https://nvidianews.nvidia.com/, 2026-06-28)
- Baseten blog — Dynamo adoption (https://www.baseten.co/blog/, 2026-06-28)
- Fireworks blog — FireAttention proprietary engine (https://fireworks.ai/blog, 2026-06-28)