Company
Fireworks AI
Third-party (3P) inference platform built on the proprietary, closed-source FireAttention engine, targeting enterprise compound AI; valuation already past $5B.
1. Core Product / Service
Fireworks's product matrix revolves around one main line: serving open-source models on a proprietary inference engine and billing per token.
- Serverless Inference API: a menu of 100+ open-source models, including Llama 3.1/3.3 (8B/70B/405B), DeepSeek V3 / R1, Qwen2.5, Mistral, Mixtral, and Gemma, billed per million tokens [1].
- On-Demand / Dedicated Deployments: customer-exclusive GPUs billed per GPU-hour, avoiding serverless multi-tenant queues; suited to stable-QPS workloads and privately fine-tuned weights.
- Fine-Tuning: LoRA and full fine-tuning; resulting models deploy to serverless with no additional hosting fee.
- FireOptimizer / Compound AI: a tool layer that orchestrates multi-model, multi-step inference and function calling into production pipelines; this is Fireworks's differentiated product for the enterprise market.
- FireAttention engine: proprietary and closed-source, claiming 4× faster serving than vLLM in FP8 / FP16 (self-run benchmarks, with quantization enabled) [5].
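Per Fireworks's docs the serverless API is OpenAI-compatible; a minimal sketch of a chat-completions call (the endpoint path and model id are assumptions here, check [1] for current values):

```python
import json
import os
import urllib.request

# Assumed endpoint and model id; verify against current Fireworks docs.
API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p1-70b-instruct"

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def call_fireworks(prompt: str) -> str:
    """Send the request; requires FIREWORKS_API_KEY in the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The OpenAI-compatible shape is what makes the "single API switching across hundreds of models" pitch work: swapping models is a one-line change to the model id.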
2. Target Users & Pain Points
- Enterprise AI teams: want to avoid lock-in to the OpenAI / Anthropic APIs but still need SLAs, multi-region deployment, and private data handling; Fireworks offers a single API that switches across hundreds of open-source models, plus dedicated clusters.
- High-volume SaaS / agent companies: above ~10B tokens of monthly consumption, first-party (1P) API pricing becomes prohibitive. Fireworks markets itself as ~10× cheaper than GPT-4o at Llama 70B-equivalent capability (its own marketing material) [4].
- Pain points: self-hosting vLLM requires kernel tuning, multi-node orchestration, and autoscaling; Fireworks abstracts these behind a single API while retaining dedicated options for customers that need isolation.
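The volume threshold above makes the 1P-vs-3P gap easy to quantify; a back-of-envelope sketch using the list prices cited in section 4 (the 50/50 input/output token split is an assumption):

```python
def monthly_cost_usd(tokens: float, price_per_m: float) -> float:
    """Cost of `tokens` total tokens at a blended $/1M-token rate."""
    return tokens / 1e6 * price_per_m

TOKENS = 10e9  # 10B tokens/month, the threshold cited above

# Blended GPT-4o rate assuming a 50/50 input/output token split:
gpt4o_blended = 0.5 * 5.0 + 0.5 * 15.0   # = $10/M
llama70b = 0.90                           # Fireworks serverless list price [1]

print(monthly_cost_usd(TOKENS, gpt4o_blended))  # 100000.0 -> $100k/month
print(monthly_cost_usd(TOKENS, llama70b))       # 9000.0   -> $9k/month
```

At that volume the ~10× gap is the difference between a $100k and a $9k monthly bill, which is why the pitch lands specifically with high-volume customers.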
3. Competitive Landscape
| Competitor | Positioning | Vs. Fireworks |
|---|---|---|
| Together AI | Proprietary Kernel Collection (Tri Dao), serverless + GPU clusters | Direct rival; Together has a larger model menu plus a 1,000-GPU cluster product; Fireworks goes deeper on compound AI / agent orchestration |
| Inferact | vLLM commercialization, bound to the open-source ecosystem | Inferact sits in the open-source camp; Fireworks's closed-source engine controls its own optimization path |
| Radixark | SGLang commercialization | Another engine-camp rival; Fireworks is more productized |
| Groq | Proprietary LPU hardware | Competes on a different axis (hardware vs. software) |
| DeepInfra | Ultra-low-price serverless | DeepInfra is cheaper but has a weaker enterprise product; Fireworks is positioned higher-end |
| AWS Bedrock | Cloud-managed | Bedrock wins on distribution; Fireworks wins on performance and engine depth |
Differentiation: the FireAttention engine plus Compound AI orchestration is Fireworks's dual selling point. Peers mostly compete on speed and price; Fireworks packages a "production-ready agent / function-calling pipeline" to sell to enterprises.
4. Unique Observations
- Per-token pricing (serverless, public as of 2026-05): Llama 3.1 8B ~$0.20/M tokens; Llama 3.1 70B ~$0.90/M (input/output blended); Llama 3.1 405B ~$3/M; DeepSeek V3 ~$0.90/M; Qwen2.5 72B ~$0.90/M [1].
- vs. 1P price gap: Llama 3.1 70B at ~$0.90/M vs. GPT-4o at ~$5/M input + $15/M output → blended ~$10/M, a gap of roughly 10×. But capability is not fully equivalent: Llama 70B still trails GPT-4o on general reasoning, so the trade-off only works where the task tolerates the gap.
- vs. Together: both serve Llama 70B at ~$0.88-0.90/M, prices nearly identical; the real competition is not price or volume but engine efficiency and enterprise product depth.
- Inference engine: closed-source, proprietary FireAttention (not vLLM / SGLang). This means Fireworks must do all adaptation work for every new hardware generation and model architecture itself; that is both a burden and a moat.
- Compute sourcing: does not build its own L1 (datacenter) capacity; it mainly rents H100 / H200 from L2 providers such as CoreWeave, Oracle, and GCP, handling capacity scheduling and kernel-layer optimization itself. Take rate ≈ (token sale price − GPU rental cost) / token sale price; not publicly disclosed, but industry estimates put serverless gross margin at 30-50%, with dedicated lower.
- Compound AI as hedge: if the open-vs-closed model gap narrows and the token price war worsens (pure inference commoditized), Fireworks wants a second act in the "agent / pipeline tool layer"; this mirrors Snowflake's early playbook of leaning into the data-app platform as cloud data warehousing commoditized.
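The take-rate formula above can be sketched directly. The GPU rental rate and serving throughput below are illustrative assumptions, not disclosed numbers; the point is only that plausible inputs land inside the 30-50% estimate:

```python
def take_rate(price_per_m: float, gpu_cost_per_m: float) -> float:
    """(token sale price - GPU rental cost) / token sale price."""
    return (price_per_m - gpu_cost_per_m) / price_per_m

# Illustrative only: assume a rented H100 at $2.50/hr serving
# 1,500 tokens/s of Llama 70B-class traffic at full utilization.
gpu_hourly = 2.50
tokens_per_hour = 1500 * 3600                      # 5.4M tokens/hr
cost_per_m = gpu_hourly / (tokens_per_hour / 1e6)  # ~ $0.46 per 1M tokens

price_per_m = 0.90  # Fireworks serverless list price for Llama 70B [1]
print(f"{take_rate(price_per_m, cost_per_m):.0%}")  # ~49% at these assumptions
```

Real utilization is well below 100%, and kernel-level efficiency (FireAttention) directly moves `tokens_per_hour`, which is why engine depth is a margin lever, not just a latency story.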
5. Financials / Funding
| Round | Date | Amount | Valuation | Lead |
|---|---|---|---|---|
| Seed | 2022 | — | — | Sequoia |
| Series A | 2023-07 | $25M | — | Benchmark |
| Series B | 2024-07 | $52M | $552M post | Sequoia [2] |
| Series C | 2025-07 (reported) | ~$200M | ~$5.5B | Multiple growth funds (Reuters report) [3] |
- Founded: 2022, by members departing Meta's PyTorch team
- Total funding estimate: ~$300M+
- Customer count: self-reported "thousands of enterprise customers" (public cases include DoorDash and Quora); ARR undisclosed
Note: a "$10B valuation" figure circulating elsewhere differs from the Reuters-reported ~$5.5B (2025-07); this page anchors to the Reuters figure. Update if a 2026 round pushes the valuation to $10B.
6. People & Relationships
- CEO / Founder: Lin Qiao, former Meta PyTorch team lead and a key figure in PyTorch Distributed / Inference; Fireworks's "proprietary engine" narrative is largely built on the PyTorch alumni network.
- Investors: Sequoia, Benchmark, NVIDIA, AMD, MongoDB Ventures (strategic); Databricks Ventures (reported).
- Competes with: Together AI, Inferact, Radixark, DeepInfra, Groq, Anyscale.
- Partners with: NVIDIA (GPU + early hardware), AMD MI300X adaptation, MongoDB (vector integration).
- Hosts models from: Meta (Llama), DeepSeek, Mistral, Alibaba (Qwen), Google (Gemma).
Sources
- [1] https://fireworks.ai/pricing (2026-05-10)
- [2] https://fireworks.ai/blog/fireworks-raises-52m-series-b (2026-05-10)
- [3] https://www.reuters.com/technology/artificial-intelligence/fireworks-ai-valued-552-billion-latest-funding-round-2025-07-11/ (2026-05-10)
- [4] https://northflank.com/blog/fireworks-ai-vs-together-ai (2026-05-10)
- [5] https://fireworks.ai/blog/fire-attention-serving-open-source-models-4x-faster-than-vllm-by-quantizing-with-no-tradeoffs (2026-05-10)
Last compiled: 2026-05-10