Company
Fireworks AI
Third-party (3P) inference platform built on the proprietary, closed-source FireAttention engine, targeting enterprise compound AI; valuation already past $5B.
1. Core Product / Service
Fireworks's product matrix revolves around one main line: serving open-source models on a proprietary inference engine and billing per token.
- Serverless Inference API: a menu of 100+ open-source models, including Llama 3.1/3.3 (8B/70B/405B), DeepSeek V3 / R1, Qwen2.5, Mistral, Mixtral, and Gemma, billed per million tokens [1].
- On-Demand / Dedicated Deployments: customer-exclusive GPUs billed per GPU-hour, avoiding serverless multi-tenant queues; suited to stable-QPS workloads and privately fine-tuned weights.
- Fine-Tuning: LoRA and full fine-tuning; resulting models deploy to serverless with no additional hosting fee.
- FireOptimizer / Compound AI: a tool layer that orchestrates multi-model, multi-step inference and function calling into production pipelines; this is Fireworks's differentiated product for the enterprise market.
- FireAttention engine: proprietary and closed-source, claiming 4× faster serving than vLLM in FP8 / FP16 (self-run benchmarks, with quantization enabled) [5].
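Per Fireworks's docs the serverless API is OpenAI-compatible; a minimal sketch of a chat-completions call (the endpoint path and model id are assumptions here, check [1] for current values):

```python
import json
import os
import urllib.request

# Assumed endpoint and model id; verify against current Fireworks docs.
API_URL = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p1-70b-instruct"

def build_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat-completions payload."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

def call_fireworks(prompt: str) -> str:
    """Send the request; requires FIREWORKS_API_KEY in the environment."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ['FIREWORKS_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The OpenAI-compatible shape is what makes the "single API switching across hundreds of models" pitch work: swapping models is a one-line change to the model id.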
2. Target Users & Pain Points
- Enterprise AI teams: want to avoid lock-in to the OpenAI / Anthropic APIs but still need SLAs, multi-region deployment, and private data handling; Fireworks offers a single API that switches across hundreds of open-source models, plus dedicated clusters.
- High-volume SaaS / agent companies: above ~10B tokens of monthly consumption, first-party (1P) API pricing becomes prohibitive. Fireworks markets itself as ~10× cheaper than GPT-4o at Llama 70B-equivalent capability (its own marketing material) [4].
- Pain points: self-hosting vLLM requires kernel tuning, multi-node orchestration, and autoscaling; Fireworks abstracts these behind a single API while retaining dedicated options for customers that need isolation.
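The volume threshold above makes the 1P-vs-3P gap easy to quantify; a back-of-envelope sketch using the list prices cited in section 4 (the 50/50 input/output token split is an assumption):

```python
def monthly_cost_usd(tokens: float, price_per_m: float) -> float:
    """Cost of `tokens` total tokens at a blended $/1M-token rate."""
    return tokens / 1e6 * price_per_m

TOKENS = 10e9  # 10B tokens/month, the threshold cited above

# Blended GPT-4o rate assuming a 50/50 input/output token split:
gpt4o_blended = 0.5 * 5.0 + 0.5 * 15.0   # = $10/M
llama70b = 0.90                           # Fireworks serverless list price [1]

print(monthly_cost_usd(TOKENS, gpt4o_blended))  # 100000.0 -> $100k/month
print(monthly_cost_usd(TOKENS, llama70b))       # 9000.0   -> $9k/month
```

At that volume the ~10× gap is the difference between a $100k and a $9k monthly bill, which is why the pitch lands specifically with high-volume customers.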
3. Competitive Landscape
| Competitor | Positioning | Vs. Fireworks |
|---|---|---|
| Together AI | Proprietary Kernel Collection (Tri Dao), serverless + GPU clusters | Direct rival; Together has a larger model menu plus a 1,000-GPU cluster product; Fireworks goes deeper on compound AI / agent orchestration |
| Inferact | vLLM commercialization, bound to the open-source ecosystem | Inferact sits in the open-source camp; Fireworks's closed-source engine controls its own optimization path |
| Radixark | SGLang commercialization | Another engine-camp rival; Fireworks is more productized |
| Groq | Proprietary LPU hardware | Competes on a different axis (hardware vs. software) |
| DeepInfra | Ultra-low-price serverless | DeepInfra is cheaper but has a weaker enterprise product; Fireworks is positioned higher-end |
| AWS Bedrock | Cloud-managed | Bedrock wins on distribution; Fireworks wins on performance and engine depth |
Differentiation: the FireAttention engine plus Compound AI orchestration is Fireworks's dual selling point. Peers mostly compete on speed and price; Fireworks packages a "production-ready agent / function-calling pipeline" to sell to enterprises.
4. Unique Observations
- Per-token pricing (serverless, public as of 2026-05): Llama 3.1 8B ~$0.20/M tokens; Llama 3.1 70B ~$0.90/M (input/output blended); Llama 3.1 405B ~$3/M; DeepSeek V3 ~$0.90/M; Qwen2.5 72B ~$0.90/M [1].
- vs. 1P price gap: Llama 3.1 70B at ~$0.90/M vs. GPT-4o at ~$5/M input + $15/M output → blended ~$10/M, a gap of roughly 10×. But capability is not fully equivalent: Llama 70B still trails GPT-4o on general reasoning, so the trade-off only works where the task tolerates the gap.
- vs. Together: both serve Llama 70B at ~$0.88-0.90/M, prices nearly identical; the real competition is not price or volume but engine efficiency and enterprise product depth.
- Inference engine: closed-source, proprietary FireAttention (not vLLM / SGLang). This means Fireworks must do all adaptation work for every new hardware generation and model architecture itself; that is both a burden and a moat.
- Compute sourcing: does not build its own L1 (datacenter) capacity; it mainly rents H100 / H200 from L2 providers such as CoreWeave, Oracle, and GCP, handling capacity scheduling and kernel-layer optimization itself. Take rate ≈ (token sale price − GPU rental cost) / token sale price; not publicly disclosed, but industry estimates put serverless gross margin at 30-50%, with dedicated lower.
- Compound AI as hedge: if the open-vs-closed model gap narrows and the token price war worsens (pure inference commoditized), Fireworks wants a second act in the "agent / pipeline tool layer"; this mirrors Snowflake's early playbook of leaning into the data-app platform as cloud data warehousing commoditized.
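The take-rate formula above can be sketched directly. The GPU rental rate and serving throughput below are illustrative assumptions, not disclosed numbers; the point is only that plausible inputs land inside the 30-50% estimate:

```python
def take_rate(price_per_m: float, gpu_cost_per_m: float) -> float:
    """(token sale price - GPU rental cost) / token sale price."""
    return (price_per_m - gpu_cost_per_m) / price_per_m

# Illustrative only: assume a rented H100 at $2.50/hr serving
# 1,500 tokens/s of Llama 70B-class traffic at full utilization.
gpu_hourly = 2.50
tokens_per_hour = 1500 * 3600                      # 5.4M tokens/hr
cost_per_m = gpu_hourly / (tokens_per_hour / 1e6)  # ~ $0.46 per 1M tokens

price_per_m = 0.90  # Fireworks serverless list price for Llama 70B [1]
print(f"{take_rate(price_per_m, cost_per_m):.0%}")  # ~49% at these assumptions
```

Real utilization is well below 100%, and kernel-level efficiency (FireAttention) directly moves `tokens_per_hour`, which is why engine depth is a margin lever, not just a latency story.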
5. Financials / Funding
| Round | Date | Amount | Valuation | Lead |
|---|---|---|---|---|
| Seed | 2022 | — | — | Sequoia |
| Series A | 2023-07 | $25M | — | Benchmark |
| Series B | 2024-07 | $52M | $552M post | Sequoia [2] |
| Series C | 2025-07 (reported) | ~$200M | ~$5.5B | Multiple growth funds (Reuters report) [3] |
- Founded: 2022, by members departing Meta's PyTorch team
- Total funding estimate: ~$300M+
- Customer count: self-reported "thousands of enterprise customers" (public cases include DoorDash and Quora); ARR undisclosed
Note: a "$10B valuation" figure circulating elsewhere differs from the Reuters-reported ~$5.5B (2025-07); this page anchors to the Reuters figure. Update if a 2026 round pushes the valuation to $10B.
6. People & Relationships
- CEO / Founder: Lin Qiao, former Meta PyTorch team lead and a key figure in PyTorch Distributed / Inference; Fireworks's "proprietary engine" narrative is largely built on the PyTorch alumni network.
- Investors: Sequoia, Benchmark, NVIDIA, AMD, MongoDB Ventures (strategic); Databricks Ventures (reported).
- Competes with: Together AI, Inferact, Radixark, DeepInfra, Groq, Anyscale.
- Partners with: NVIDIA (GPU + early hardware), AMD MI300X adaptation, MongoDB (vector integration).
- Hosts models from: Meta (Llama), DeepSeek, Mistral, Alibaba (Qwen), Google (Gemma).
Sources
- [1] https://fireworks.ai/pricing (2026-05-10)
- [2] https://fireworks.ai/blog/fireworks-raises-52m-series-b (2026-05-10)
- [3] https://www.reuters.com/technology/artificial-intelligence/fireworks-ai-valued-552-billion-latest-funding-round-2025-07-11/ (2026-05-10)
- [4] https://northflank.com/blog/fireworks-ai-vs-together-ai (2026-05-10)
- [5] https://fireworks.ai/blog/fire-attention-serving-open-source-models-4x-faster-than-vllm-by-quantizing-with-no-tradeoffs (2026-05-10)
Last compiled: 2026-05-10