Company

Fireworks AI

Third-party (3P) inference platform built on the proprietary, closed-source FireAttention engine; targets enterprise compound AI; valuation already past $5B.

1. Core Product / Service

Fireworks's product matrix revolves around one main line: serving open-source models on a proprietary inference engine and billing per token.

  • Serverless Inference API: a menu of 100+ open-source models, including Llama 3.1/3.3 (8B/70B/405B), DeepSeek V3 / R1, Qwen2.5, Mistral, Mixtral, and Gemma; billed per million tokens [1].
  • On-Demand / Dedicated Deployments: customer-exclusive GPUs billed per GPU-hour, avoiding serverless multi-tenant queues; suited to stable-QPS workloads and privately fine-tuned weights.
  • Fine-Tuning: LoRA and full fine-tuning; resulting models can be deployed on serverless with no additional hosting fee.
  • FireOptimizer / Compound AI: a tool layer that orchestrates multi-model, multi-step inference and function calling into production pipelines; this is Fireworks's differentiated wedge into the enterprise market.
  • FireAttention engine: proprietary and closed-source; claims up to 4× faster than vLLM at FP8 / FP16 (company's own benchmarks, with quantization enabled) [5].
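The serverless API follows the common OpenAI-style chat-completions shape. A minimal sketch of a call, assuming an OpenAI-compatible endpoint (the endpoint path, model id, and `FIREWORKS_API_KEY` env var name are illustrative assumptions, not confirmed by this page; check the Fireworks docs):

```python
import json
import os
import urllib.request

# Assumed endpoint and model id for illustration; verify against Fireworks docs.
ENDPOINT = "https://api.fireworks.ai/inference/v1/chat/completions"
MODEL = "accounts/fireworks/models/llama-v3p1-70b-instruct"


def build_request(prompt: str, max_tokens: int = 256) -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for the serverless API."""
    payload = {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('FIREWORKS_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )


# Only fire the network call when a key is actually configured.
if __name__ == "__main__" and os.environ.get("FIREWORKS_API_KEY"):
    with urllib.request.urlopen(build_request("Say hi in one word.")) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

The same request body works against dedicated deployments; only the base URL and capacity model change, which is what makes the serverless-to-dedicated migration path cheap for customers.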

2. Target Users & Pain Points

  • Enterprise AI teams: don't want lock-in to the OpenAI / Anthropic APIs, but still need SLAs, multi-region deployment, and private data handling; Fireworks offers a single API that switches across hundreds of open-source models, plus dedicated clusters.
  • High-volume SaaS / agent companies: above ~10B tokens of monthly consumption, first-party (1P) API prices become untenable. Fireworks claims ~10× cheaper than GPT-4o at Llama-70B-equivalent capability (own marketing material) [4].
  • Pain points: self-hosting vLLM requires kernel tuning, multi-node orchestration, and autoscaling; Fireworks abstracts these behind a single API while retaining dedicated options for customers that need isolation.

3. Competitive Landscape

| Competitor | Positioning | Vs. Fireworks |
|---|---|---|
| Together AI | Proprietary kernel collection (Tri Dao), serverless + GPU clusters | Direct rival; Together has a larger model menu and a 1000-GPU cluster product; Fireworks goes deeper on compound AI / agent orchestration |
| Inferact | vLLM commercialization, tied to the open-source ecosystem | Inferact is in the open-source camp; Fireworks's closed-source engine controls its own optimization path |
| Radixark | SGLang commercialization | Also an engine-camp rival; Fireworks is more "productized" |
| Groq | Proprietary LPU hardware | Competes on a different axis (hardware vs. software) |
| DeepInfra | Ultra-low-price serverless | DeepInfra is cheaper but weaker as an enterprise product; Fireworks plays higher-end |
| AWS Bedrock | Cloud-managed model hosting | Bedrock wins on overall distribution; Fireworks wins on performance / engine depth |

Differentiation: the FireAttention engine plus Compound AI orchestration form Fireworks's dual selling point. Peers mostly compete on speed and price; Fireworks packages a "production-ready agent / function-calling pipeline" to sell to enterprises.

4. Unique Observations

  • Per-token pricing (serverless, 2026-05 public): Llama 3.1 8B ~$0.20/M tokens; Llama 3.1 70B ~$0.90/M (input/output blended); Llama 3.1 405B ~$3/M; DeepSeek V3 ~$0.90/M; Qwen2.5 72B ~$0.90/M [1].
  • vs 1P price gap: Llama 3.1 70B at ~$0.90/M vs GPT-4o at ~$5/M input + $15/M output → blended ~$10/M at a 50/50 mix, a gap of roughly 10×. But capability is not fully equivalent: Llama 70B still trails GPT-4o on general reasoning, so the trade-off only works when the task tolerates it.
  • vs Together: the same Llama 70B sits at ~$0.88-0.90/M on both; prices are nearly identical, so the competition is not price or volume but engine efficiency and enterprise product depth.
  • Inference engine: closed-source, proprietary FireAttention (not vLLM / SGLang). This means Fireworks must do all adaptation work for every new hardware generation and model architecture itself: both a burden and a moat.
  • Compute sourcing: doesn't self-build compute (L1); mainly rents H100 / H200 capacity from L2 providers such as CoreWeave, Oracle, and GCP, handling capacity scheduling and kernel-layer optimization itself. Take rate ≈ (token sale price − GPU rental cost) / token sale price; not publicly disclosed, but industry estimates put serverless gross margin at 30-50%, dedicated lower.
  • Compound AI as hedge: if the open- vs closed-source model gap narrows and the token price war worsens (pure inference commoditizes), Fireworks aims to re-anchor in the "agent / pipeline tool layer", similar to Snowflake's early playbook of leaning into the data-app platform as the cloud DW commoditized.
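The blended-price and take-rate arithmetic above can be sketched directly; the 50/50 input/output traffic mix and the $0.55/M GPU-cost figure are illustrative assumptions, not disclosed numbers:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  output_share: float = 0.5) -> float:
    """Blend per-million-token input/output prices by traffic mix."""
    return input_per_m * (1 - output_share) + output_per_m * output_share


def take_rate(token_price: float, gpu_cost: float) -> float:
    """Take rate = (token sale price - GPU rental cost) / token sale price."""
    return (token_price - gpu_cost) / token_price


gpt4o = blended_price(5.0, 15.0)   # $10.0/M at a 50/50 mix
gap = gpt4o / 0.90                 # vs Llama 3.1 70B at ~$0.90/M: ~11x (doc rounds to ~10x)
margin = take_rate(0.90, 0.55)     # with an assumed $0.55/M GPU cost: ~39% gross margin
```

The gap is sensitive to the traffic mix: output-heavy workloads (where GPT-4o charges $15/M) widen it, input-heavy ones narrow it, which is why the memo's ~10× figure is a blended estimate rather than a fixed ratio.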

5. Financials / Funding

| Round | Date | Amount | Valuation | Lead |
|---|---|---|---|---|
| Seed | 2022 | - | - | Sequoia |
| Series A | 2023-07 | $25M | - | Benchmark |
| Series B | 2024-07 | $52M | $552M post | Sequoia [2] |
| Series C (reported) | 2025-07 | ~$200M | ~$5.5B | Multiple growth funds (Reuters) [3] |
  • Founded: 2022, by departing members of Meta's PyTorch team
  • Total funding estimate: ~$300M+
  • Customer count: self-reported "thousands of enterprise customers" (including DoorDash, Quora public cases), ARR undisclosed

Note: a "$10B valuation" figure circulating elsewhere differs from the Reuters-reported ~$5.5B (2025-07); this page anchors on the Reuters $5.5B. If a 2026 round pushes the valuation to $10B, update per the news.

6. People & Relationships

  • CEO / Founder: Lin Qiao, former Meta PyTorch team lead and a key figure in PyTorch Distributed / inference; Fireworks's "proprietary engine" narrative is largely built on the PyTorch alumni network.
  • Investors: Sequoia, Benchmark, NVIDIA, AMD, MongoDB Ventures (strategic); Databricks Ventures (reported).
  • Competes with: Together AI, Inferact, Radixark, DeepInfra, Groq, Anyscale.
  • Partners with: NVIDIA (GPU + early hardware), AMD MI300X adaptation, MongoDB (vector integration).
  • Hosts models from: Meta (Llama), DeepSeek, Mistral, Alibaba (Qwen), Google (Gemma).

Sources

Last compiled: 2026-05-10