Company
DeepInfra
Minimalist pricing, ultra-broad menu, budget-tier 3P inference; frequently the market's lowest-priced provider for the same open-source model.
1. Core Product / Service
- Serverless Token API: hundreds of open-source models, one-click callable: the full Llama 3 family, Qwen2.5, Mistral, Mixtral, DeepSeek V3 / R1, Gemma, Phi, Whisper, BGE embeddings, SDXL, FLUX, etc.
- Dedicated Deployments: sold by GPU-hour, A100 / H100 / H200, customer-exclusive.
- Embedding & Vision API: covers embeddings / image gen / TTS / STT beyond text LLMs, a rare "all-in-one" open-source model warehouse.
- OpenAI-compatible endpoint: drop-in replacement for OpenAI SDK.
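A minimal sketch of what "drop-in replacement" means in practice: only the client's `base_url`, `api_key`, and model id change, while the request shape stays OpenAI's. The base URL and model id below are assumptions for illustration; verify them against DeepInfra's current docs.

```python
def migrate_client_kwargs(openai_kwargs: dict) -> dict:
    """Return OpenAI-SDK client kwargs repointed at DeepInfra.

    The base_url here is an assumed endpoint -- check DeepInfra's docs.
    All other kwargs (timeouts, retries, ...) pass through unchanged.
    """
    kwargs = dict(openai_kwargs)
    kwargs["base_url"] = "https://api.deepinfra.com/v1/openai"  # assumed
    kwargs["api_key"] = "YOUR_DEEPINFRA_API_KEY"
    return kwargs

# Usage with the real SDK (not executed here; model id is illustrative):
#   from openai import OpenAI
#   client = OpenAI(**migrate_client_kwargs({"timeout": 30}))
#   client.chat.completions.create(
#       model="meta-llama/Meta-Llama-3.1-70B-Instruct",
#       messages=[{"role": "user", "content": "hi"}],
#   )
cfg = migrate_client_kwargs({"timeout": 30})
print(cfg["base_url"])
```

The design point: because the wire format is OpenAI's, migration cost is a two-line config change rather than a rewrite of call sites.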
- Based in Palo Alto; the team is mainly ex-IMO medalists and systems engineers, not a typical VC-funded big spender.
2. Target Users & Pain Points
- Price-sensitive developers / small SaaS: migrating OpenAI calls to Llama 70B cuts cost 5-10×; DeepInfra is the lowest-price candidate on the market.
- Batch / offline pipeline teams: large-scale embeddings, long-document summarization — unit price determines whether the project is viable.
- Pain points: self-hosting vLLM clusters has a high barrier to entry; other 3P providers' prices remain on the higher side; DeepInfra holds the "lowest price + wide menu" niche.
- Trade-off: SLA and latency trail Together, Fireworks, and Groq, and rate limits tighten at peak. Suited to "price-first, SLA-second" workloads.
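For the batch/offline teams above, "unit price determines viability" is simple arithmetic worth making explicit. A sketch using the Llama 3.1 8B serverless rates quoted in Section 4 ($0.04 in / $0.04 out per M); the corpus size is illustrative:

```python
def job_cost_usd(tokens: int, price_in_per_m: float, price_out_per_m: float,
                 out_ratio: float = 0.0) -> float:
    """Cost of a batch job, splitting tokens into input vs generated output.

    Prices are USD per million tokens. For embedding jobs out_ratio is ~0
    (no generated tokens); for summarization it is the output share.
    """
    out_tokens = tokens * out_ratio
    in_tokens = tokens - out_tokens
    return in_tokens / 1e6 * price_in_per_m + out_tokens / 1e6 * price_out_per_m

# Hypothetical 2B-token embedding/summarization corpus at 8B rates:
cost = job_cost_usd(2_000_000_000, 0.04, 0.04)
print(cost)  # -> 80.0 USD; the same corpus at ~$10/M blended would be ~$20,000
```

At that spread, a project that is a rounding error on DeepInfra can be a five-figure line item on a frontier 1P model, which is exactly why unit price gates these workloads.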
3. Competitive Landscape
| Competitor | Positioning | Vs. DeepInfra |
|---|---|---|
| together-ai | Proprietary kernel + GPU cluster | Together has stronger performance / enterprise product depth; DeepInfra has lower price |
| fireworks-ai | FireAttention + compound AI | Fireworks leans mid-to-high end |
| Replicate | image / dev API | Replicate leans developer experimentation, billed by second; DeepInfra leans token billing, lower price |
| OpenRouter | Aggregator routing | OpenRouter uses DeepInfra as one of its upstreams |
| Hugging Face Inference Endpoints | Model hub with hosting | HF leans dedicated endpoints; DeepInfra leans serverless |
Differentiation: price (lowest-tier range) + menu width (embeddings / image / audio all included) + OpenAI compatibility. Often the cheapest listed provider on OpenRouter, with a considerable share of OpenRouter's traffic allocation.
4. Unique Observations
- Per-token pricing (serverless, public as of 2026-05): Llama 3.1 8B $0.04 input + $0.04 output / M (~$0.04/M blended, the market floor); Llama 3.1 70B ~$0.23 input + $0.40 output / M (blended ~$0.30/M); Llama 3.1 405B ~$0.80 input + $0.80 output / M; DeepSeek V3 ~$0.49/M blended; Qwen2.5 72B ~$0.13 + $0.39 / M [1]. Strictly speaking, DeepInfra occupies a "low-price range across model sizes" rather than a single flat floor.
- vs. 1P price gap: Llama 3.1 70B at $0.30/M blended vs GPT-4o blended ~$10/M → **~33× price gap**. When the task can tolerate Llama 70B output quality, DeepInfra is the extreme economic point: run ~30 Llamas for the price of 1 GPT.
- vs. peers: DeepInfra Llama 70B ~$0.30/M < Groq blended ~$0.70/M (but Groq is ~5× faster) < Together / Fireworks ~$0.88/M. 3P inference is now clearly tiered: speed tier (Groq), premium tier (Fireworks, Together), budget tier (DeepInfra).
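The blended figures above can be reproduced with a small helper. This assumes a 1:1 input/output token mix, which is one common blending convention; real blends depend on workload, so the quoted ~$0.30/M and 33× are rounded versions of the same arithmetic:

```python
def blended_per_m(price_in: float, price_out: float,
                  in_share: float = 0.5) -> float:
    """Blended $/M tokens for a given input-token share of traffic.

    in_share=0.5 assumes a 1:1 input/output mix (an assumption, not a
    DeepInfra-published convention).
    """
    return in_share * price_in + (1 - in_share) * price_out

llama_70b = blended_per_m(0.23, 0.40)  # 0.315, rounds to the quoted ~$0.30/M
gpt_4o = 10.0                          # ~blended $/M from the comparison above
print(round(gpt_4o / llama_70b))       # -> 32; the text's 33x uses rounded $0.30
```

Input-heavy workloads (RAG, long-context summarization) would shift `in_share` well above 0.5 and pull the 70B blend closer to its $0.23 input rate, widening the gap further.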
- Inference engine: no public claim of a proprietary engine; judging from throughput/latency profiles and job postings, likely vLLM-based with in-house patches. Differentiation lies in operations and aggressive batching, not the kernel layer.
- Compute sourcing: H100 / H200 mostly a mix of Nebius / CoreWeave and its own data centers; self-reports some self-managed colo (a conservative, capex-light path).
- Take rate: industry estimates put budget-tier gross margin at a thin 10-25%, relying on batching density and long-tail embedding traffic for volume. This couples with OpenRouter's price war: once OpenRouter lists DeepInfra as cheapest, volume flows in.
- Capital model: a rare case of profitable operation with no disclosed large VC funding; a quiet "small but lean" outlier in the industry. Most peers have raised hundreds of millions from VCs; DeepInfra leans bootstrap / small rounds.
- Risk: rising upstream GPU prices plus continuously falling per-token prices squeeze gross margin further; if OpenRouter becomes the dominant entry point, DeepInfra's bargaining power erodes.
5. Financials / Funding
- Founded: 2022 (Palo Alto)
- Funding: limited public records; Crunchbase shows low-single-digit millions at seed / pre-seed, with no disclosed large Series A/B; industry consensus is it is likely bootstrapped plus small strategic investors [3]
- Customers: large long-tail dev base + indirect traffic via OpenRouter; specific ARR / token volume not disclosed
Note: DeepInfra is the most financially opaque of these 10; should update with any new round disclosure.
6. People & Relationships
- Founder / CEO: Nikola Borisov — ex-IMO / early Google / Slashdot engineer background; team mostly systems / GPU / compiler engineers.
- Investors: no disclosed major VC (a few strategic / angels).
- Partners: OpenRouter (largest indirect traffic entry point), Hugging Face (model-catalog sync).
- Competes with: together-ai, fireworks-ai, Replicate, groq (when buyers trade down on price).
- Hosts models from: Meta (Llama), DeepSeek, Mistral, Alibaba (Qwen), Google (Gemma), Black Forest Labs (FLUX), Stability AI (SDXL).
Sources
- [1] https://deepinfra.com/pricing (2026-05-10)
- [2] https://deepinfra.com/models (2026-05-10)
- [3] https://www.crunchbase.com/organization/deepinfra (2026-05-10)
- [4] https://artificialanalysis.ai/providers/deepinfra (2026-05-10)
Last compiled: 2026-05-10