Company

DeepInfra

Minimalist pricing and an ultra-broad menu make it the budget tier of third-party (3P) inference: frequently the lowest-priced provider on the market for the same open-source model.

1. Core Product / Service

  • Serverless Token API: hundreds of open-source models, one-click callable — the full Llama 3 family, Qwen2.5, Mistral, Mixtral, DeepSeek V3 / R1, Gemma, Phi, Whisper, BGE embeddings, SDXL, FLUX, etc.
  • Dedicated Deployments: sold by GPU-hour, A100 / H100 / H200, customer-exclusive.
  • Embedding & Vision API: covers embeddings / image gen / TTS / STT beyond text LLMs, a rare "all-in-one" open-source model warehouse.
  • OpenAI-compatible endpoint: drop-in replacement for OpenAI SDK.
  • Based in Palo Alto; the team is mainly ex-IMO / systems engineers, not a typical VC-funded big spender.
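The OpenAI-compatible endpoint means an existing client only needs its base URL swapped. A minimal stdlib sketch (base URL and model id reflect DeepInfra's public docs at time of writing; verify before use — the request only fires if a key is configured):

```python
# Sketch: calling DeepInfra's OpenAI-compatible chat endpoint with the stdlib.
# Any OpenAI SDK client works the same way by swapping base_url to BASE_URL.
import json
import os
import urllib.request

BASE_URL = "https://api.deepinfra.com/v1/openai"  # assumed from public docs
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model id
    "messages": [{"role": "user", "content": "One-line summary of vLLM?"}],
    "max_tokens": 128,
}

api_key = os.environ.get("DEEPINFRA_API_KEY")
if api_key:  # only hit the network when a key is actually configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
```

The request and response schema are the standard OpenAI chat-completions shape, which is what makes migration a one-line config change.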

2. Target Users & Pain Points

  • Price-sensitive developers / small SaaS: migrating OpenAI calls to Llama 70B cuts costs 5-10×, and DeepInfra is the lowest-price candidate on the market.
  • Batch / offline pipeline teams: large-scale embeddings, long-document summarization — unit price determines whether the project is viable.
  • Pain points: self-hosting vLLM clusters has a high barrier to entry, and other 3P providers' prices remain on the higher side; DeepInfra occupies the "lowest price + wide menu" niche.
  • Trade-off: SLA / latency are inferior to Together, Fireworks, Groq; rate limits tighten during peaks. Suited for "price-priority, SLA-secondary" workloads.
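For the batch / offline case, viability really is a one-line cost formula. A back-of-envelope sketch (document count, token length, and unit price below are illustrative assumptions, not DeepInfra's actual rates):

```python
# Back-of-envelope check of whether a batch job clears budget at a given
# per-million-token price. All numbers here are illustrative assumptions.
def batch_cost_usd(docs: int, avg_tokens: int, price_per_m_tokens: float) -> float:
    """Total cost in USD for processing `docs` documents of `avg_tokens` each."""
    return docs * avg_tokens / 1e6 * price_per_m_tokens

# e.g. embedding 10M docs of ~500 tokens at a hypothetical $0.01/M tokens:
cost = batch_cost_usd(10_000_000, 500, 0.01)
print(cost)  # 5B tokens -> $50; at a 10x higher unit price the same job is $500
```

At this scale the unit price is the whole project budget, which is why batch teams shop on price alone.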

3. Competitive Landscape

| Competitor | Positioning | Vs. DeepInfra |
| --- | --- | --- |
| together-ai | Proprietary kernels + GPU clusters | Together has stronger performance / enterprise product depth; DeepInfra is cheaper |
| fireworks-ai | FireAttention + compound AI | Fireworks leans mid-to-high end |
| Replicate | Image / dev API | Replicate leans developer experimentation, billed by the second; DeepInfra leans token billing at lower prices |
| OpenRouter | Aggregator routing | OpenRouter uses DeepInfra as one of its upstreams |
| Hugging Face Inference Endpoints | Model hub with hosting | HF skews dedicated; DeepInfra skews serverless |

Differentiation: price (lowest-tier range) + menu width (embeddings / image / audio all included) + OpenAI compatibility. Often appears as the cheapest provider on OpenRouter, with a considerable share of OpenRouter's traffic allocation.

4. Unique Observations

  • Per-token pricing (serverless, 2026-05 public): Llama 3.1 8B $0.04 input + $0.04 output / M (~$0.04/M blended, the market floor); Llama 3.1 70B ~$0.23 input + $0.40 output / M (blended ~$0.30/M); Llama 3.1 405B ~$0.80 input + $0.80 output / M; DeepSeek V3 ~$0.49/M blended; Qwen2.5 72B ~$0.13 + $0.39 / M [1]. The oft-quoted "$0.20/M Llama" figure is too high for 8B, close for 70B, and too low for 405B; strictly speaking, DeepInfra sits in the low-price range across all sizes.
  • vs 1P price gap: Llama 3.1 70B @ $0.30/M blended vs GPT-4o blended ~$10/M → **33× price gap**. When the task can tolerate Llama 70B output quality, DeepInfra is the extreme economic point of "running 30 Llamas for the price of 1 GPT."
  • vs peers: DeepInfra Llama 70B ~$0.30/M < Groq blended ~$0.70/M (but Groq is ~5× faster) < Together / Fireworks ~$0.88/M. This shows 3P is now clearly tiered: speed tier (Groq) / premium tier (Fireworks, Together) / budget tier (DeepInfra).
  • Inference engine: no public claim of a proprietary engine; judging from throughput / latency figures and job postings, likely vLLM-based with in-house patches. Differentiation is in ops and extreme batching, not at the kernel layer.
  • Compute sourcing: H100 / H200 mostly from a Nebius / CoreWeave / own-datacenter mix; self-reports some self-managed colocation (a conservative, capex-light path).
  • Take rate: industry estimates put budget-tier gross margin at a thin 10-25%, relying on batching density and long-tail embedding traffic for volume. This couples with OpenRouter's price war: once OpenRouter lists DeepInfra as cheapest, volume flows in.
  • Capital model: a rare case of profitable operation with no disclosed large VC funding — a quiet "small but profitable" case in the industry. Most peers raised hundreds of millions from VCs; DeepInfra leans bootstrap / small rounds.
  • Risk: upstream GPU price hike + continuously falling token unit price squeezes gross margin further; if OpenRouter becomes the absolute entry point, DeepInfra's bargaining power declines.
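The blended-price comparisons above reduce to simple weighted averages. A minimal sketch (prices as quoted in this section; a 50/50 input/output split is assumed, and `in_share` should be set to the workload's real input fraction):

```python
# Blended $/M-token price as a traffic-weighted mix of input and output rates.
# Rates below are the 2026-05 figures quoted in this section.
def blended(p_in: float, p_out: float, in_share: float = 0.5) -> float:
    return p_in * in_share + p_out * (1.0 - in_share)

llama_70b = blended(0.23, 0.40)  # ~0.315, matching the ~$0.30/M blended above
gap = 10.0 / llama_70b           # vs GPT-4o at ~$10/M blended: ~32x
print(round(llama_70b, 3), round(gap, 1))
```

Output-heavy workloads (low `in_share`) drift toward the output rate, which is why the 70B blended figure lands closer to $0.32/M than to the $0.23/M input price.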

5. Financials / Funding

  • Founded: 2022 (Palo Alto)
  • Funding: limited public records; Crunchbase shows a low-single-digit-millions seed / pre-seed, no disclosed large Series A/B; industry view is likely bootstrapped plus small strategic investors [3]
  • Customers: large long-tail dev base + indirect traffic via OpenRouter; specific ARR / token volume not disclosed

Note: DeepInfra is the most financially opaque of these 10; update if any new round is disclosed.

6. People & Relationships

  • Founder / CEO: Nikola Borisov — ex-IMO / early Google / Slashdot engineer background; team mostly systems / GPU / compiler engineers.
  • Investors: no disclosed major VC (a few strategic / angels).
  • Partners: OpenRouter (largest indirect traffic entry), Hugging Face (model-layer sync).
  • Competes with: together-ai, fireworks-ai, Replicate, Groq (when tiering down).
  • Hosts models from: Meta (Llama), DeepSeek, Mistral, Alibaba (Qwen), Google (Gemma), Black Forest Labs (FLUX), Stability AI (SDXL).

Sources

Last compiled: 2026-05-10