Company

DeepInfra

Minimalist pricing and an ultra-broad menu make it the budget tier of third-party (3P) inference: frequently the lowest-priced provider on the market for the same open-source model.

1. Core Product / Service

  • Serverless Token API: hundreds of open-source models, one-click callable — the full Llama 3 family, Qwen2.5, Mistral, Mixtral, DeepSeek V3 / R1, Gemma, Phi, Whisper, BGE embeddings, SDXL, FLUX, etc.
  • Dedicated Deployments: sold by GPU-hour, A100 / H100 / H200, customer-exclusive.
  • Embedding & Vision API: covers embeddings / image gen / TTS / STT beyond text LLMs, a rare "all-in-one" open-source model warehouse.
  • OpenAI-compatible endpoint: drop-in replacement for OpenAI SDK.
  • Based in Palo Alto; the team is mainly ex-IMO / systems engineers, not a typical VC-funded big spender.
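The OpenAI-compatible endpoint means an existing client only needs its base URL swapped. A minimal stdlib sketch (base URL and model id reflect DeepInfra's public docs at time of writing; verify before use — the request only fires if a key is configured):

```python
# Sketch: calling DeepInfra's OpenAI-compatible chat endpoint with the stdlib.
# Any OpenAI SDK client works the same way by swapping base_url to BASE_URL.
import json
import os
import urllib.request

BASE_URL = "https://api.deepinfra.com/v1/openai"  # assumed from public docs
payload = {
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # example model id
    "messages": [{"role": "user", "content": "One-line summary of vLLM?"}],
    "max_tokens": 128,
}

api_key = os.environ.get("DEEPINFRA_API_KEY")
if api_key:  # only hit the network when a key is actually configured
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
        print(body["choices"][0]["message"]["content"])
```

The request and response schema are the standard OpenAI chat-completions shape, which is what makes migration a one-line config change.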

2. Target Users & Pain Points

  • Price-sensitive developers / small SaaS: migrating OpenAI calls to Llama 70B cuts costs 5-10×, and DeepInfra is the lowest-price candidate on the market.
  • Batch / offline pipeline teams: large-scale embeddings, long-document summarization — unit price determines whether the project is viable.
  • Pain points: self-hosting vLLM clusters has a high barrier to entry, and other 3P providers' prices remain on the higher side; DeepInfra occupies the "lowest price + wide menu" niche.
  • Trade-off: SLA / latency are inferior to Together, Fireworks, Groq; rate limits tighten during peaks. Suited for "price-priority, SLA-secondary" workloads.
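For the batch / offline case, viability really is a one-line cost formula. A back-of-envelope sketch (document count, token length, and unit price below are illustrative assumptions, not DeepInfra's actual rates):

```python
# Back-of-envelope check of whether a batch job clears budget at a given
# per-million-token price. All numbers here are illustrative assumptions.
def batch_cost_usd(docs: int, avg_tokens: int, price_per_m_tokens: float) -> float:
    """Total cost in USD for processing `docs` documents of `avg_tokens` each."""
    return docs * avg_tokens / 1e6 * price_per_m_tokens

# e.g. embedding 10M docs of ~500 tokens at a hypothetical $0.01/M tokens:
cost = batch_cost_usd(10_000_000, 500, 0.01)
print(cost)  # 5B tokens -> $50; at a 10x higher unit price the same job is $500
```

At this scale the unit price is the whole project budget, which is why batch teams shop on price alone.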

3. Competitive Landscape

| Competitor | Positioning | Vs. DeepInfra |
| --- | --- | --- |
| together-ai | Proprietary kernels + GPU clusters | Together has stronger performance / enterprise product depth; DeepInfra is cheaper |
| fireworks-ai | FireAttention + compound AI | Fireworks leans mid-to-high end |
| Replicate | Image / dev API | Replicate leans developer experimentation, billed by the second; DeepInfra leans token billing at lower prices |
| OpenRouter | Aggregator routing | OpenRouter uses DeepInfra as one of its upstreams |
| Hugging Face Inference Endpoints | Model hub with hosting | HF skews dedicated; DeepInfra skews serverless |

Differentiation: price (lowest-tier range) + menu width (embeddings / image / audio all included) + OpenAI compatibility. Often appears as the cheapest provider on OpenRouter, with a considerable share of OpenRouter's traffic allocation.

4. Unique Observations

  • Per-token pricing (serverless, 2026-05 public): Llama 3.1 8B $0.04 input + $0.04 output / M (~$0.04/M blended, the market floor); Llama 3.1 70B ~$0.23 input + $0.40 output / M (blended ~$0.30/M); Llama 3.1 405B ~$0.80 input + $0.80 output / M; DeepSeek V3 ~$0.49/M blended; Qwen2.5 72B ~$0.13 + $0.39 / M [1]. The oft-quoted "$0.20/M Llama" figure is too high for 8B, close for 70B, and too low for 405B; strictly speaking, DeepInfra sits in the low-price range across all sizes.
  • vs 1P price gap: Llama 3.1 70B @ $0.30/M blended vs GPT-4o blended ~$10/M → **33× price gap**. When the task can tolerate Llama 70B output quality, DeepInfra is the extreme economic point of "running 30 Llamas for the price of 1 GPT."
  • vs peers: DeepInfra Llama 70B ~$0.30/M < Groq blended ~$0.70/M (but Groq is ~5× faster) < Together / Fireworks ~$0.88/M. This shows 3P is now clearly tiered: speed tier (Groq) / premium tier (Fireworks, Together) / budget tier (DeepInfra).
  • Inference engine: no public claim of a proprietary engine; judging from throughput / latency figures and job postings, likely vLLM-based with in-house patches. Differentiation is in ops and extreme batching, not at the kernel layer.
  • Compute sourcing: H100 / H200 mostly from a Nebius / CoreWeave / own-datacenter mix; self-reports some self-managed colocation (a conservative, capex-light path).
  • Take rate: industry estimates put budget-tier gross margin at a thin 10-25%, relying on batching density and long-tail embedding traffic for volume. This couples with OpenRouter's price war: once OpenRouter lists DeepInfra as cheapest, volume flows in.
  • Capital model: a rare case of profitable operation with no disclosed large VC funding — a quiet "small but profitable" case in the industry. Most peers raised hundreds of millions from VCs; DeepInfra leans bootstrap / small rounds.
  • Risk: upstream GPU price hike + continuously falling token unit price squeezes gross margin further; if OpenRouter becomes the absolute entry point, DeepInfra's bargaining power declines.
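The blended-price comparisons above reduce to simple weighted averages. A minimal sketch (prices as quoted in this section; a 50/50 input/output split is assumed, and `in_share` should be set to the workload's real input fraction):

```python
# Blended $/M-token price as a traffic-weighted mix of input and output rates.
# Rates below are the 2026-05 figures quoted in this section.
def blended(p_in: float, p_out: float, in_share: float = 0.5) -> float:
    return p_in * in_share + p_out * (1.0 - in_share)

llama_70b = blended(0.23, 0.40)  # ~0.315, matching the ~$0.30/M blended above
gap = 10.0 / llama_70b           # vs GPT-4o at ~$10/M blended: ~32x
print(round(llama_70b, 3), round(gap, 1))
```

Output-heavy workloads (low `in_share`) drift toward the output rate, which is why the 70B blended figure lands closer to $0.32/M than to the $0.23/M input price.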

5. Financials / Funding

  • Founded: 2022 (Palo Alto)
  • Funding: limited public records; Crunchbase shows a low-single-digit-millions seed / pre-seed, no disclosed large Series A/B; industry view is likely bootstrapped plus small strategic investors [3]
  • Customers: large long-tail dev base + indirect traffic via OpenRouter; specific ARR / token volume not disclosed

Note: DeepInfra is the most financially opaque of these 10; update if any new round is disclosed.

6. People & Relationships

  • Founder / CEO: Nikola Borisov — ex-IMO / early Google / Slashdot engineer background; team mostly systems / GPU / compiler engineers.
  • Investors: no disclosed major VC (a few strategic / angels).
  • Partners: OpenRouter (largest indirect traffic entry), Hugging Face (model-layer sync).
  • Competes with: together-ai, fireworks-ai, Replicate, Groq (when tiering down).
  • Hosts models from: Meta (Llama), DeepSeek, Mistral, Alibaba (Qwen), Google (Gemma), Black Forest Labs (FLUX), Stability AI (SDXL).

Sources

Last compiled: 2026-05-10