Company

Replicate

Turn any open-source model (image, video, voice, LLM) into an HTTP API with one click; the dev-first inference platform billed by GPU-second.

1. Core Product / Service

Replicate is the most dev-first player in third-party (3P) inference, selling on developer experience + model-menu breadth:

  • Cog framework: an open-source tool (maintained by Replicate) that packages any ML model into a hostable container; anyone who pushes a model to Replicate gets an HTTP API. This is the platform's moat — a large number of long-tail image / video / audio model authors use Cog to publish their models, and Replicate becomes the aggregation site.
  • Run on Replicate: use other people's publicly published models. The menu includes SDXL, FLUX, Stable Video Diffusion, Whisper, Llama 3, Llava, CLIP, and many community fine-tuned models — tens of thousands of models.
  • Deployments: private / dedicated deployments with an exclusive GPU pool, avoiding cold starts.
  • Billing: per GPU-second (not per-token) — this is what fundamentally distinguishes Replicate from other 3P providers. Per-token billing on LLMs also exists but isn't the main axis.
  • Target user scenarios: image / video generation prototyping, indie developers, community fine-tune showcases, scenarios needing "a few lines of code to call any model".
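The "few lines of code" workflow maps onto Replicate's HTTP API: a POST to /v1/predictions with a model version hash and an input dict. A minimal stdlib sketch that builds (but does not send) such a request — the version hash and prompt are placeholders, and the official `replicate` Python client wraps this same call:

```python
import json
import os
import urllib.request

API_URL = "https://api.replicate.com/v1/predictions"

def build_prediction_request(version: str, model_input: dict) -> urllib.request.Request:
    """Build a Replicate prediction request without sending it.

    The API expects a POST with a model version hash and an ``input``
    dict, authorized via a bearer token.
    """
    body = json.dumps({"version": version, "input": model_input}).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {os.environ.get('REPLICATE_API_TOKEN', '')}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_prediction_request(
    "sdxl-version-hash",  # placeholder, not a real version hash
    {"prompt": "an astronaut riding a horse"},
)
# Sending is one call away: urllib.request.urlopen(req)
```

The same shape works for any model on the menu; only the version hash and input schema change, which is what makes the long-tail aggregation viable.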

2. Target Users & Pain Points

  • Indie / early developers / weekend projects: 5 lines of Python / curl to call any SOTA open-source model, no need to own a GPU, no need to run an inference stack.
  • Image / video / audio production pipelines: the image-model ecosystem on Replicate is the richest, a generation ahead of token-only platforms.
  • Pain point: self-hosting SDXL / FLUX requires A100 / H100 + model warm-up + latency optimization; Replicate hides all this behind an API.
  • vs own GPUs: Replicate is economical at small volume; for steady-state high-traffic workloads (hundreds of dollars per month and up), self-hosting is cheaper — but many users never reach that crossover point.
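The crossover point in the last bullet can be estimated. A sketch assuming a self-hosted A100 rents for ~$2/hr (an illustrative rate; actual provider pricing varies) against Replicate's published A100 GPU-second price:

```python
REPLICATE_A100_PER_SEC = 0.0014   # Replicate's public A100 GPU-second rate
SELF_HOST_A100_PER_HR = 2.00      # assumed rental rate; varies by provider
HOURS_PER_MONTH = 730

monthly_self_host = SELF_HOST_A100_PER_HR * HOURS_PER_MONTH
# GPU-seconds of Replicate usage that cost as much as one dedicated GPU-month
breakeven_gpu_seconds = monthly_self_host / REPLICATE_A100_PER_SEC
# Fraction of the month that GPU would need to be busy to justify self-hosting
breakeven_utilization = breakeven_gpu_seconds / (HOURS_PER_MONTH * 3600)

print(f"self-host cost/month:   ${monthly_self_host:,.0f}")
print(f"break-even GPU-seconds: {breakeven_gpu_seconds:,.0f}")
print(f"break-even utilization: {breakeven_utilization:.0%}")
```

Under these assumptions, a workload must keep a GPU roughly 40% utilized around the clock before self-hosting wins — consistent with the claim that many users never hit the critical point.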

3. Competitive Landscape

  • Modal (serverless GPU compute): Modal is raw functions + GPU; Replicate is a hosted model library + API. Modal is more flexible, Replicate more turnkey.
  • Fireworks AI / Together AI (LLM token APIs): both focus on LLMs; Replicate focuses on image/video + the long tail.
  • DeepInfra (budget LLM API): limited price competition, since Replicate doesn't compete on LLM pricing.
  • Hugging Face Inference Endpoints (model hub + hosting): HF's model repo is larger; Replicate's API UX + Cog tooling experience is better.
  • fal.ai (image / video generation specialist): direct competition in the image / video tier; fal.ai is faster, but its model menu is narrower.
  • RunPod (raw GPU rental): RunPod doesn't abstract model hosting; Replicate is a higher-layer product.

Differentiation: Cog toolchain + model ecosystem community + GPU-second billing + lowest-threshold dev UX.

4. Unique Observations

  • Per-token pricing (LLMs, as of 2026-05): Llama 3 70B ~$0.65/M input + $2.75/M output (blended ~$1.5/M); Llama 3 8B ~$0.05/M input + $0.25/M output. Many LLMs are token-wrapped, but the underlying billing is still GPU-second. LLM pricing is mid-to-expensive and not Replicate's selling point.
  • GPU-second billing: Nvidia A100 (80GB) ~$0.001400/s; Nvidia H100 ~$0.001525/s; T4 ~$0.000225/s [1]. One SDXL image generation typically takes ~3-5 seconds on an A100 → ~$0.004-0.007/image.
  • vs first-party price gap (LLMs): Llama 3 70B blended ~$1.5/M vs GPT-4o ~$10/M → ~6× gap. But vs DeepInfra Llama 70B ~$0.30/M → Replicate is 5× more expensive. This shows that Replicate doesn't participate in the LLM token price war.
  • vs first-party price gap (image / video): FLUX-1.1-pro ~$0.04/img on Replicate vs a Midjourney subscription at ~$10/mo for ~200 images, i.e. ~$0.05/image; Replicate is on par with fal.ai and the official FLUX API — media generation is Replicate's true core battlefield.
  • Inference engine: each model brings its own (per author, in the Cog container) — the Replicate platform doesn't enforce a unified engine. So Replicate is a "scheduling layer + container orchestration" company, not an engine company. This is fundamentally different from the Together / Fireworks path.
  • Compute source: rents GPUs from providers such as RunPod / CoreWeave / GCP; doesn't own data centers. Take rate is (GPU-second sale price - upstream rental cost) / sale price; industry estimates put it at 30-40%.
  • Strategic trade-off: Cog + the community model give long-tail coverage no one can match, but any model sold per-token on Replicate is not cheap — packing a model into a generic Cog container adds cold-start and per-call overhead, so its per-token cost structure is poor. As a result, Replicate's LLM share has been steadily eroded by Together / Fireworks / DeepInfra, while its image / video position remains stable.
  • Capital model: backed by Y Combinator + a16z + Sequoia / NVentures, valued around $400M (2024 report) — far below Together, Fireworks, Groq. Reflects how "dev tool / model aggregator" vs "capital-heavy inference infrastructure" markets price differently.
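The pricing claims in the bullets above can be sanity-checked in a few lines; the upstream rental rate used in the take-rate estimate is an assumption, not a disclosed figure:

```python
# Per-image cost on an A100 at Replicate's GPU-second rate
a100_per_sec = 0.0014
image_cost = [t * a100_per_sec for t in (3, 5)]   # ~$0.004-0.007 per image

# Blended Llama 3 70B token price, assuming a 60/40 input/output token mix
blended = 0.60 * 0.65 + 0.40 * 2.75               # ~ $1.49 per M tokens

# Implied take rate if an A100 resold at $0.0014/s is rented upstream for
# ~$3.20/GPU-hour (an assumed rate; actual upstream pricing is not public)
sale_per_hr = a100_per_sec * 3600                 # $5.04 per GPU-hour
take_rate = (sale_per_hr - 3.20) / sale_per_hr    # ~36%, inside the 30-40% band
```

The blended $1.5/M figure only holds under a roughly 60/40 input/output mix; output-heavy workloads land closer to $2.75/M, widening the gap to DeepInfra further.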

5. Financials / Funding

  • Seed (2020): $2.4M, led by Y Combinator
  • Series A (2022): $17.8M, led by Andreessen Horowitz
  • Series B (2023-12): $40M at ~$350M post-money, led by Andreessen Horowitz [3]
  • Series C (reported, 2024-12): ~$50M at ~$500M post-money, a16z follow-on
  • Founded: 2019
  • Total raised: ~$110M (per the rounds above)
  • Public statements: millions of monthly active developers; specific ARR undisclosed

6. People & Relationships

  • Co-founders: Ben Firshman (CEO; ex-Docker, creator of Docker Compose) + Andreas Jansson (ex-Spotify).
  • Investors: a16z, Sequoia, Y Combinator, NVentures (NVIDIA), HOF Capital.
  • Partners: Black Forest Labs (FLUX early exclusive launch), Stability AI, Meta (Llama).
  • Competes with: Modal, fal.ai, Hugging Face Inference Endpoints, and Fireworks AI / Together AI (in the LLM tier).
  • Hosts models from: thousands of community authors + Black Forest Labs, Meta, Mistral, Stability AI, OpenAI Whisper.

Sources

Last compiled: 2026-05-10