Company

Baseten

Model-serving-specific platform: lets AI engineering teams deploy AI models like web services, with deep involvement in TensorRT-LLM, SGLang, and vLLM tuning.

1. Core Product / Service

Baseten is the third-party (3P) inference player that specializes in model-serving infrastructure:

  • Truss: open-source model-packaging framework (maintained by Baseten) that bundles model weights, inference code, and dependencies into a deployable unit. Similar to Replicate's Cog, but Truss is more production-grade, with an emphasis on observability.
  • Dedicated Deployments: customer-exclusive GPU pool, billed by GPU-minute, with Baseten auto-handling autoscaling / canary / blue-green deployment.
  • Model Library: pre-packaged model menu — Llama 3 full family, Whisper, SDXL, FLUX, custom fine-tuned weights, etc., one-click deployable to dedicated deployment.
  • Chains: a tool for stringing multi-model / multi-step inference into production pipelines (similar positioning to Fireworks AI's Compound AI).
  • Performance Engineering team: public case studies show the Baseten team doing TensorRT-LLM / vLLM / SGLang tuning together with large customers; this is the essential difference from players that treat the API as the abstraction boundary.
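The Truss packaging described above can be sketched as a minimal model wrapper. The `load()` / `predict()` structure follows Truss's documented `Model` class (`load()` runs once at startup, `predict()` per request); the echo "model" and config handling here are illustrative placeholders, not Baseten's actual code.

```python
# Sketch of a Truss model wrapper (conventionally model/model.py).
# The uppercase-echo "model" stands in for real weights so the
# interface shape is visible without ML dependencies.

class Model:
    def __init__(self, **kwargs):
        # Truss injects deployment context via kwargs (e.g. config,
        # secrets); a real model would read its weights path here.
        self._config = kwargs.get("config", {})
        self._model = None

    def load(self):
        # Called once when the deployment starts, so cold requests
        # don't pay the weight-loading cost.
        self._model = lambda text: text.upper()  # placeholder inference

    def predict(self, model_input):
        # Called per request; model_input is the parsed request body.
        prompt = model_input["prompt"]
        return {"output": self._model(prompt)}
```

Packaging the model this way is what makes it a "deployable unit": the same directory (code + config + requirements) can be pushed to a dedicated deployment without hand-writing a server.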

2. Target Users & Pain Points

  • AI companies with their own fine-tuned weights: teams unwilling to hand model weights to token-API providers (IP concerns) but also unwilling to self-build a K8s + Triton stack.
  • Latency / throughput-sensitive production applications: voice / real-time translation / multi-step agents — need dedicated GPUs rather than noisy serverless.
  • Pain points: self-hosting TensorRT-LLM requires CUDA / kernel / Triton experts; Baseten hides these behind the platform + team consulting services.
  • vs. token APIs: Baseten customers pay for isolation, controllable latency, and keeping first-party weights from leaking, not for token price.

3. Competitive Landscape

| Competitor | Positioning | vs. Baseten |
| --- | --- | --- |
| Modal | serverless GPU compute | Modal leans general GPU functions; Baseten leans a model-serving-specific stack |
| Replicate | model hub + Cog | Replicate is dev-first / long-tail; Baseten is enterprise / production |
| Fireworks AI / Together AI | token API + dedicated options | Both also offer dedicated; Baseten's edge is "completely your stack" (including custom engine and kernels) |
| AWS SageMaker | cloud-managed | SageMaker is complex; Baseten is simple |
| NVIDIA Triton | self-hosted | Self-deployed Triton is flexible; Baseten packages Triton as a product |
| BentoML / Anyscale | model-serving peers | Baseten leads on customer reputation / engineering depth |

Differentiation: the Truss framework + dedicated-deployment focus + performance-engineering services. In the middle market of "self-hosted models + want an SLA + don't want to operate Triton," it is the de facto standard.

4. Unique Observations

  • GPU pricing (dedicated, 2026-05): H100 80GB $0.10/min ($6/h), A100 80GB $0.07/min ($4.20/h), L4 ~$0.014/min, T4 ~$0.011/min [1]. More expensive than Modal (H100 ~$3.95/h) and RunPod; Baseten sells GPU + platform + engineering services.
  • Tokens are not directly priced: customers compute the token unit price themselves after deploying from the Baseten model library; industry measurements put Llama 70B on 1×H100 dedicated at ~$0.30-0.50/M blended (ideal full load), ~$0.50-1.0/M after autoscale buffer. Slightly more expensive than token-API players, but fully exclusive, with first-party weights and tunable kernels.
  • vs. 1P price gap: Llama 70B dedicated on Baseten ~$0.80/M vs. GPT-4o ~$10/M, roughly 12×; vs. DeepInfra ~$0.30/M, Baseten is 2-3× more expensive, but the premium buys isolation.
  • Inference engine: multi-engine polyglot. TensorRT-LLM, vLLM, SGLang, and TGI are all supported, with the Baseten team and the customer picking the optimal engine together. This is the essential difference from peers bound tightly to a single engine (inferact locked to vLLM, Fireworks AI locked to FireAttention).
  • Compute sourcing: rents H100 / A100 pools from multiple hyperscalers and GPU clouds (CoreWeave / Oracle / GCP / AWS) with a proprietary capacity-scheduling layer; does not self-build colocation. Take rate ≈ (sale price - GPU rental) / sale price; industry estimates put the dedicated tier at 30-40%.
  • Performance Engineering services: Baseten frequently publishes technical blogs like "we pushed Llama 70B on H100 to X tok/s" — both marketing asset and sales tool, directly selling "performance engineering capability" to enterprises.
  • Strategic tradeoff: does not participate in the token-serverless price war → higher unit price → narrower but stickier customer base. ARR growth pre-Series C was reportedly strong (no public specifics disclosed); one of the few dedicated-tier "pure business" players to scale.
  • Risk: if token-API players ramp dedicated products (Together and Fireworks are both strengthening dedicated), Baseten's middle market will be squeezed. Baseten must defend its differentiation via the engineering team's consulting services and polyglot-engine capability.
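The back-of-envelope math in the bullets above can be sketched directly. The $6/h H100 rate and the take-rate formula come from the text; the 4,000 tok/s blended throughput and the $4/h GPU rental cost are assumptions chosen for illustration (the throughput lands the result inside the quoted $0.30-0.50/M band).

```python
# Unit-economics sketch for a dedicated deployment.
H100_USD_PER_HOUR = 6.00      # Baseten dedicated H100 ($0.10/min), from the text
ASSUMED_TOK_PER_SEC = 4000    # hypothetical Llama 70B blended throughput on 1xH100
ASSUMED_RENTAL_PER_HOUR = 4.00  # hypothetical upstream GPU rental cost

def usd_per_million_tokens(gpu_usd_per_hour: float, tok_per_sec: float) -> float:
    """Blended $/M tokens on one GPU at ideal full load."""
    tokens_per_hour = tok_per_sec * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1_000_000

def take_rate(sale_usd_per_hour: float, rental_usd_per_hour: float) -> float:
    """(sale price - GPU rental) / sale price, per the compute-sourcing bullet."""
    return (sale_usd_per_hour - rental_usd_per_hour) / sale_usd_per_hour

cost = usd_per_million_tokens(H100_USD_PER_HOUR, ASSUMED_TOK_PER_SEC)
# 6 / (4000 * 3600) * 1e6 ~= 0.42 $/M, inside the quoted $0.30-0.50/M band
margin = take_rate(H100_USD_PER_HOUR, ASSUMED_RENTAL_PER_HOUR)
# (6 - 4) / 6 ~= 0.33, consistent with the 30-40% industry estimate
```

The autoscale-buffered figure in the text (~$0.50-1.0/M) is the same calculation at lower effective utilization: halving the blended throughput doubles the $/M cost.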

5. Financials / Funding

| Round | Date | Amount | Valuation | Lead |
| --- | --- | --- | --- | --- |
| Seed | 2019 | $2.6M | | Greylock |
| Series A | 2022 | $20M | | Greylock |
| Series B | 2023 | $40M | | Spark Capital |
| Series C | 2025-02 | $75M | ~$825M post | IVP [2] |
| Series D (reported) | 2025-12 | ~$150M | ~$2.1B | a16z (reported) |
  • Founded: 2019
  • Total raised: ~$290M
  • Customers: public cases include Descript, Patreon, Pictory, Robust Intelligence; ARR undisclosed but IVP / a16z entries reflect strong growth.

6. People & Relationships

  • Co-founders: Tuhin Srivastava (CEO, ex-Gumroad), Amir Haghighat, Pankaj Gupta, Phil Howes — all ex-Gumroad team; product DNA leans dev tool.
  • Investors: IVP, Spark Capital, Greylock, South Park Commons, a16z (reported).
  • Partners: NVIDIA (TensorRT-LLM early partner), Hugging Face.
  • Competes with: Modal, Replicate, Fireworks AI (dedicated tier), Together AI (dedicated tier), AWS SageMaker.

Sources

Last compiled: 2026-05-10