Company

Modal

Turning "writing a function" into "running code on a GPU" — a dev-first serverless GPU compute play, backed by a16z, founded by Erik Bernhardsson.

1. Core Product / Service

Modal is the abstraction layer underneath third-party (3P) inference: it sells neither tokens nor ready-made model APIs; it sells the developer experience of writing a Python function that runs serverlessly on a GPU.

  • Modal Functions: a Python decorator (@app.function(gpu="H100")) ships the function to a cloud GPU for execution (a minimal sketch follows this list). Cold starts in seconds, billed by the second.
  • Container + GPU scheduling: backed by an in-house container runtime plus GPU pools rented from multiple L2 providers (H100/A100/A10G/L40S/T4 — the full menu).
  • Web Endpoints: expose functions as HTTPS interfaces; many teams use Modal to self-host LLMs, batch jobs, and in-house small-model inference services.
  • Volumes / Networks / Sandboxes: complete sandbox + persistent storage abstraction, supporting untrusted code execution (an unexpectedly important selling point in the agent era — Claude / GPT in-house code execution also uses similar abstractions).
  • Not a token API: Modal doesn't pre-load a menu of "call Llama 70B → give me tokens"; you have to package the model yourself. This is the most fundamental difference vs Together/Fireworks.
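A minimal sketch of the pattern above, using Modal's documented Python SDK (modal.App, @app.function, .remote()); the app name and the H100 choice are illustrative, and exact decorator signatures vary by SDK version:

```python
import modal

app = modal.App("gpu-demo")  # illustrative app name

# Dependencies live in a container image; Modal builds and caches it remotely.
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image, timeout=600)
def gpu_info() -> str:
    # Imported inside the function, so the local client never needs torch installed.
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main() -> None:
    # .remote() ships the call to a cloud GPU; billed per second of runtime.
    print(gpu_info.remote())
```

`modal run gpu_demo.py` executes this once; `modal deploy` keeps it live, and Modal's web-endpoint decorator (e.g. @modal.fastapi_endpoint() in recent SDK versions) exposes the same function over HTTPS.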

2. Target Users & Pain Points

  • ML engineers / AI startups: want to run experiments, batch jobs, custom model inference — don't want to operate K8s + GPU autoscaling.
  • Agent / Sandbox companies: need a "safely run user-submitted code" sandbox abstraction; Modal's container isolation + elastic GPUs is one of the few ready-made solutions (see the sandbox sketch after this list).
  • Workflow / pipeline teams: run irregular loads like video processing, embedding indexing, and training sweeps — standing these up on Modal takes an order of magnitude less setup than K8s + Ray.
  • Pain point: building K8s + GPU operator + autoscaling + image pipeline yourself takes at least 3-6 person-months; Modal's decorator pattern compresses this into one line.
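A hedged sketch of that sandbox abstraction, using Modal's documented Sandbox API (modal.Sandbox.create, sb.exec); the app name is illustrative and method details may differ across SDK versions:

```python
import modal

# Sandboxes attach to an App; create_if_missing lets this run standalone.
app = modal.App.lookup("agent-sandbox", create_if_missing=True)

# Spin up an isolated container for untrusted (e.g. LLM-generated) code.
sb = modal.Sandbox.create(app=app)

# Run a command inside the sandbox and read its output.
proc = sb.exec("python", "-c", "print(21 * 2)")
print(proc.stdout.read())  # -> 42

sb.terminate()  # tear the container down; billing stops
```

Per Modal's docs the same pattern extends to GPU-backed sandboxes via a gpu argument at creation, which is what makes the agent use case and the elastic GPU pool one product.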

3. Competitive Landscape

| Competitor | Positioning | Vs. Modal |
| --- | --- | --- |
| Replicate | Model hub + Cog containers | Replicate leans toward hosted ready-made models; Modal leans toward writing your own code |
| RunPod | Per-minute GPU rental + serverless | RunPod is cheaper but weaker on dev UX; Modal is a higher-level abstraction |
| Baseten | Dedicated model serving | Baseten leans toward deployment + monitoring; Modal leans toward general-purpose GPU functions |
| AWS SageMaker / Lambda | General cloud-provider compute | Steep learning curve (and Lambda offers no GPUs); Modal is Python-native and simple |
| Beam.cloud / Banana | Same serverless-GPU class | Modal leads on capital and ecosystem |
| Fireworks AI / Together AI | Token APIs | Entirely different abstraction layer (functions vs tokens) |

Differentiation: Python-native ergonomics + integrated container / sandbox / GPU abstraction. Developers just write functions; no K8s.

4. Unique Observations

  • GPU-second pricing (2026-05): H100 $0.001097/s ($3.95/h), A100 80GB $0.000694/s ($2.50/h), A10G $0.000306/s ($1.10/h), T4 $0.000164/s ($0.59/h), CPU $0.0000131/core-s [1]. Slightly more expensive than RunPod, but much cheaper than AWS / GCP list prices for equivalent GPUs.
  • Tokens not priced directly: Modal doesn't sell "call Llama → get tokens"; customers must self-deploy. But the industry commonly runs vLLM on Modal as an in-house OpenAI-compatible endpoint: a self-deployed Llama 70B on 1×H100 at ~$3.95/h, pushing ~10K tok/s of batched vLLM output, works out to ~$0.11/M tokens marginal cost at ideal full GPU utilization; with idle time and autoscaling, more like ~$0.30-0.60/M in practice (the arithmetic is sketched in the code after this list). Slightly more expensive than DeepInfra's ~$0.30/M, but you get full control + private data.
  • Vs first-party (1P) price gap: treating Modal as the cost floor for a "self-hosted OpenAI-compatible alternative" — running Llama 70B at ~$0.50/M blended vs GPT-4o at ~$10/M is a ~20× price gap, but you own the latency and model ops yourself.
  • Inference engine: Modal doesn't mandate one — users can install vLLM, SGLang, TGI, or TensorRT-LLM. Modal is a GPU compute platform, not an engine.
  • Compute source: rents GPU pools from GCP / Oracle / CoreWeave / Lambda and other L2 providers; doesn't build hardware or data centers itself. Modal's differentiation sits at the scheduling layer: container orchestration + cold-start optimization (in-house lazy loading + image caching can get cold starts under 2s).
  • Agent / Sandbox pull on Modal: in the 2024-2025 agent era, many startups use Modal as a code-execution sandbox (the same demand served by Anthropic's and OpenAI's built-in code-execution features); a key source of Modal's incremental traffic.
  • Capital model + strategy: Modal doesn't sell its own token API, so it sits out the token price war; the moat is the engineer ecosystem + ergonomics — what Vercel is to web deployment, Modal aims to be for GPU compute. This is a16z's investment thesis.
  • Risks: as the dedicated-deployment products from Fireworks AI / Baseten / Together AI mature, "why not just use the token API?" gets asked more loudly. Modal has to win on the sandbox / agent / batch / custom-model scenarios that token APIs don't handle well.
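The unit economics above collapse into one formula: $/M tokens = GPU $/hour ÷ (tok/s × 3600 × utilization) × 10^6. A back-of-envelope sketch, using the list price and the assumed vLLM throughput from the bullets (estimates, not measurements):

```python
# Marginal $/M output tokens for self-hosted Llama 70B on a Modal H100.
H100_PER_HOUR = 3.95       # Modal list price, $/h (pricing bullet above)
THROUGHPUT_TOK_S = 10_000  # assumed batched vLLM output throughput

def cost_per_m_tokens(gpu_per_hour: float, tok_per_s: float,
                      utilization: float = 1.0) -> float:
    """Dollars per million output tokens at a given average GPU utilization."""
    tokens_per_hour = tok_per_s * 3600 * utilization
    return gpu_per_hour / tokens_per_hour * 1e6

print(cost_per_m_tokens(H100_PER_HOUR, THROUGHPUT_TOK_S))        # ~$0.11, ideal 100% busy
print(cost_per_m_tokens(H100_PER_HOUR, THROUGHPUT_TOK_S, 0.25))  # ~$0.44, idle/autoscale overhead
# vs GPT-4o at ~$10/M with a ~$0.50/M blended self-host cost: 10 / 0.50 = ~20x gap
```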

5. Financials / Funding

| Round | Date | Amount | Valuation | Lead |
| --- | --- | --- | --- | --- |
| Seed | 2022 | $7M | — | Lux Capital, Definition |
| Series A | 2023 | $16M | — | Redpoint |
| Series B | 2024-09 | $80M | $1.1B post | Lux Capital + Andreessen Horowitz [3] |
  • Founded: 2021
  • Total funding: ~$103M
  • Public ARR / user counts are not disclosed; industry estimates put active developers in the tens of thousands.

6. People & Relationships

  • Founder / CEO: Erik Bernhardsson — former Spotify ML infra head (author of the open-source Annoy library); a known figure in the ML infra community.
  • Co-founder / CTO: Akshat Bubna.
  • Investors: Lux Capital, Andreessen Horowitz, Redpoint, Definition, Amplify Partners.
  • Partners / Customers: public case studies include Suno, Ramp, ElevenLabs, and Substack; agent companies are well-represented.
  • Competes with: Replicate, Baseten, RunPod (serverless tier), Beam.cloud.

Sources

Last compiled: 2026-05-10