Company

Modal

Turning "writing a function" into "running code on a GPU" — a dev-first serverless GPU compute play, backed by a16z, founded by Erik Bernhardsson.

1. Core Product / Service

Modal is the abstraction layer underneath third-party (3P) inference: it sells neither tokens nor ready-made model APIs; it sells the developer experience of writing a Python function that runs serverlessly on a GPU.

  • Modal Functions: a Python decorator (@app.function(gpu="H100")) ships the function to a cloud GPU for execution (a minimal sketch follows this list). Cold starts in seconds, billed by the second.
  • Container + GPU scheduling: backed by an in-house container runtime plus GPU pools rented from multiple L2 providers (H100/A100/A10G/L40S/T4 — the full menu).
  • Web Endpoints: expose functions as HTTPS interfaces; many teams use Modal to self-host LLMs, batch jobs, and in-house small-model inference services.
  • Volumes / Networks / Sandboxes: complete sandbox + persistent storage abstraction, supporting untrusted code execution (an unexpectedly important selling point in the agent era — Claude / GPT in-house code execution also uses similar abstractions).
  • Not a token API: Modal doesn't pre-load a menu of "call Llama 70B → give me tokens"; you have to package the model yourself. This is the most fundamental difference vs Together/Fireworks.
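A minimal sketch of the pattern above, using Modal's documented Python SDK (modal.App, @app.function, .remote()); the app name and the H100 choice are illustrative, and exact decorator signatures vary by SDK version:

```python
import modal

app = modal.App("gpu-demo")  # illustrative app name

# Dependencies live in a container image; Modal builds and caches it remotely.
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="H100", image=image, timeout=600)
def gpu_info() -> str:
    # Imported inside the function, so the local client never needs torch installed.
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main() -> None:
    # .remote() ships the call to a cloud GPU; billed per second of runtime.
    print(gpu_info.remote())
```

`modal run gpu_demo.py` executes this once; `modal deploy` keeps it live, and Modal's web-endpoint decorator (e.g. @modal.fastapi_endpoint() in recent SDK versions) exposes the same function over HTTPS.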

2. Target Users & Pain Points

  • ML engineers / AI startups: want to run experiments, batch jobs, custom model inference — don't want to operate K8s + GPU autoscaling.
  • Agent / Sandbox companies: need a "safely run user-submitted code" sandbox abstraction; Modal's container isolation + elastic GPUs is one of the few ready-made solutions (see the sandbox sketch after this list).
  • Workflow / pipeline teams: run irregular loads like video processing, embedding indexing, and training sweeps — standing these up on Modal takes an order of magnitude less setup than K8s + Ray.
  • Pain point: building K8s + GPU operator + autoscaling + image pipeline yourself takes at least 3-6 person-months; Modal's decorator pattern compresses this into one line.
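A hedged sketch of that sandbox abstraction, using Modal's documented Sandbox API (modal.Sandbox.create, sb.exec); the app name is illustrative and method details may differ across SDK versions:

```python
import modal

# Sandboxes attach to an App; create_if_missing lets this run standalone.
app = modal.App.lookup("agent-sandbox", create_if_missing=True)

# Spin up an isolated container for untrusted (e.g. LLM-generated) code.
sb = modal.Sandbox.create(app=app)

# Run a command inside the sandbox and read its output.
proc = sb.exec("python", "-c", "print(21 * 2)")
print(proc.stdout.read())  # -> 42

sb.terminate()  # tear the container down; billing stops
```

Per Modal's docs the same pattern extends to GPU-backed sandboxes via a gpu argument at creation, which is what makes the agent use case and the elastic GPU pool one product.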

3. Competitive Landscape

| Competitor | Positioning | Vs. Modal |
| --- | --- | --- |
| Replicate | Model hub + Cog containers | Replicate leans toward hosted ready-made models; Modal leans toward writing your own code |
| RunPod | Per-minute GPU rental + serverless | RunPod is cheaper but weaker on dev UX; Modal is a higher-level abstraction |
| Baseten | Dedicated model serving | Baseten leans toward deployment + monitoring; Modal leans toward general-purpose GPU functions |
| AWS SageMaker / Lambda | General cloud-provider compute | Steep learning curve (and Lambda offers no GPUs); Modal is Python-native and simple |
| Beam.cloud / Banana | Same serverless-GPU class | Modal leads on capital and ecosystem |
| Fireworks AI / Together AI | Token APIs | Entirely different abstraction layer (functions vs tokens) |

Differentiation: Python-native ergonomics + integrated container / sandbox / GPU abstraction. Developers just write functions; no K8s.

4. Unique Observations

  • GPU-second pricing (2026-05): H100 $0.001097/s ($3.95/h), A100 80GB $0.000694/s ($2.50/h), A10G $0.000306/s ($1.10/h), T4 $0.000164/s ($0.59/h), CPU $0.0000131/core-s [1]. Slightly more expensive than RunPod, but much cheaper than AWS / GCP list prices for equivalent GPUs.
  • Tokens not priced directly: Modal doesn't sell "call Llama → get tokens"; customers must self-deploy. But the industry commonly runs vLLM on Modal as an in-house OpenAI-compatible endpoint: a self-deployed Llama 70B on 1×H100 at ~$3.95/h, pushing ~10K tok/s of batched vLLM output, works out to ~$0.11/M tokens marginal cost at ideal full GPU utilization; with idle time and autoscaling, more like ~$0.30-0.60/M in practice (the arithmetic is sketched in the code after this list). Slightly more expensive than DeepInfra's ~$0.30/M, but you get full control + private data.
  • Vs first-party (1P) price gap: treating Modal as the cost floor for a "self-hosted OpenAI-compatible alternative" — running Llama 70B at ~$0.50/M blended vs GPT-4o at ~$10/M is a ~20× price gap, but you own the latency and model ops yourself.
  • Inference engine: Modal doesn't mandate one — users can install vLLM, SGLang, TGI, or TensorRT-LLM. Modal is a GPU compute platform, not an engine.
  • Compute source: rents GPU pools from GCP / Oracle / CoreWeave / Lambda and other L2 providers; doesn't build hardware or data centers itself. Modal's differentiation sits at the scheduling layer: container orchestration + cold-start optimization (in-house lazy loading + image caching can get cold starts under 2s).
  • Agent / Sandbox pull on Modal: in the 2024-2025 agent era, many startups use Modal as a code-execution sandbox (the same demand served by Anthropic's and OpenAI's built-in code-execution features); a key source of Modal's incremental traffic.
  • Capital model + strategy: Modal doesn't sell its own token API, so it sits out the token price war; the moat is the engineer ecosystem + ergonomics — what Vercel is to web deployment, Modal aims to be for GPU compute. This is a16z's investment thesis.
  • Risks: as the dedicated-deployment products from Fireworks AI / Baseten / Together AI mature, "why not just use the token API?" gets asked more loudly. Modal has to win on the sandbox / agent / batch / custom-model scenarios that token APIs don't handle well.
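The unit economics above collapse into one formula: $/M tokens = GPU $/hour ÷ (tok/s × 3600 × utilization) × 10^6. A back-of-envelope sketch, using the list price and the assumed vLLM throughput from the bullets (estimates, not measurements):

```python
# Marginal $/M output tokens for self-hosted Llama 70B on a Modal H100.
H100_PER_HOUR = 3.95       # Modal list price, $/h (pricing bullet above)
THROUGHPUT_TOK_S = 10_000  # assumed batched vLLM output throughput

def cost_per_m_tokens(gpu_per_hour: float, tok_per_s: float,
                      utilization: float = 1.0) -> float:
    """Dollars per million output tokens at a given average GPU utilization."""
    tokens_per_hour = tok_per_s * 3600 * utilization
    return gpu_per_hour / tokens_per_hour * 1e6

print(cost_per_m_tokens(H100_PER_HOUR, THROUGHPUT_TOK_S))        # ~$0.11, ideal 100% busy
print(cost_per_m_tokens(H100_PER_HOUR, THROUGHPUT_TOK_S, 0.25))  # ~$0.44, idle/autoscale overhead
# vs GPT-4o at ~$10/M with a ~$0.50/M blended self-host cost: 10 / 0.50 = ~20x gap
```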

5. Financials / Funding

| Round | Date | Amount | Valuation | Lead |
| --- | --- | --- | --- | --- |
| Seed | 2022 | $7M | — | Lux Capital, Definition |
| Series A | 2023 | $16M | — | Redpoint |
| Series B | 2024-09 | $80M | $1.1B post | Lux Capital + Andreessen Horowitz [3] |
  • Founded: 2021
  • Total funding: ~$103M
  • Public ARR / user counts are not disclosed; industry estimates put active developers in the tens of thousands.

6. People & Relationships

  • Founder / CEO: Erik Bernhardsson — former Spotify ML infra head (author of the open-source Annoy library); a known figure in the ML infra community.
  • Co-founder / CTO: Akshat Bubna.
  • Investors: Lux Capital, Andreessen Horowitz, Redpoint, Definition, Amplify Partners.
  • Partners / Customers: public case studies include Suno, Ramp, ElevenLabs, and Substack; agent companies are well-represented.
  • Competes with: Replicate, Baseten, RunPod (serverless tier), Beam.cloud.

Sources

Last compiled: 2026-05-10