Renting GPU for Inference
How to think about running open-weight model inference on rented GPU rather than managed APIs.
The "open weights + rented GPU" path is now a viable middle ground between calling hosted APIs (openrouter, together-ai) and owning hardware. This module is a decision framework: when does it make sense, what serving stack to run, what GPU to pick, and how to compare providers.
1. When to rent GPU instead of calling an API
Hosted inference (per-token billing) wins on small-to-medium token volume, bursty traffic, and frontier closed models. Rented GPU starts to win when at least one of the following holds:
- Token volume is high enough that GPU-hour math beats per-token math. Rough back-of-envelope: a single A100 running a well-tuned 14–32B model at INT8/AWQ sustains 1–3k tokens/sec aggregate throughput. At ~$1.5/hr that works out to roughly $0.15–$0.45 per million tokens of GPU cost at full utilization, vs. $0.10–$2 per million on hosted APIs. Break-even comes faster than people think when traffic is steady (worked example at the end of this section).
- You need a model nobody hosts — a fine-tune, a niche open weight, a recent release before providers add it, or a custom LoRA stack.
- Latency / privacy / data-locality rules out third-party APIs.
- You want to control sampling / logits / speculative decoding / custom KV cache reuse that hosted endpoints don't expose.
If none of those apply, stay on openrouter or a direct provider. Rented GPU has real ops cost (image building, autoscaling, cold-start handling, eviction recovery on spot tiers) that per-token APIs hide.
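A minimal sketch of the break-even math from the first bullet. All inputs (GPU hourly rate, sustained throughput, hosted-API price) are illustrative assumptions — swap in your own quotes:

```python
# Back-of-envelope: at what steady daily token volume does an always-on pod
# beat per-token API billing? All numbers below are illustrative assumptions.

GPU_PER_HOUR = 1.5          # $/hr for a rented A100 (assumed)
POD_TOKENS_PER_SEC = 1500   # sustained aggregate throughput (assumed)
API_PER_MTOK = 0.50         # hosted-API price per million tokens (assumed)

pod_cost_per_day = GPU_PER_HOUR * 24
# Daily volume (in millions of tokens) where API spend equals the pod cost.
break_even_mtok_per_day = pod_cost_per_day / API_PER_MTOK

# Sanity check: can one pod actually serve that volume?
capacity_mtok_per_day = POD_TOKENS_PER_SEC * 86_400 / 1e6

print(f"pod cost/day:        ${pod_cost_per_day:.2f}")
print(f"break-even volume:   {break_even_mtok_per_day:.0f} Mtok/day")
print(f"single-pod capacity: {capacity_mtok_per_day:.0f} Mtok/day")
# With these assumptions: ~$36/day pod, ~72 Mtok/day break-even,
# ~130 Mtok/day capacity. The break-even point drops fast if the hosted
# alternative is priced closer to $1-2/Mtok.
```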
2. Pods vs. Serverless: the core trade-off
Most GPU-rental providers expose two product shapes; picking the right one is the first decision.
Pods (long-running VM/container)
- You rent the GPU by the minute/hour. It's yours until you stop it.
- Pros: zero cold start, full SSH/root, persistent disk; ideal for batch jobs, loading the model once, and sustained traffic.
- Cons: you pay 100% of the time even when idle. Scaling is manual or you script it.
- Use when: traffic is steady, batch processing, dev/eval, fine-tuning.
Serverless / per-request GPU
- The provider keeps a pool of warm workers; you pay per request-second.
- Pros: scales to zero, no idle cost, autoscale on bursty traffic.
- Cons: cold starts can be 30s–2min for a 14B model (weights load from disk), per-second pricing has a markup over raw pod pricing, less control over the runtime.
- Use when: bursty / unpredictable load, long-tail or low-QPS endpoints, demos.
Rule of thumb: if expected utilization is above ~30–40%, pods are cheaper. Below that, serverless wins despite the per-second markup.
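A minimal sketch of where that threshold comes from, assuming serverless is priced at roughly 2.5–3× the equivalent pod rate per active second (the markup is an assumption — check your provider's actual price sheet):

```python
# Pod vs. serverless break-even: a pod bills 24/7, serverless bills only for
# active request-seconds but at a markup over the raw pod rate.

POD_PER_HOUR = 2.0          # $/hr for the pod (assumed)
SERVERLESS_MARKUP = 2.8     # serverless $/active-hour ≈ markup * pod rate (assumed)

def monthly_cost(utilization: float, hours: float = 730) -> tuple[float, float]:
    """Return (pod_cost, serverless_cost) for a given fraction of busy time."""
    pod = POD_PER_HOUR * hours                                   # pays for idle too
    serverless = POD_PER_HOUR * SERVERLESS_MARKUP * hours * utilization
    return pod, serverless

# Break-even utilization is simply 1 / markup.
print(f"break-even utilization: {1 / SERVERLESS_MARKUP:.0%}")    # ~36%
for u in (0.1, 0.36, 0.7):
    pod, sls = monthly_cost(u)
    print(f"util {u:>4.0%}: pod ${pod:>7.0f}  serverless ${sls:>7.0f}")
```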
3. Serving stack: vLLM, SGLang, TGI, TensorRT-LLM
The rented GPU is just a VM — you still need an inference engine. See ai-inference-engines for the full comparison; quick guide:
- vLLM — default choice. PagedAttention, continuous batching, broad model coverage, good Python ergonomics, OpenAI-compatible server out of the box. Start here unless you have a reason not to.
- SGLang — competitive throughput, particularly strong on structured generation (JSON / regex constraints) and RadixAttention prefix caching. Good for agentic workloads with shared system prompts.
- TGI (Hugging Face) — solid, slightly behind on latest model support and throughput tricks. Easy if you live in HF ecosystem.
- TensorRT-LLM — fastest on NVIDIA hardware if you're willing to compile per-model engines. Heavy ops cost; pick it only when every last bit of latency matters.
Quantization (FP8 on H100, INT8/AWQ/GPTQ on A100) typically gives 1.5–2× throughput with negligible quality loss on 7B–70B models. Almost always worth turning on for production.
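Once a vLLM server is running on the pod, it exposes an OpenAI-compatible API (port 8000 by default), so any OpenAI client can point at it. A minimal sketch — the pod IP, model name, and AWQ checkpoint are placeholders:

```python
# Query a vLLM OpenAI-compatible server running on a rented pod.
# Assumes the server was started on the pod with something like:
#   vllm serve Qwen/Qwen2.5-14B-Instruct-AWQ --quantization awq   (illustrative)
from openai import OpenAI

client = OpenAI(
    base_url="http://<pod-ip>:8000/v1",  # placeholder: your pod's public IP/port
    api_key="EMPTY",                     # vLLM doesn't check the key by default
)

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-14B-Instruct-AWQ",  # must match the model the server loaded
    messages=[{"role": "user", "content": "Summarize PagedAttention in one line."}],
    max_tokens=128,
    temperature=0.2,
)
print(resp.choices[0].message.content)
```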
4. A100 vs. H100: how to choose
| | A100 80GB | H100 80GB |
|---|---|---|
| Price (rental, public) | ~$1–2/hr | ~$2.5–4/hr |
| FP16 throughput | 1× baseline | ~2–3× |
| FP8 support | No (BF16/FP16/INT8 only) | Yes (native) |
| Best for | 7–32B FP16, 70B AWQ/GPTQ | 70B+ FP8, long context, low latency |
Pick A100 when: model fits in 80GB at INT8/AWQ, you're throughput-bound not latency-bound, budget matters more than P99. Most 7–32B production inference is fine here.
Pick H100 when: running 70B+ at FP8 where the extra throughput pays for the price gap, sequence length is long (higher HBM bandwidth and larger on-chip SRAM help attention), or you need the lowest single-request latency (chatbot UX).
For multi-GPU: H100 NVLink scales noticeably better than A100 NVLink for tensor-parallel 70B+ workloads.
Newer parts (H200, B200) are starting to show up on rental marketplaces — generally only worth it for very long context or very large models where memory bandwidth is the bottleneck.
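A rough way to sanity-check the "fits in 80GB" criterion before picking a card: weights plus KV cache, ignoring activations and framework overhead. The model shape numbers below are illustrative assumptions, not tied to a specific checkpoint:

```python
# Rough VRAM estimate: weights + KV cache. Ignores activations, CUDA graphs,
# and framework overhead, so leave ~10-20% headroom on top of this.

def vram_gb(params_b: float, bytes_per_weight: float,
            n_layers: int, n_kv_heads: int, head_dim: int,
            ctx_len: int, batch: int, kv_bytes: int = 2) -> float:
    weights = params_b * 1e9 * bytes_per_weight
    # KV cache: 2 (K and V) * layers * kv_heads * head_dim * context * batch
    kv = 2 * n_layers * n_kv_heads * head_dim * ctx_len * batch * kv_bytes
    return (weights + kv) / 1e9

# Illustrative 32B-class model with GQA (shape numbers are assumptions):
est = vram_gb(params_b=32, bytes_per_weight=1,   # INT8 weights
              n_layers=64, n_kv_heads=8, head_dim=128,
              ctx_len=4096, batch=16)
print(f"~{est:.0f} GB")  # ≈ 49 GB -> fits on one 80GB card with headroom
```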
5. Cost-of-burst-inference economics
The trap: people compare hosted-API price-per-token to peak GPU throughput price-per-token and conclude rented GPU is 10× cheaper. That's only true at sustained high utilization. Real cost is:
effective $/Mtok = (gpu_$_per_hour) / (avg_tokens_per_sec * utilization * 3600 / 1e6)
A pod at 10% utilization costs 10× more per token than the same pod at 100%. For bursty workloads, serverless or hosted APIs almost always win.
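The same formula in code, showing how quickly the effective rate degrades with utilization (hourly rate and throughput are assumed values):

```python
# effective $/Mtok = gpu_$/hr / (tokens_per_sec * utilization * 3600 / 1e6)

GPU_PER_HOUR = 1.5      # assumed A100 rental rate
TOKENS_PER_SEC = 1500   # assumed sustained aggregate throughput

def effective_cost_per_mtok(utilization: float) -> float:
    mtok_per_hour = TOKENS_PER_SEC * utilization * 3600 / 1e6
    return GPU_PER_HOUR / mtok_per_hour

for u in (1.0, 0.5, 0.1):
    print(f"utilization {u:>4.0%}: ${effective_cost_per_mtok(u):.2f}/Mtok")
# 100% -> ~$0.28/Mtok, 50% -> ~$0.56, 10% -> ~$2.78 -- already hosted-API territory.
```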
Pattern that works: baseline traffic on a long-running pod (covers steady-state), burst overflow routed to a hosted API or serverless tier. This is similar to the spot+on-demand pattern in classic cloud — keep utilization high on the cheap tier, eat the markup only on the marginal burst.
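A minimal sketch of that overflow pattern: try the self-hosted pod first, fall back to a hosted endpoint when the pod is saturated or unreachable. The URLs, the timeout-based saturation heuristic, and the thresholds are all assumptions to adapt:

```python
# Baseline-on-pod, burst-on-API routing. Both endpoints speak the
# OpenAI-compatible chat API; URLs and thresholds are placeholders.
import httpx

POD_URL = "http://<pod-ip>:8000/v1/chat/completions"          # self-hosted vLLM (placeholder)
HOSTED_URL = "https://openrouter.ai/api/v1/chat/completions"  # overflow tier
# Connect/write fast or fall back; allow long generations once accepted.
POD_TIMEOUT = httpx.Timeout(5.0, read=120.0)

def route_completion(payload: dict, hosted_api_key: str) -> dict:
    try:
        r = httpx.post(POD_URL, json=payload, timeout=POD_TIMEOUT)
        r.raise_for_status()
        return r.json()
    except httpx.HTTPError:
        # Pod busy, down, or erroring: eat the markup on the marginal burst.
        r = httpx.post(
            HOSTED_URL,
            json=payload,
            headers={"Authorization": f"Bearer {hosted_api_key}"},
            timeout=120.0,
        )
        r.raise_for_status()
        return r.json()
```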
6. Provider comparison
Public marketplaces and dedicated GPU clouds, ordered roughly from cheap-and-DIY to managed/enterprise:
- runpod — popular for individuals and small teams. Pods + Serverless products, community templates, lots of GPU types, decent UX. Spot pricing available. Good default for single-developer experiments.
- vast-ai — peer-to-peer marketplace. Cheapest absolute prices but highest variance in reliability, network, and host quality. Good for batch / non-prod.
- lambda-labs — focused on ML, simple pricing, good reputation. Less product surface than RunPod (mostly straight VMs / clusters).
- coreweave — enterprise tier. Reserved capacity, large clusters, higher floor on commitment. Where serious training and large-scale serving end up.
- nebius — newer Europe-anchored GPU cloud, aggressive pricing on H100/H200 reserved capacity.
- together-ai — managed inference as an API, but worth listing here because for many "I want open weights served fast" cases their per-token endpoint is cheaper than self-hosting a low-utilization pod.
Selection heuristic:
- Single developer, experimenting, < $500/mo spend → RunPod or Vast.ai.
- Production workload, want SSH but not enterprise contracts → RunPod (pods + serverless), Lambda.
- Multi-million-token-per-day with ops team → CoreWeave / Nebius reserved capacity.
- "I just want the model behind an HTTP endpoint" → Together.ai or openrouter, not rented GPU.
7. Operational gotchas
- Image cold start dominates serverless cost for >7B models. Bake weights into the image or use a network-mounted weight cache; don't pull from HF on every cold start (see the sketch after this list).
- Spot / interruptible tiers are 50–70% cheaper but you need checkpoint/restart logic. For inference (stateless), this is usually worth it; for fine-tuning, only if your training framework checkpoints frequently.
- Egress is often unmetered on these GPU clouds (unlike AWS/GCP), but verify — large model downloads or video output can surprise you.
- GPU availability is bursty. H100s frequently sell out on consumer-facing providers; reserve capacity if you need guaranteed access for a launch.
- Watch the sandbox/network policy. Some providers block outbound on certain ports, which breaks model downloads or wandb logging.
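For the first gotcha, a minimal sketch of pre-fetching weights at image build time with huggingface_hub, so a cold worker never pulls from the Hub (the repo id and cache path are placeholders):

```python
# Run at image build time (e.g. from the Dockerfile) so weights are baked into
# the image or a mounted volume instead of being pulled on every cold start.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen2.5-14B-Instruct-AWQ",   # placeholder model
    local_dir="/models/qwen2.5-14b-awq",       # path the serving engine loads from
)
# Point the server at the local path at startup so it never touches the network:
#   vllm serve /models/qwen2.5-14b-awq         (illustrative)
```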
8. When this becomes the wrong abstraction
Rented GPU at the pod level still leaves you owning serving, autoscaling, observability, model updates, and incident response. If those costs exceed what you save vs. a managed API, you're paying ops tax for no reason. The honest question to ask quarterly: if I deleted my self-hosted stack and routed everything to openrouter / together-ai / direct provider APIs, would I save engineering time worth more than the inference price gap?
For many teams under ~10M tokens/day, the answer is yes.