Product
AWS Inferentia 2
The inference-optimized half of Amazon's custom-silicon stack — the cheap-token engine under Bedrock, paired with aws-trainium for training.
1. Core Product / Service
Inferentia is AWS's inference accelerator family, also designed by Annapurna Labs. Active SKU:
- Inferentia 2 (Inf2 instances) — 2 NeuronCores-v2 per chip, 32 GB HBM, ~190 TFLOPS BF16; 3× higher compute, 4× larger memory, 4× higher throughput, 10× lower latency vs Inferentia 1 [2]
- Up to 40% better $/perf vs comparable EC2 GPU instances on supported workloads [1]
- Designed for transformer inference, embedding generation, real-time recommendation
- Inferentia 3 roadmap not formally announced; Anthropic-driven inference workloads now run primarily on Trainium 2/3, which have matured into a unified inference-plus-training role, suggesting the Inferentia and Trainium lines may converge.
Software: same Neuron SDK as aws-trainium. Supports PyTorch, TensorFlow, and Hugging Face Transformers, with a required ahead-of-time model-compilation step (minimal sketch below).
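A minimal sketch of that compilation step, assuming an Inf2 instance with the torch-neuronx package installed; the model choice and fixed sequence length are illustrative, not a Bedrock-confirmed configuration:

```python
# Sketch: ahead-of-time compilation for Inferentia 2 via the Neuron SDK's
# PyTorch front end. Assumes an Inf2 instance with torch-neuronx installed;
# the model and the 128-token padded shape are illustrative choices.
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", torchscript=True).eval()

# Neuron compiles for fixed input shapes, so trace with a padded example.
enc = tokenizer("warm-up text", return_tensors="pt",
                padding="max_length", max_length=128)
example = (enc["input_ids"], enc["attention_mask"])

neuron_model = torch_neuronx.trace(model, example)   # compiles for NeuronCores
torch.jit.save(neuron_model, "model_neuron.pt")      # reload with torch.jit.load
```

Compiled artifacts are shape-specific, so serving stacks commonly compile one artifact per bucket of sequence lengths.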
Distribution: AWS only, exposed both as Inf2 instances and as the substrate beneath higher-level services.
2. Target Users & Pain Points
- AWS Bedrock backbone. Bedrock's hosted-model serving (Anthropic Claude, Llama, Mistral, Titan, etc.) runs on a mix of Inferentia 2, Trainium, and NVIDIA — Inferentia is the cheap-token tier for AWS-managed first-party serving.
- High-volume inference workloads on AWS — recommendation systems, search ranking, real-time NLP, embedding services
- Cost-sensitive inference at scale — customers running 100M+ tokens/day where the 40% $/perf delta dwarfs the porting cost (back-of-envelope sketch after this list)
- Pain solved: per-token cost on stable, well-supported model architectures
- Pain not solved: novel model architectures (custom kernels, exotic attention) — these still need NVIDIA flexibility
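A back-of-envelope version of that break-even, with every input an illustrative assumption (the GPU baseline rate and the porting cost are not AWS figures; only the ~40% delta comes from [1]):

```python
# Back-of-envelope: when does a one-time Neuron port pay for itself?
# All inputs below are illustrative assumptions, not quoted AWS prices.
tokens_per_day = 100e6                 # the 100M tokens/day case above
gpu_cost_per_1k_tokens = 0.01          # assumed GPU-baseline serving cost, $
savings_fraction = 0.40                # the ~40% $/perf delta claimed in [1]
porting_cost = 150_000                 # assumed 1-2 quarters of eng time, $

daily_gpu_spend = tokens_per_day / 1000 * gpu_cost_per_1k_tokens
daily_savings = daily_gpu_spend * savings_fraction
breakeven_days = porting_cost / daily_savings

print(f"daily GPU spend:  ${daily_gpu_spend:,.0f}")   # $1,000
print(f"daily savings:    ${daily_savings:,.0f}")     # $400
print(f"break-even after: {breakeven_days:,.0f} days")  # ~375 days
```

Under these toy numbers break-even lands near one year; at higher token volumes it shortens proportionally, which is why the math closes fastest for Amazon's own first-party services.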
3. Competitive Landscape
| Inference accelerator | Cost/perf claim | Distribution | Software |
|---|---|---|---|
| AWS Inferentia 2 | 40% better $/perf vs EC2 GPU [1] | AWS only | Neuron SDK |
| aws-trainium 2 (used for inference) | ~50% of H100 instance price [AWS] | AWS only | Neuron SDK |
| google-tpu v6e (Trillium) | 2.1× perf/$ vs v5e | GCP only | XLA |
| microsoft-maia 200 | Cost-targeted, not disclosed | Azure | MS toolchain |
| nvidia B200 / H200 | Reference baseline | Everywhere | CUDA + TensorRT-LLM |
| cerebras CS-3 in Bedrock | ~5× throughput, ~80% cost reduction (Cerebras claim) [4] | AWS Bedrock | Cerebras stack |
4. Unique Observations
- Bedrock backbone is the real product, not the chip. Inferentia 2 succeeds because it's invisible — customers buy "Bedrock" and AWS routes to whichever silicon serves the workload most cheaply (toy routing sketch after this list). The end customer never has to learn the Neuron SDK. This is the inverse of how google-tpu and microsoft-maia are sold (or rather, not sold).
- 40% better $/perf is the wedge. For an inference workload in the GPT-3.5/Claude-Haiku class, switching from H100 to Inferentia 2 + Neuron SDK is a one-time 1–2 quarter eng cost, then ongoing savings of ~40%. For Amazon's own first-party services, the math always closes; for third-party Bedrock customers, the savings flow as lower Bedrock pricing they don't know they're getting [1].
- Inference economics in the broader market. Bedrock pricing in 2026 spans roughly $100/mo (light use) to $5,000+/mo (with Agents/KBs/high-throughput) [4]. Batch inference is offered at 50% off. Cross-region inference adds zero surcharge. These knobs are only viable because AWS owns the inference substrate cost.
- Cerebras-in-Bedrock is the strategic precedent. AWS lets a third-party silicon vendor (cerebras) into Bedrock, claiming ~5× throughput and ~80% cost reduction for supported workloads [4]. Inferentia 2 isn't optimal for every model class — particularly very large MoE or exotic architectures — and AWS is willing to backstop with non-NVIDIA, non-Inferentia silicon to maintain the cost-per-token lead.
- The convergence question. Trainium 2 has demonstrated competence at inference (it's serving Claude); Inferentia's distinct identity is increasingly thin. The likely 2027 outcome: a unified Annapurna AI accelerator line where "training" and "inference" are configuration choices, mirroring how google-tpu split (then partially recombined) v5e/v5p.
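A toy illustration of the routing idea in the first bullet above; the backends, prices, latencies, and selection rule are entirely hypothetical and not AWS's actual scheduler:

```python
# Toy model of cost-based silicon routing behind a managed endpoint.
# Backends, prices, latencies, and the rule itself are hypothetical.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    cost_per_1k_tokens: float   # $, hypothetical
    p99_latency_ms: float       # hypothetical
    supported: bool             # does the compiled model run here?

BACKENDS = [
    Backend("inferentia2", 0.006, 120, True),
    Backend("trainium2",   0.007,  90, True),
    Backend("nvidia-h100", 0.010,  70, True),
]

def route(latency_slo_ms: float) -> Backend:
    """Pick the cheapest backend whose p99 latency meets the SLO."""
    eligible = [b for b in BACKENDS
                if b.supported and b.p99_latency_ms <= latency_slo_ms]
    if not eligible:
        raise RuntimeError("no backend meets the SLO")
    return min(eligible, key=lambda b: b.cost_per_1k_tokens)

print(route(latency_slo_ms=100).name)  # -> trainium2 under these toy numbers
```

The point of the sketch: as long as the routing layer owns the cost/latency tradeoff, any backend (Inferentia, Trainium, Cerebras) can be slotted in without the customer noticing.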
5. Financials / Funding
- Parent: Amazon (NASDAQ: AMZN)
- Inferentia revenue: not separately disclosed; embedded in AWS segment
- AWS Bedrock pricing: $100/mo–$5,000+/mo customer range [4]; ~50% discount for batch jobs
- Inf2 instance pricing (on-demand, us-east-1): ranges from ~$0.76/hour for inf2.xlarge (1 chip) to ~$12.98/hour for inf2.48xlarge (12 chips)
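Quick derived arithmetic on those on-demand figures (chip counts per AWS instance specs; treat the outputs as sanity checks, not quotes):

```python
# Per-chip and monthly cost derived from the on-demand prices listed above.
# Chip counts: inf2.xlarge = 1 Inferentia2 chip, inf2.48xlarge = 12.
prices = {"inf2.xlarge": (0.76, 1), "inf2.48xlarge": (12.98, 12)}

for instance, (hourly, chips) in prices.items():
    per_chip = hourly / chips
    monthly = hourly * 730          # ~730 hours/month
    print(f"{instance}: ${per_chip:.2f}/chip-hr, ~${monthly:,.0f}/mo on-demand")
```

The per-chip rate rises on the larger instance (~$0.76 vs ~$1.08/chip-hr), which presumably reflects its larger host CPU/memory share rather than a premium on the chips themselves.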
6. People & Relationships
- Engineering origin: Annapurna Labs (Israel) — also designs Trainium and Graviton
- AWS CEO: Matt Garman
- Bedrock leadership: Atul Deo (GM, Bedrock)
- Foundry: tsmc
- HBM: SK hynix, Samsung
- Customers (Inf2 / Bedrock substrate): Bedrock managed-model serving (Anthropic Claude family, Meta Llama, Mistral, AI21, Cohere, Stability, Amazon Titan); direct Inf2 customers include Sprinklr, Money Forward, ByteDance (limited), Adobe (legacy)
- Sister product: aws-trainium
- Adjacent (in Bedrock): cerebras CS-3 binding term sheet
- Direct competitors: google-tpu, microsoft-maia, nvidia H200/B200, amd MI355X
Last compiled: 2026-05-10