Product
AWS Inferentia 2
The inference-optimized half of Amazon's custom-silicon stack — the cheap-token engine under Bedrock, paired with aws-trainium for training.
1. Core Product / Service
Inferentia is AWS's inference accelerator family, also designed by Annapurna Labs. Active SKU:
- Inferentia 2 (Inf2 instances) — 2 NeuronCores-v2 per chip, 32 GB HBM, ~190 TFLOPS BF16; 3× higher compute, 4× larger memory, 4× higher throughput, 10× lower latency vs Inferentia 1 [2]
- Up to 40% better $/perf vs comparable EC2 GPU instances on supported workloads [1]
- Designed for transformer inference, embedding generation, real-time recommendation
- Inferentia 3 roadmap not formally announced; Anthropic-driven inference workloads now run primarily on Trainium 2/3, which have matured into a unified inference-plus-training role, suggesting the Inferentia and Trainium lines may converge.
Software: same Neuron SDK as aws-trainium. Supports PyTorch, TensorFlow, and Hugging Face Transformers, with a required ahead-of-time model-compilation step (minimal sketch below).
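A minimal sketch of that compilation step, assuming an Inf2 instance with the torch-neuronx package installed; the model choice and fixed sequence length are illustrative, not a Bedrock-confirmed configuration:

```python
# Sketch: ahead-of-time compilation for Inferentia 2 via the Neuron SDK's
# PyTorch front end. Assumes an Inf2 instance with torch-neuronx installed;
# the model and the 128-token padded shape are illustrative choices.
import torch
import torch_neuronx
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", torchscript=True).eval()

# Neuron compiles for fixed input shapes, so trace with a padded example.
enc = tokenizer("warm-up text", return_tensors="pt",
                padding="max_length", max_length=128)
example = (enc["input_ids"], enc["attention_mask"])

neuron_model = torch_neuronx.trace(model, example)   # compiles for NeuronCores
torch.jit.save(neuron_model, "model_neuron.pt")      # reload with torch.jit.load
```

Compiled artifacts are shape-specific, so serving stacks commonly compile one artifact per bucket of sequence lengths.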
Distribution: AWS only, exposed both as Inf2 instances and as the substrate beneath higher-level services.
2. Target Users & Pain Points
- AWS Bedrock backbone. Bedrock's hosted-model serving (Anthropic Claude, Llama, Mistral, Titan, etc.) runs on a mix of Inferentia 2, Trainium, and NVIDIA — Inferentia is the cheap-token tier for AWS-managed first-party serving.
- High-volume inference workloads on AWS — recommendation systems, search ranking, real-time NLP, embedding services
- Cost-sensitive inference at scale — customers running 100M+ tokens/day where the 40% $/perf delta dwarfs the porting cost (back-of-envelope sketch after this list)
- Pain solved: per-token cost on stable, well-supported model architectures
- Pain not solved: novel model architectures (custom kernels, exotic attention) — these still need NVIDIA flexibility
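A back-of-envelope version of that break-even, with every input an illustrative assumption (the GPU baseline rate and the porting cost are not AWS figures; only the ~40% delta comes from [1]):

```python
# Back-of-envelope: when does a one-time Neuron port pay for itself?
# All inputs below are illustrative assumptions, not quoted AWS prices.
tokens_per_day = 100e6                 # the 100M tokens/day case above
gpu_cost_per_1k_tokens = 0.01          # assumed GPU-baseline serving cost, $
savings_fraction = 0.40                # the ~40% $/perf delta claimed in [1]
porting_cost = 150_000                 # assumed 1-2 quarters of eng time, $

daily_gpu_spend = tokens_per_day / 1000 * gpu_cost_per_1k_tokens
daily_savings = daily_gpu_spend * savings_fraction
breakeven_days = porting_cost / daily_savings

print(f"daily GPU spend:  ${daily_gpu_spend:,.0f}")   # $1,000
print(f"daily savings:    ${daily_savings:,.0f}")     # $400
print(f"break-even after: {breakeven_days:,.0f} days")  # ~375 days
```

Under these toy numbers break-even lands near one year; at higher token volumes it shortens proportionally, which is why the math closes fastest for Amazon's own first-party services.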
3. Competitive Landscape
| Inference accelerator | Cost/perf claim | Distribution | Software |
|---|---|---|---|
| AWS Inferentia 2 | 40% better $/perf vs EC2 GPU [1] | AWS only | Neuron SDK |
| aws-trainium 2 (used for inference) | ~50% of H100 instance price [AWS] | AWS only | Neuron SDK |
| google-tpu v6e (Trillium) | 2.1× perf/$ vs v5e | GCP only | XLA |
| microsoft-maia 200 | Cost-targeted, not disclosed | Azure | MS toolchain |
| nvidia B200 / H200 | Reference baseline | Everywhere | CUDA + TensorRT-LLM |
| cerebras CS-3 in Bedrock | ~5× throughput, ~80% cost reduction (Cerebras claim) [4] | AWS Bedrock | Cerebras stack |
4. Unique Observations
- Bedrock backbone is the real product, not the chip. Inferentia 2 succeeds because it's invisible — customers buy "Bedrock" and AWS routes to whichever silicon serves the workload most cheaply (toy routing sketch after this list). The end customer never has to learn the Neuron SDK. This is the inverse of how google-tpu and microsoft-maia are sold (or rather, not sold).
- 40% better $/perf is the wedge. For an inference workload in the GPT-3.5/Claude-Haiku class, switching from H100 to Inferentia 2 + Neuron SDK is a one-time 1–2 quarter eng cost, then ongoing savings of ~40%. For Amazon's own first-party services, the math always closes; for third-party Bedrock customers, the savings flow as lower Bedrock pricing they don't know they're getting [1].
- Inference economics in the broader market. Bedrock pricing in 2026 spans roughly $100/mo (light use) to $5,000+/mo (with Agents/KBs/high-throughput) [4]. Batch inference is offered at 50% off. Cross-region inference adds zero surcharge. These knobs are only viable because AWS owns the inference substrate cost.
- Cerebras-in-Bedrock is the strategic precedent. AWS lets a third-party silicon vendor (cerebras) into Bedrock, claiming ~5× throughput and ~80% cost reduction for supported workloads [4]. Inferentia 2 isn't optimal for every model class — particularly very large MoE or exotic architectures — and AWS is willing to backstop with non-NVIDIA, non-Inferentia silicon to maintain the cost-per-token lead.
- The convergence question. Trainium 2 has demonstrated competence at inference (it's serving Claude); Inferentia's distinct identity is increasingly thin. The likely 2027 outcome: a unified Annapurna AI accelerator line where "training" and "inference" are configuration choices, mirroring how google-tpu split (then partially recombined) v5e/v5p.
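A toy illustration of the routing idea in the first bullet above; the backends, prices, latencies, and selection rule are entirely hypothetical and not AWS's actual scheduler:

```python
# Toy model of cost-based silicon routing behind a managed endpoint.
# Backends, prices, latencies, and the rule itself are hypothetical.
from dataclasses import dataclass

@dataclass
class Backend:
    name: str
    cost_per_1k_tokens: float   # $, hypothetical
    p99_latency_ms: float       # hypothetical
    supported: bool             # does the compiled model run here?

BACKENDS = [
    Backend("inferentia2", 0.006, 120, True),
    Backend("trainium2",   0.007,  90, True),
    Backend("nvidia-h100", 0.010,  70, True),
]

def route(latency_slo_ms: float) -> Backend:
    """Pick the cheapest backend whose p99 latency meets the SLO."""
    eligible = [b for b in BACKENDS
                if b.supported and b.p99_latency_ms <= latency_slo_ms]
    if not eligible:
        raise RuntimeError("no backend meets the SLO")
    return min(eligible, key=lambda b: b.cost_per_1k_tokens)

print(route(latency_slo_ms=100).name)  # -> trainium2 under these toy numbers
```

The point of the sketch: as long as the routing layer owns the cost/latency tradeoff, any backend (Inferentia, Trainium, Cerebras) can be slotted in without the customer noticing.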
5. Financials / Funding
- Parent: Amazon (NASDAQ: AMZN)
- Inferentia revenue: not separately disclosed; embedded in AWS segment
- AWS Bedrock pricing: $100/mo–$5,000+/mo customer range [4]; ~50% discount for batch jobs
- Inf2 instance pricing (on-demand, us-east-1): ranges from ~$0.76/hour for inf2.xlarge (1 chip) to ~$12.98/hour for inf2.48xlarge (12 chips)
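Quick derived arithmetic on those on-demand figures (chip counts per AWS instance specs; treat the outputs as sanity checks, not quotes):

```python
# Per-chip and monthly cost derived from the on-demand prices listed above.
# Chip counts: inf2.xlarge = 1 Inferentia2 chip, inf2.48xlarge = 12.
prices = {"inf2.xlarge": (0.76, 1), "inf2.48xlarge": (12.98, 12)}

for instance, (hourly, chips) in prices.items():
    per_chip = hourly / chips
    monthly = hourly * 730          # ~730 hours/month
    print(f"{instance}: ${per_chip:.2f}/chip-hr, ~${monthly:,.0f}/mo on-demand")
```

The per-chip rate rises on the larger instance (~$0.76 vs ~$1.08/chip-hr), which presumably reflects its larger host CPU/memory share rather than a premium on the chips themselves.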
6. People & Relationships
- Engineering origin: Annapurna Labs (Israel) — also designs Trainium and Graviton
- AWS CEO: Matt Garman
- Bedrock leadership: Atul Deo (GM, Bedrock)
- Foundry: tsmc
- HBM: SK hynix, Samsung
- Customers (Inf2 / Bedrock substrate): Bedrock managed-model serving (Anthropic Claude family, Meta Llama, Mistral, AI21, Cohere, Stability, Amazon Titan); direct Inf2 customers include Sprinklr, Money Forward, ByteDance (limited), Adobe (legacy)
- Sister product: aws-trainium
- Adjacent (in Bedrock): cerebras CS-3 binding term sheet
- Direct competitors: google-tpu, microsoft-maia, nvidia H200/B200, amd MI355X
Last compiled: 2026-05-10