Product

AWS Inferentia 2

The inference-optimized half of Amazon's custom-silicon stack — the cheap-token engine under Bedrock, paired with aws-trainium for training.

1. Core Product / Service

Inferentia is AWS's inference accelerator family, also designed by Annapurna Labs. Active SKU:

  • Inferentia 2 (Inf2 instances) — 2 NeuronCore-v2 cores per chip, 32 GB HBM, ~190 TFLOPS BF16; 3× higher compute, 4× larger memory, 4× higher throughput, and up to 10× lower latency vs Inferentia 1 [2]
  • Up to 40% better $/perf vs comparable EC2 GPU instances on supported workloads [1]
  • Designed for transformer inference, embedding generation, real-time recommendation
  • Inferentia 3 roadmap not formally announced; Anthropic-driven inference workloads now run primarily on Trainium 2/3 (which have matured into a unified inference+training role), suggesting the Inferentia and Trainium lines may converge.

Software: same Neuron SDK as aws-trainium. Supports PyTorch, TensorFlow, and Hugging Face Transformers; an ahead-of-time model-compilation step is required.

Distribution: AWS only, exposed both as Inf2 instances and as the substrate beneath higher-level services.

2. Target Users & Pain Points

  • AWS Bedrock backbone. Bedrock's hosted-model serving (Anthropic Claude, Llama, Mistral, Titan, etc.) runs on a mix of Inferentia 2, Trainium, and NVIDIA — Inferentia is the cheap-token tier for AWS-managed first-party serving.
  • High-volume inference workloads on AWS — recommendation systems, search ranking, real-time NLP, embedding services
  • Cost-sensitive inference at scale — customers running 100M+ tokens/day where the 40% $/perf delta dwarfs the porting cost
  • Pain solved: per-token cost on stable, well-supported model architectures
  • Pain not solved: novel model architectures (custom kernels, exotic attention) — these still need NVIDIA flexibility
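The 100M-tokens/day threshold above can be sketched numerically. All per-token prices below are hypothetical placeholders, not AWS list prices; only the ~40% discount figure comes from the claim in [1]:

```python
# Illustrative savings arithmetic for the 40% $/perf claim.
# gpu_cost_per_mtok is a placeholder price, not an AWS rate.

def daily_savings(tokens_per_day: int, gpu_cost_per_mtok: float,
                  inferentia_discount: float = 0.40) -> float:
    """Daily savings from moving a workload to Inferentia 2 at a given discount."""
    gpu_daily = tokens_per_day / 1e6 * gpu_cost_per_mtok
    return gpu_daily * inferentia_discount

# Example: 100M tokens/day at a hypothetical $0.50 per 1M tokens on GPU.
savings = daily_savings(100_000_000, gpu_cost_per_mtok=0.50)
print(f"${savings:.2f}/day")  # with these placeholders: $20.00/day
```

At that scale the delta compounds to thousands of dollars per quarter, which is what makes a one-time porting effort worth pricing out.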

3. Competitive Landscape

| Inference accelerator | $/token claim | Distribution | Software |
|---|---|---|---|
| AWS Inferentia 2 | 40% better $/perf vs EC2 GPU [1] | AWS only | Neuron SDK |
| aws-trainium 2 (used for inference) | ~50% of H100 instance price [AWS] | AWS only | Neuron SDK |
| google-tpu v6e (Trillium) | 2.1× perf/$ vs v5e | GCP only | XLA |
| microsoft-maia 200 | Cost-targeted, not disclosed | Azure | MS toolchain |
| nvidia B200 / H200 | Reference baseline | Everywhere | CUDA + TensorRT-LLM |
| cerebras CS-3 in Bedrock | ~5× throughput, ~80% cost reduction (Cerebras claim) [4] | AWS Bedrock | Cerebras stack |

4. Unique Observations

  • Bedrock backbone is the real product, not the chip. Inferentia 2 succeeds because it's invisible — customers buy "Bedrock" and AWS routes to whichever silicon serves the workload most cheaply. End-customer never has to learn Neuron SDK. This is the inverse of how google-tpu and microsoft-maia are sold (or rather, not sold).
  • 40% better $/perf is the wedge. For an inference workload in the GPT-3.5/Claude-Haiku class, switching from H100 to Inferentia 2 + Neuron SDK is a one-time engineering cost of roughly 1–2 quarters, followed by ongoing savings of ~40%. For Amazon's own first-party services the math always closes; for third-party Bedrock customers, the savings surface as lower Bedrock pricing they never see itemized [1].
  • Inference economics in the broader market. Bedrock pricing in 2026 spans roughly $100/mo (light use) to $5,000+/mo (with Agents/KBs/high-throughput) [4]. Batch inference is offered at 50% off. Cross-region inference adds zero surcharge. These knobs are only viable because AWS owns the inference substrate cost.
  • Cerebras-in-Bedrock is the strategic precedent. AWS lets a third-party silicon vendor (cerebras) into Bedrock, claiming ~5× throughput and ~80% cost reduction for supported workloads [4]. Inferentia 2 isn't optimal for every model class — particularly very large MoE or exotic architectures — and AWS is willing to backstop with non-NVIDIA, non-Inferentia silicon to maintain the cost-per-token lead.
  • The convergence question. Trainium 2 has demonstrated competence at inference (it's serving Claude); Inferentia's distinct identity is increasingly thin. The likely 2027 outcome: a unified Annapurna AI accelerator line where "training" and "inference" are configuration choices, mirroring how google-tpu split (then partially recombined) v5e/v5p.
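The wedge argument above reduces to a payback-period calculation. The engineering cost and monthly GPU spend below are illustrative assumptions, not sourced figures; only the ~40% discount is from [1]:

```python
# Payback sketch for the "one-time eng cost, then ~40% ongoing savings" wedge.
# monthly_gpu_spend and eng_cost are hypothetical inputs.

def payback_months(monthly_gpu_spend: float, eng_cost: float,
                   discount: float = 0.40) -> float:
    """Months until cumulative Inferentia savings cover the one-time porting cost."""
    monthly_savings = monthly_gpu_spend * discount
    return eng_cost / monthly_savings

# Example: $200k/mo GPU inference bill, $300k porting effort
# (roughly 1-2 quarters of a small team, per the bullet above).
print(round(payback_months(200_000, 300_000), 2))  # 3.75 months
```

Under these assumptions the port pays for itself in about a quarter and a half, which is why the math "always closes" for workloads that are large and architecturally stable.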

5. Financials / Funding

  • Parent: Amazon (NASDAQ: AMZN)
  • Inferentia revenue: not separately disclosed; embedded in AWS segment
  • AWS Bedrock pricing: $100/mo–$5,000+/mo customer range [4]; ~50% discount for batch jobs
  • Inf2 instance pricing (on-demand, us-east-1): starts ~$0.76/hour for inf2.xlarge up to ~$12.98/hour for inf2.48xlarge (12 chips)
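A rough monthly-cost sketch from the on-demand hourly rates quoted above. The ~730 hours/month figure and the rounded rates are assumptions for illustration:

```python
# Monthly on-demand cost from the approximate us-east-1 hourly rates above.
HOURS_PER_MONTH = 730  # assumption: average month, instance running 24/7

def monthly_cost(hourly_rate: float) -> float:
    return hourly_rate * HOURS_PER_MONTH

def per_chip_hourly(hourly_rate: float, chips: int) -> float:
    return hourly_rate / chips

print(f"inf2.xlarge:   ${monthly_cost(0.76):,.0f}/mo")                    # ~$555/mo
print(f"inf2.48xlarge: ${monthly_cost(12.98):,.0f}/mo "
      f"(${per_chip_hourly(12.98, 12):.2f}/chip-hr)")                     # ~$9,475/mo, ~$1.08/chip-hr
```

The per-chip-hour figure is the number to compare against per-GPU-hour pricing when sizing a migration.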

6. People & Relationships

  • Engineering origin: Annapurna Labs (Israel) — also designs Trainium and Graviton
  • AWS CEO: Matt Garman
  • Bedrock leadership: Atul Deo (GM, Bedrock)
  • Foundry: tsmc
  • HBM: SK hynix, Samsung
  • Customers (Inf2 / Bedrock substrate): Bedrock managed-model serving (Anthropic Claude family, Meta Llama, Mistral, AI21, Cohere, Stability, Amazon Titan); direct Inf2 customers include Sprinklr, Money Forward, ByteDance (limited), Adobe (legacy)
  • Sister product: aws-trainium
  • Adjacent (in Bedrock): cerebras CS-3 binding term sheet
  • Direct competitors: google-tpu, microsoft-maia, nvidia H200/B200, amd MI355X
Last compiled: 2026-05-10