AWS Trainium 2

Amazon's training-grade ASIC — Anthropic's substrate-of-record, the first hyperscaler captive silicon to hit gigawatt scale on a single anchor customer.

1. Core Product / Service

Trainium is AWS's custom training accelerator, designed by Annapurna Labs (acquired by Amazon in 2015) and fabbed at TSMC. Generations as of 2026:

  • Trainium 1 — production since 2022, smaller deployments
  • Trainium 2 — current flagship; ~1.3 PFLOPS dense FP8 / ~5 PFLOPS sparse FP8 per chip; 96 GB HBM3; deployed in Trn2 UltraServers (64 chips) and Trn2 instances (16 chips). The substrate of Project Rainier (~500K Trainium 2 chips) [3]
  • Trainium 3 — next generation, ~2× Trainium 2 performance, ramping through 2026; bundled into Anthropic's $100B commitment alongside Trainium 2
  • Trainium 4 — announced; targets 6× FP4 performance and a 4× generational uplift in memory bandwidth [2]

Software: AWS Neuron SDK + Neuron Compiler. Supports PyTorch and JAX; the ecosystem is narrower than CUDA's but maturing quickly given Anthropic's co-investment. A minimal sketch of the PyTorch flow follows below.
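
To make the developer experience concrete, here is a minimal sketch of what targeting Trainium looks like from PyTorch, assuming the torch-neuronx package from the Neuron SDK (which routes PyTorch through torch-xla); the toy model and shapes are illustrative only:

```python
# Minimal PyTorch-on-Trainium sketch (illustrative). Assumes torch-neuronx,
# the Neuron SDK package that routes PyTorch through torch-xla's XLA backend.
import torch
import torch_xla.core.xla_model as xm  # ships with torch-neuronx

device = xm.xla_device()  # resolves to a NeuronCore on a Trn2 instance

model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(64, 1024).to(device)
y = torch.randn(64, 1024).to(device)

for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # Marks the XLA step boundary so the Neuron compiler can trace,
    # compile, and cache the training graph.
    xm.optimizer_step(optimizer)
```

The code itself is stock PyTorch; the performance question is what the Neuron compiler does with the traced graph, which is where the gap with the CUDA ecosystem shows up in practice.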

Distribution: AWS only, never sold as discrete silicon.

2. Target Users & Pain Points

Effectively a one-customer story plus AWS captive use:

  • Anthropic — primary customer. The April 2026 Amazon-Anthropic expansion commits Anthropic to >$100B AWS spend over 10 years and up to 5 GW of Trainium capacity to train and serve Claude [2]. Project Rainier alone is ~500K Trn2 chips; nearly 1 GW combined Trn2+Trn3 online by end-2026 [3][4].
  • Amazon-internal AI — Bedrock model fine-tuning, Alexa, Amazon retail recommendation, AGI Labs work
  • Some AWS customers (Databricks, Ricoh, Datadog, Typhoon AI) use Trn2 instances for training, but volumes are modest compared to Anthropic [1]
  • Pain solved: roughly half the cost of comparable NVIDIA H100 instances at competitive performance [6]; guaranteed allocation in an NVIDIA-constrained market (back-of-envelope arithmetic after this list)
  • Pain not solved: software portability; a smaller library/kernel ecosystem than CUDA's
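
A back-of-envelope version of the cost case, using only figures cited in this brief. The ~500K-chips-per-gigawatt density is a loose inference from Project Rainier (~500K chips against ~1 GW of Trn2+Trn3), and applying the same density to a B200 fleet is our simplification, not an AWS figure:

```python
# Back-of-envelope silicon cost per gigawatt, using only figures from this brief.
# ASSUMPTION: ~500K chips per GW, loosely inferred from Project Rainier
# (~500K Trn2 chips against ~1 GW of Trn2+Trn3); applying the same density
# to a B200 fleet is a simplification, since per-chip power differs.
CHIPS_PER_GW = 500_000

trn2_low, trn2_high = 5_000, 8_000  # $ per Trn2 chip, all-in estimate (section 4)
b200 = 40_000                       # $ per B200 chip, industry estimate (section 4)

def dollars_b_per_gw(price_per_chip: int) -> float:
    """Fleet silicon cost in $B for one gigawatt of capacity."""
    return CHIPS_PER_GW * price_per_chip / 1e9

print(f"Trn2 silicon per GW: ${dollars_b_per_gw(trn2_low):.1f}B to ${dollars_b_per_gw(trn2_high):.1f}B")
print(f"B200 silicon per GW: ${dollars_b_per_gw(b200):.0f}B (same density assumed)")
# Silicon only: excludes networking, power, facilities, and the Neuron porting
# tax, which is why the effective ~$20B/GW figure discussed in section 4 sits
# far above the $2.5B-$4.0B raw chip cost computed here.
```

Silicon-only cost is the narrowest lens; the ~$20B/GW effective figure and the software-portability tax discussed in section 4 sit on top of it.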

3. Competitive Landscape

Chip | Anchor customer | Cost claim | Distribution
AWS Trainium 2 | Anthropic | ~50% of comparable NVIDIA H100 [6] | AWS only
Google TPU Trillium | Google + Anthropic | 2.1–2.5× perf/$ vs prior gen [Google] | GCP only
Microsoft Maia 200 | Microsoft + OpenAI | Not disclosed | Azure (limited)
NVIDIA B200 | Everyone | Reference | All clouds + direct
AWS Inferentia 2 | AWS Bedrock | 40% better $/perf vs comparable EC2 | AWS only

4. Unique Observations

  • The deep Anthropic partnership is the entire story. Project Rainier (~500K Trn2 chips, fully operational), >1M Trn2 chips serving Claude inference, and a 10-year / $100B / 5 GW forward commitment make Anthropic effectively a co-developer of the chip [3][4][2]. The coupling resembles Microsoft↔OpenAI circa 2023, but on captive silicon rather than NVIDIA. Amazon's investment in Anthropic now totals up to ~$33B (existing $8B + new $5B + up to $20B more on milestones) [2].
  • "Half the price of NVIDIA" cost claim — Trn2 instances cost roughly 50% of comparable H100 instances per AWS marketing [6]; Anthropic-disclosed unit economics imply ~$20B per gigawatt of effective silicon cost — well below an equivalent NVIDIA buildout. The claim is plausibly true on a captive cost basis but masks the cost of the Neuron-SDK porting effort and the lower software efficiency.
  • Captive cost vs market alternative. AWS does not publish Trn2 silicon COGS. Industry estimates put it well below NVIDIA ASP — a Trn2 chip has been sized at $5–8K all-in vs $40K for B200. The full TCO picture must include the software-portability tax: if Anthropic ever wants to leave Trainium, it carries a multi-quarter retraining cost.
  • Anthropic multi-homing. Anthropic also signed the largest TPU deal in Google's history in November 2025, meaning Claude runs across NVIDIA, Trainium, and TPU silicon. It is the first frontier lab to credibly run a single model on three distinct accelerator architectures, validating that kernel-level portability is achievable at frontier scale (see the JAX sketch after this list).
  • The strategic risk: AWS's whole pitch to Anthropic depends on Anthropic continuing to be the primary captive customer. If Anthropic's training spend grows faster than Trainium's roadmap can deliver, the deal pushes Anthropic back toward NVIDIA — which is exactly what the GCP TPU side-deal hedges against.
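
What multi-homing looks like mechanically: one JAX program lowers through XLA/PJRT to whichever backend is installed. A minimal sketch; the Trainium path assumes the jax-neuronx PJRT plugin from the Neuron SDK, and the toy training step is illustrative:

```python
# Device-agnostic JAX sketch: the same jitted program lowers through XLA/PJRT
# to GPU, TPU, or (assuming the jax-neuronx PJRT plugin) Trainium.
import jax
import jax.numpy as jnp

@jax.jit
def train_step(w, x, y, lr=1e-3):
    # One SGD step on a toy linear model; XLA compiles this identically
    # regardless of which accelerator backend it lands on.
    loss_fn = lambda w: jnp.mean((x @ w - y) ** 2)
    return w - lr * jax.grad(loss_fn)(w)

print(jax.devices())  # e.g. CUDA, TPU, or Neuron devices, depending on host

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
w = jax.random.normal(k1, (1024, 1024))
x = jax.random.normal(k2, (64, 1024))
y = jax.random.normal(k3, (64, 1024))
w = train_step(w, x, y)
```

Compiler-level portability of this kind is what makes triple-homing feasible; the residual cost is the hand-tuned kernels and collectives that still have to be re-optimized per architecture, per the portability tax noted above.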

5. Financials / Funding

  • Parent: Amazon (NASDAQ: AMZN)
  • Trainium revenue: not disclosed separately; embedded in AWS infrastructure segment
  • AWS Q1 2026 revenue: ~$30B; high-margin, growing at low-20s% YoY
  • Anthropic commitment: >$100B over 10 years, up to 5 GW of Trainium capacity [2]
  • Project Rainier status: ~500K Trn2 chips, fully operational [3][4]; Trn2+Trn3 nearly 1 GW online by end-2026 [4]
  • Amazon → Anthropic equity: cumulative up to ~$33B post-2026 expansion [2]

6. People & Relationships

  • Engineering origin: Annapurna Labs (acquired in 2015 for ~$370M) — also designed the AWS Graviton CPU and AWS Inferentia
  • AWS CEO: Matt Garman
  • Compute / silicon: Dave Brown (VP EC2), Gadi Hutt (Annapurna)
  • Foundry: TSMC
  • HBM: SK hynix, Samsung
  • Anchor customer: Anthropic (5 GW commitment) [2]
  • Other public customers: Databricks, Ricoh, Datadog, Typhoon AI, Lyft
  • Captive consumers: Bedrock, Alexa, Amazon retail
  • Sister product: AWS Inferentia (inference-optimized variant)
  • Direct competitors: NVIDIA, Google TPU, Microsoft Maia, AMD
Last compiled: 2026-05-10