Company

Groq

Groq builds its own LPU (Language Processing Unit) hardware to push inference latency and throughput past the limits of the GPU paradigm, making it the strongest speed-narrative third-party (3P) inference player.

1. Core Product / Service

Groq is one of the few 3P inference companies vertically integrated all the way down to the chip layer:

  • LPU chip (GroqChip): a deterministic tensor stream processor without the GPU's SIMT scheduling. Each chip has small SRAM capacity but extremely high bandwidth; model weights must be sliced across many LPUs. Upside: near-zero stall in token-by-token decoding. Downside: larger models require more LPUs.
  • GroqCloud: serverless API, sold per token. The model menu includes Llama 3.1 / 3.3 (8B/70B), the DeepSeek R1 distill, Mixtral, Whisper, Qwen2.5, and other mainstream open-source models.
  • GroqRack / On-Prem: packages LPU systems for sale to sovereign clouds, governments, Aramco (strategic partnership 2024), and other large customers.
  • Speed selling point: Llama 3.1 70B at ~250 tok/s output, Llama 3.1 8B at ~750 tok/s. Llama 3.3 70B on GroqCloud has been measured by third parties multiple times at ~250-330 tok/s [4][5], far exceeding the ~50-80 tok/s typical of GPU-based systems; the "500+ tok/s" figure usually refers to 8B-class small-model scenarios. A minimal latency probe follows this list.
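For a concrete feel of the speed claim, here is a minimal sketch that measures time-to-first-token and streaming throughput against GroqCloud. It assumes the OpenAI-compatible endpoint at api.groq.com/openai/v1 and a `llama-3.3-70b-versatile` model id (verify both against the current docs), and it approximates one streamed chunk as one token, which is only roughly true.

```python
# Minimal latency probe against GroqCloud's OpenAI-compatible endpoint.
# Assumptions: base URL, model id, and chunk ~= token (rough tok/s only).
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # GroqCloud's OpenAI-compatible API
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model id; check the current menu
    messages=[{"role": "user", "content": "Explain LPU inference in 150 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1  # ~1 token per content chunk (approximation)
end = time.perf_counter()

print(f"time to first token: {(first - start) * 1e3:.0f} ms")
print(f"~{chunks / (end - first):.0f} tok/s after first token")
```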

2. Target Users & Pain Points

  • Latency-sensitive applications: voice agents, real-time translation, long-chain agent workflows (chained multi-LLM calls) — token/s is the key perceptual variable.
  • Enterprise / sovereign deployments: selling LPU racks to Saudi Arabia, Canada sovereign clouds, sidestepping NVIDIA export controls and GPU procurement queues.
  • Pain point: GPU inference at batch=1 (low concurrency) is memory-bandwidth-bound, because every generated token must stream the full model weights from HBM; Groq's SRAM design bypasses this bottleneck, so its single-stream speed is a structural property of the architecture, not just engineering optimization (see the back-of-envelope sketch below).
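A rough way to see why batch=1 decode is structurally bandwidth-bound: each generated token reads all weights once, so tokens/sec is capped at memory bandwidth divided by weight bytes. The numbers below are rounded public specs used purely for illustration, not measurements.

```python
# Single-stream decode ceiling for a bandwidth-bound decoder:
#   tok/s <= memory_bandwidth / bytes_of_weights
# Illustrative, rounded specs only.

def decode_ceiling(params_billions: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """Upper bound on batch=1 tokens/sec when decode is weight-bandwidth-bound."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bw_tb_s * 1e12 / weight_bytes

# One H100 (~3.35 TB/s HBM3) serving a 70B model in FP16:
print(f"H100, 70B fp16 ceiling: ~{decode_ceiling(70, 2, 3.35):.0f} tok/s")  # ~24

# Groq shards the weights across many LPUs so each token is read from
# on-chip SRAM (Groq quotes ~80 TB/s per chip) rather than off-chip HBM.
```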

3. Competitive Landscape

| Competitor | Positioning | Vs. Groq |
| --- | --- | --- |
| cerebras | Wafer-scale WSE-3 inference | Also non-GPU; Cerebras single-wafer performance is more extreme but its ecosystem is narrower; Groq commercialized earlier and more broadly |
| NVIDIA | H100 + vLLM, the mainstream GPU inference route | NVIDIA has overwhelming generality + software ecosystem; Groq is a niche speed extreme |
| fireworks-ai / together-ai | Software optimization on GPUs | Groq wins on speed / latency; Fireworks/Together have broader model menus + more flexible pricing |
| SambaNova | RDU, in-house reconfigurable dataflow | Also non-GPU; SambaNova focuses on dedicated enterprise deployments; Groq favors public cloud + token API |
| AWS Trainium / Inferentia | Cloud-provider in-house inference chips | AWS has distribution; Groq's speed is more aggressive |

Differentiation: speed narrative + in-house chip + government/sovereign-cloud sales. The only 3P players doing all three simultaneously are Groq, Cerebras, and SambaNova.

4. Unique Observations

  • Per-token pricing (GroqCloud, 2026-05): Llama 3.1 8B ~$0.05/M (blended input + output); Llama 3.3 70B ~$0.59/M input + $0.79/M output; DeepSeek R1 distill 70B ~$0.75/M; Whisper Large v3 ~$0.111 per hour of audio [1]. The "$1.50/M" figure sometimes quoted roughly corresponds to 70B-class blended pricing, on par with fireworks-ai / together-ai.
  • Price gap vs first-party (1P) models: Llama 3.3 70B at ~$0.70/M blended vs GPT-4o at ~$10/M blended is a ~14× gap; vs Claude Haiku at ~$1.6/M, a ~2× gap. Groq's pitch is the same price or cheaper, but 5-10× faster (arithmetic sketched after this list).
  • Inference engine: completely in-house and closed source (built on a proprietary LPU compiler stack); Groq does not use vLLM or SGLang. Model onboarding cycles are long, which is why Groq's model menu is significantly smaller than Fireworks/Together's (~20 models vs 100+).
  • Capital model: Groq is a fabless chip company (tapes out at GlobalFoundries 14nm, not dependent on TSMC 5nm); new chip design cycles are slow, with a generational gap vs NVIDIA's H100→B200 cadence. LPU generational catch-up is a real risk.
  • Take rate / cost: the in-house chip and rack give Groq a cost structure incomparable to GPU-based players: it avoids NVIDIA's ~60% margin tax but must amortize chip R&D, tape-outs, and rack deployment. How chip depreciation plus data-center colo converts into per-token cost is not disclosed (a placeholder amortization template follows this list).
  • Capacity ramp: Groq announced plans in 2024 to deploy 1M LPUs; actual progress has been constrained by tape-out capacity. GroqCloud has hit severe rate limits multiple times, a sign that capacity is the current growth bottleneck.
  • Saudi Aramco strategy: a 2024 agreement to deploy LPU data centers, the key GTM event translating Groq's "speed narrative" into "sovereign AI sales".
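To make the price-gap bullet reproducible, here is a minimal sketch of the blending arithmetic. The per-token prices are this section's own figures; the 50:50 input:output mix is an assumption chosen to match the ~$0.70/M blended number above, not a published ratio.

```python
# Blended-price and price-gap arithmetic for the bullets above.
# Prices are $/1M tokens from this section; the token mix is assumed.

def blended(input_price: float, output_price: float, input_share: float = 0.5) -> float:
    """Blend input/output $/M prices at a given input-token share."""
    return input_share * input_price + (1 - input_share) * output_price

groq_70b = blended(0.59, 0.79)  # ~$0.69/M, matching the ~$0.70/M figure above
print(f"Groq Llama 3.3 70B blended: ~${groq_70b:.2f}/M")
print(f"vs GPT-4o ~$10/M:  ~{10 / groq_70b:.0f}x gap")   # ~14x
print(f"vs Haiku ~$1.6/M:  ~{1.6 / groq_70b:.1f}x gap")  # ~2.3x
```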
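And since the per-token cost structure is undisclosed, the following is only a template for how one would estimate it; every input below is a made-up placeholder, not a Groq figure.

```python
# Hypothetical per-token cost model: amortized rack capex plus opex,
# divided by tokens actually served. All numbers below are placeholders.

def cost_per_m_tokens(capex_usd: float, life_years: float,
                      opex_usd_per_year: float,
                      aggregate_toks_per_s: float,
                      utilization: float) -> float:
    """Amortized $/1M tokens for one deployment."""
    yearly_cost = capex_usd / life_years + opex_usd_per_year
    yearly_tokens = aggregate_toks_per_s * utilization * 365 * 24 * 3600
    return yearly_cost / yearly_tokens * 1e6

# e.g. a $1.5M rack cluster, 4-year life, $200k/yr colo + power,
# 30,000 aggregate tok/s, 60% average utilization -> ~$1.01/M tokens
print(f"~${cost_per_m_tokens(1.5e6, 4, 2e5, 30_000, 0.6):.2f}/M tokens")
```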

5. Financials / Funding

| Round | Date | Amount | Valuation | Lead |
| --- | --- | --- | --- | --- |
| Series C | 2021 | $300M | $1B+ post | Tiger Global |
| Series D | 2024-08 | $640M | $2.8B post | BlackRock [2] |
| Series E (reported) | 2025-08 | ~$750M | ~$6.9B | Disruptive (per Bloomberg) [3] |
  • Founded: 2016, by Jonathan Ross (former early Google TPU engineer).
  • Total funding: ~$2B+ estimated (including strategic + sovereign customer prepayments).
  • Customers: Aramco (Saudi strategic deployment); GroqCloud self-reports ~2M registered developers.

6. People & Relationships

  • Founder / CEO: Jonathan Ross — early engineer on Google's TPU project; TPU team alumni are the foundation of Groq's engineering culture.
  • Chief Architect: Dennis Abts (former Google TPU colleague).
  • Investors: BlackRock, Tiger Global, Cisco Investments, Type One Ventures, Samsung Catalyst, KDDI, D1 Capital, Disruptive (reported), Saudi Aramco (strategic), Lee Fixel.
  • Partners: Aramco (sovereign cloud), Meta (early Llama adaptation partner), DeepSeek (simultaneous launch of the R1 distill).
  • Competes with: cerebras, SambaNova, NVIDIA, fireworks-ai, together-ai.

Sources

Last compiled: 2026-05-10