Company

Groq

Groq builds its own LPU (Language Processing Unit) hardware to push inference latency and throughput past the limits of the GPU paradigm, making it the strongest speed-narrative third-party (3P) inference player.

1. Core Product / Service

Groq is one of the few 3P inference companies vertically integrated all the way down to the chip layer:

  • LPU chip (GroqChip): a deterministic tensor stream processor without the GPU's SIMT scheduling. Each chip has small SRAM capacity but extremely high bandwidth; model weights must be sliced across many LPUs. Upside: near-zero stall in token-by-token decoding. Downside: larger models require more LPUs.
  • GroqCloud: serverless API, sold per token. The model menu includes Llama 3.1 / 3.3 (8B/70B), the DeepSeek R1 distill, Mixtral, Whisper, Qwen2.5, and other mainstream open-source models.
  • GroqRack / On-Prem: packages LPU systems for sale to sovereign clouds, governments, Aramco (strategic partnership 2024), and other large customers.
  • Speed selling point: Llama 3.1 70B at ~250 tok/s output, Llama 3.1 8B at ~750 tok/s. Llama 3.3 70B on GroqCloud has been measured by third parties multiple times at ~250-330 tok/s [4][5], far exceeding the ~50-80 tok/s typical of GPU-based systems; the "500+ tok/s" figure usually refers to 8B-class small-model scenarios. A minimal latency probe follows this list.
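For a concrete feel of the speed claim, here is a minimal sketch that measures time-to-first-token and streaming throughput against GroqCloud. It assumes the OpenAI-compatible endpoint at api.groq.com/openai/v1 and a `llama-3.3-70b-versatile` model id (verify both against the current docs), and it approximates one streamed chunk as one token, which is only roughly true.

```python
# Minimal latency probe against GroqCloud's OpenAI-compatible endpoint.
# Assumptions: base URL, model id, and chunk ~= token (rough tok/s only).
import os
import time

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # GroqCloud's OpenAI-compatible API
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # assumed model id; check the current menu
    messages=[{"role": "user", "content": "Explain LPU inference in 150 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()
        chunks += 1  # ~1 token per content chunk (approximation)
end = time.perf_counter()

print(f"time to first token: {(first - start) * 1e3:.0f} ms")
print(f"~{chunks / (end - first):.0f} tok/s after first token")
```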

2. Target Users & Pain Points

  • Latency-sensitive applications: voice agents, real-time translation, long-chain agent workflows (chained multi-LLM calls) — token/s is the key perceptual variable.
  • Enterprise / sovereign deployments: selling LPU racks to Saudi Arabia, Canada sovereign clouds, sidestepping NVIDIA export controls and GPU procurement queues.
  • Pain point: GPU inference at batch=1 (low concurrency) is memory-bandwidth-bound, because every generated token must stream the full model weights from HBM; Groq's SRAM design bypasses this bottleneck, so its single-stream speed is a structural property of the architecture, not just engineering optimization (see the back-of-envelope sketch below).
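A rough way to see why batch=1 decode is structurally bandwidth-bound: each generated token reads all weights once, so tokens/sec is capped at memory bandwidth divided by weight bytes. The numbers below are rounded public specs used purely for illustration, not measurements.

```python
# Single-stream decode ceiling for a bandwidth-bound decoder:
#   tok/s <= memory_bandwidth / bytes_of_weights
# Illustrative, rounded specs only.

def decode_ceiling(params_billions: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """Upper bound on batch=1 tokens/sec when decode is weight-bandwidth-bound."""
    weight_bytes = params_billions * 1e9 * bytes_per_param
    return bw_tb_s * 1e12 / weight_bytes

# One H100 (~3.35 TB/s HBM3) serving a 70B model in FP16:
print(f"H100, 70B fp16 ceiling: ~{decode_ceiling(70, 2, 3.35):.0f} tok/s")  # ~24

# Groq shards the weights across many LPUs so each token is read from
# on-chip SRAM (Groq quotes ~80 TB/s per chip) rather than off-chip HBM.
```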

3. Competitive Landscape

| Competitor | Positioning | Vs. Groq |
| --- | --- | --- |
| cerebras | Wafer-scale WSE-3 inference | Also non-GPU; Cerebras single-wafer performance is more extreme but its ecosystem is narrower; Groq commercialized earlier and more broadly |
| NVIDIA | H100 + vLLM, the mainstream GPU inference route | NVIDIA has overwhelming generality + software ecosystem; Groq is a niche speed extreme |
| fireworks-ai / together-ai | Software optimization on GPUs | Groq wins on speed / latency; Fireworks/Together have broader model menus + more flexible pricing |
| SambaNova | RDU, in-house reconfigurable dataflow | Also non-GPU; SambaNova focuses on dedicated enterprise deployments; Groq favors public cloud + token API |
| AWS Trainium / Inferentia | Cloud-provider in-house inference chips | AWS has distribution; Groq's speed is more aggressive |

Differentiation: speed narrative + in-house chip + government/sovereign-cloud sales. The only 3P players doing all three simultaneously are Groq, Cerebras, and SambaNova.

4. Unique Observations

  • Per-token pricing (GroqCloud, 2026-05): Llama 3.1 8B ~$0.05/M (blended input + output); Llama 3.3 70B ~$0.59/M input + $0.79/M output; DeepSeek R1 distill 70B ~$0.75/M; Whisper Large v3 ~$0.111 per hour of audio [1]. The "$1.50/M" figure sometimes quoted roughly corresponds to 70B-class blended pricing, on par with fireworks-ai / together-ai.
  • Price gap vs first-party (1P) models: Llama 3.3 70B at ~$0.70/M blended vs GPT-4o at ~$10/M blended is a ~14× gap; vs Claude Haiku at ~$1.6/M, a ~2× gap. Groq's pitch is the same price or cheaper, but 5-10× faster (arithmetic sketched after this list).
  • Inference engine: completely in-house and closed source (built on a proprietary LPU compiler stack); Groq does not use vLLM or SGLang. Model onboarding cycles are long, which is why Groq's model menu is significantly smaller than Fireworks/Together's (~20 models vs 100+).
  • Capital model: Groq is a fabless chip company (tapes out at GlobalFoundries 14nm, not dependent on TSMC 5nm); new chip design cycles are slow, with a generational gap vs NVIDIA's H100→B200 cadence. LPU generational catch-up is a real risk.
  • Take rate / cost: the in-house chip and rack give Groq a cost structure incomparable to GPU-based players: it avoids NVIDIA's ~60% margin tax but must amortize chip R&D, tape-outs, and rack deployment. How chip depreciation plus data-center colo converts into per-token cost is not disclosed (a placeholder amortization template follows this list).
  • Capacity ramp: Groq announced plans in 2024 to deploy 1M LPUs; actual progress has been constrained by tape-out capacity. GroqCloud has hit severe rate limits multiple times, a sign that capacity is the current growth bottleneck.
  • Saudi Aramco strategy: a 2024 agreement to deploy LPU data centers, the key GTM event translating Groq's "speed narrative" into "sovereign AI sales".
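To make the price-gap bullet reproducible, here is a minimal sketch of the blending arithmetic. The per-token prices are this section's own figures; the 50:50 input:output mix is an assumption chosen to match the ~$0.70/M blended number above, not a published ratio.

```python
# Blended-price and price-gap arithmetic for the bullets above.
# Prices are $/1M tokens from this section; the token mix is assumed.

def blended(input_price: float, output_price: float, input_share: float = 0.5) -> float:
    """Blend input/output $/M prices at a given input-token share."""
    return input_share * input_price + (1 - input_share) * output_price

groq_70b = blended(0.59, 0.79)  # ~$0.69/M, matching the ~$0.70/M figure above
print(f"Groq Llama 3.3 70B blended: ~${groq_70b:.2f}/M")
print(f"vs GPT-4o ~$10/M:  ~{10 / groq_70b:.0f}x gap")   # ~14x
print(f"vs Haiku ~$1.6/M:  ~{1.6 / groq_70b:.1f}x gap")  # ~2.3x
```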
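And since the per-token cost structure is undisclosed, the following is only a template for how one would estimate it; every input below is a made-up placeholder, not a Groq figure.

```python
# Hypothetical per-token cost model: amortized rack capex plus opex,
# divided by tokens actually served. All numbers below are placeholders.

def cost_per_m_tokens(capex_usd: float, life_years: float,
                      opex_usd_per_year: float,
                      aggregate_toks_per_s: float,
                      utilization: float) -> float:
    """Amortized $/1M tokens for one deployment."""
    yearly_cost = capex_usd / life_years + opex_usd_per_year
    yearly_tokens = aggregate_toks_per_s * utilization * 365 * 24 * 3600
    return yearly_cost / yearly_tokens * 1e6

# e.g. a $1.5M rack cluster, 4-year life, $200k/yr colo + power,
# 30,000 aggregate tok/s, 60% average utilization -> ~$1.01/M tokens
print(f"~${cost_per_m_tokens(1.5e6, 4, 2e5, 30_000, 0.6):.2f}/M tokens")
```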

5. Financials / Funding

| Round | Date | Amount | Valuation | Lead |
| --- | --- | --- | --- | --- |
| Series C | 2021 | $300M | $1B+ post | Tiger Global |
| Series D | 2024-08 | $640M | $2.8B post | BlackRock [2] |
| Series E (reported) | 2025-08 | ~$750M | ~$6.9B | Disruptive (per Bloomberg) [3] |
  • Founded: 2016, by Jonathan Ross (former early Google TPU engineer).
  • Total funding: ~$2B+ estimated (including strategic + sovereign customer prepayments).
  • Customers: Aramco (Saudi strategic deployment); GroqCloud self-reports ~2M registered developers.

6. People & Relationships

  • Founder / CEO: Jonathan Ross — early engineer on Google's TPU project; TPU team alumni are the foundation of Groq's engineering culture.
  • Chief Architect: Dennis Abts (former Google TPU colleague).
  • Investors: BlackRock, Tiger Global, Cisco Investments, Type One Ventures, Samsung Catalyst, KDDI, D1 Capital, Disruptive (reported), Saudi Aramco (strategic), Lee Fixel.
  • Partners: Aramco (sovereign cloud), Meta (early Llama adaptation partner), DeepSeek (simultaneous launch of the R1 distill).
  • Competes with: cerebras, SambaNova, NVIDIA, fireworks-ai, together-ai.

Sources

Last compiled: 2026-05-10