Company

Together AI

AI-native cloud for open-source model inference, fine-tuning, and dedicated NVIDIA GPU clusters — built around the Together Kernel Collection from FlashAttention author Tri Dao.

1. 核心产品 / 服务

Three product lines on a single platform:

  • Together Inference — serverless API for 200+ open-source LLMs (Llama, Qwen, Mistral, DeepSeek, etc.), token-priced. Claims #1 output speed on demanding open models, with up to 2x faster serverless inference via FlashAttention-4 kernels, fused MoE kernels, and FP8/FP4 quantization that's "effectively lossless" [1].
  • Together Fine-Tuning — full fine-tunes plus LoRA on open weights, customer keeps the weights. Targeted at production workloads, not just experimentation.
  • Together GPU Clusters — two tiers: Instant GPU Clusters (self-service, up to 64 NVIDIA Hopper GPUs, spin up in minutes via console, GA'd Sept 2025) and Dedicated GPU Clusters (64–1,000 GPUs, custom-configured, supports Skypilot/Terraform IaC). Expanding Blackwell (B200/GB200) deployments announced at GTC 2025 [2].

Core technical moat is the Together Inference Engine (proprietary, closed-source) built under Chief Scientist Tri Dao (FlashAttention creator). It is built on FlashAttention-3/4 kernels plus custom speculative decoding and quantization, and claims ~4x decode throughput vs open-source vLLM [5]. The strategic nuance: FlashAttention itself is open-source and an industry-standard kernel used even by competitors (the same families of kernels show up in open engines like vLLM/SGLang) — so Together's model is "open-source foundation + closed-source monetization," contrasting with Fireworks' fully closed FireAttention. The broader kernel work ships as the Together Kernel Collection — hardware-aware kernels for attention, MoE routing, and low-bit quantization.

2. 服务对象 & 痛点

  • Open-source-first AI startups & enterprises that don't want to be locked into closed API providers (OpenAI/Anthropic) and want Llama/Qwen/DeepSeek-class quality.
  • Pain points solved: hosting open models in-house requires GPU procurement, kernel tuning, autoscaling, eval — Together absorbs that. Pricing positioned ~11x cheaper than GPT-4 for Llama-3 equivalents [3].
  • Burst training / fine-tuning: customers who need 64–1000 GPUs for a few weeks and don't want to commit to AWS/Azure reserved capacity.

3. 竞争格局

Competitor Positioning Vs. Together
Fireworks AI Token-priced inference, FireAttention engine Direct competitor; Fireworks claims lower latency on some workloads, Together has broader model catalog + GPU cluster tier
Anyscale Ray-native, RayTurbo, enterprise governance Anyscale is more "infrastructure framework", Together is more "API product"
Modal Commodity GPU host, per-second billing Modal = bare-metal control for devs; Together = managed inference + training stack
runpod Per-minute GPU rental, broad accelerator menu RunPod is raw GPU; Together adds inference engine + fine-tuning UX
lambda-labs Training-optimized GPU cloud Lambda is more training/research focused, Together covers full inference→training loop
coreweave Hyperscale GPU IaaS, NVIDIA-aligned CoreWeave is wholesale GPU capacity (often Together's underlying supplier-class peer); Together sits a layer up as managed AI platform
openrouter Aggregator/router across providers OpenRouter routes traffic to Together (and others); they're complements more than competitors

Differentiation: Tri Dao's kernel work + the only player offering serverless tokens and dedicated 1000-GPU clusters under one console.

Positioning — "AI-native cloud" straddling neocloud + inference API. Together sits between pure inference-API vendors (Fireworks) and pure GPU "neoclouds" (coreweave, nebius, lambda-labs). Per Sacra/Contrary estimates, per-token API/inference is only ~30–40% of revenue (analyst estimate); the larger ~60–70% is GPU server/cluster rental (training, fine-tuning, dedicated serving) [6]. That makes Together the most token-leaning neocloud — more inference-API exposure than coreweave/nebius, but more raw-GPU-rental revenue than a pure API like Fireworks. See ai-inference-engines for engine-layer context.

4. 独特观察

  • The Tri Dao hire is the load-bearing piece of the technical story — FlashAttention is foundational to every modern inference stack, so "we ship the kernels first" is a credible moat narrative vs. Fireworks/Anyscale.
  • Bet that open-source models stay competitive enough that enterprises want a neutral host. If frontier closed models (GPT-5, Claude 5) keep stretching the lead and open models stagnate, Together's TAM compresses. Continued strength of deepseek / Llama / Qwen is existential.
  • Strategic positioning between coreweave (wholesale GPUs) and OpenAI/Anthropic (closed APIs) — Together is the "Snowflake of open AI inference" pitch.
  • Self-service Instant Clusters (Sept 2025) is a meaningful product expansion — moves Together from "API vendor" toward "Vercel-for-GPUs" UX. See ai-inference-engines and gpu-kernel-optimization for the technical context.
  • Heavy NVIDIA partnership (NVIDIA is on the cap table, Blackwell early access at GTC 2025) — fortunes are correlated with NVIDIA roadmap.

5. 财务 / 融资

  • Founded: June 2022.
  • Series B (Feb 2025): $305M led by General Catalyst, co-led by Prosperity7. Valuation $3.3B, up >160% from the $1.25B post Salesforce-led $106M round in March 2024 [4].
  • New raise (Mar 2026, reported): in talks to raise ~$1B at a ~$7.5B valuation [7] — roughly 2.3x the $3.3B Series B mark of just over a year prior. (Supersedes the earlier "seeking ~$1B follow-on per DCD 2025" report, which appears to be the same financing track at a now-firmer valuation.)
  • Total raised: ~$534M as of Feb 2025 (pre the reported 2026 round).
  • Revenue (2026): ~$1B annualized as of Feb 2026, up from ~$618M at end of 2025 [6] — i.e. roughly doubling inside the prior ~12 months.
  • Revenue mix (analyst estimate, Sacra/Contrary): ~30–40% per-token API/inference, ~60–70% GPU server/cluster rental [6].
  • Investors: General Catalyst, Prosperity7, Salesforce Ventures, NVIDIA, Kleiner Perkins, Coatue, Lux Capital, Greycroft, Emergence, March Capital, SK Telecom, John Chambers, Scott Banister, DAMAC Capital.

6. People & Relationships

  • Founders: Vipul Ved Prakash (CEO — serial founder, prior exits in search/data infra), Ce Zhang, Chris Ré, Percy Liang, and Tri Dao — founded Together June 2022. A notably academic-heavy founding bench (Ré/Liang Stanford, Dao FlashAttention).
  • Chief Scientist: Tri Dao — FlashAttention author, Princeton CS, technical credibility anchor.
  • Lead investors: General Catalyst, Prosperity7, NVIDIA (strategic).
  • Cooperates with: openrouter (as upstream provider), NVIDIA (Blackwell early access).
  • Competes with: Fireworks AI, Anyscale, Modal, runpod, lambda-labs (overlapping zones).
  • Hosts models from: deepseek, Meta (Llama), Alibaba (Qwen), Mistral.

Sources

Last compiled: 2026-06-28