Stage 3
Why does an 80 GB GPU run out of room before it runs out of compute?
Builds on Stage 2 (decode bottleneck). Bring the bandwidth-equation mental model with you.
An H100 SXM has 80 GB of VRAM. A 70B model’s FP8 weights alone take about 70 GB; add runtime activation
overhead and you’re left with only a few gigabytes of headroom — and that headroom is the entire concurrent-user
budget. Every long-context conversation in flight costs gigabytes of it for its KV cache.
01 The capacity cliff is real, sharp, and superlinear. §
Hold the model fixed (Llama-3-70B FP8). Hold the GPU fixed (H100 SXM, 80 GB). Hold the precision fixed (KV in FP16). Vary only context length. Read off how many concurrent users the box can host.
same model · same GPU · only context length changed · capacity drops faster than linearly
The drop from 2K to 8K is roughly 4×, which is what you’d expect from “context grew 4×”. From 8K to 32K is another 4×. From 32K to 128K is another 4×. None of that is gentle. The same hardware that comfortably hosts dozens of short-context users hosts a single-digit number at long context. Most teams discover this only when production traffic shifts from chat-style 2K exchanges to RAG-style 32K context windows and users start receiving 503s.
Concurrent capacity scales as roughly 1/context. Same model, same GPU; only the conversation length changed. The cliff is a property of the math, not of any engine choice.
02 What the KV cache actually is. §
Attention works token-by-token. Each layer projects its current token into three vectors: query (Q), key (K), and value (V). Q is consumed immediately. K and V are written into the cache and stay resident — every later token’s attention has to look back at every earlier token’s K and V.
Bytes per token equal 2 × num_layers × num_kv_heads × head_dim × bytes_per_element. The factor
of 2 is K and V; everything else is fixed by the model architecture except the precision. For a
Llama-3-70B model: 80 layers × 8 KV heads × 128 head_dim × 2 bytes (FP16) × 2 (K + V) = 327,680
bytes ≈ 320 KB per token. At 8K context per user that’s ~2.5 GB. At 128K it’s ~40 GB. That’s
per-user.
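The arithmetic is worth sanity-checking by hand. A minimal sketch in Python (the helper name is ours, not the calculator's API; the geometry is the published Llama-3-70B configuration):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache written per token; the leading 2 is the K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Llama-3-70B geometry, KV kept in FP16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_element=2)
print(f"{per_token} bytes = {per_token / 1024:.0f} KB per token")     # 327680 bytes = 320 KB

for context in (8 * 1024, 128 * 1024):
    print(f"{context:>7} tokens -> {per_token * context / 1024**3:.1f} GB of KV per user")
# ~2.5 GB at 8K, ~40 GB at 128K, per user, before a single weight is counted.
```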
KV cache stores K and V vectors for every token, every layer, every K/V head. They have to stay resident for as long as the conversation is active. The cost is fixed by architecture and grows linearly with context.
03 The bytes equation, plugged in. §
Pick a model and a precision. The widget reads the live kv_per_token_bytes from the calculator
— the same number that lands in the Memory card on /calculate. Multiply by your context length
and you have the per-user KV footprint.
bytes_per_token = 2 · num_layers · num_kv_heads · head_dim · bytes_per_element
Three things change the per-token number: the layer count, the KV head geometry (num_kv_heads × head_dim), and the precision (bytes_per_element). The first two are architecture, fixed the moment you pick a model; only the precision is a runtime choice.
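A quick sketch of how those inputs move the number. The Llama-3-8B geometry (32 layers, 8 KV heads, 128 head_dim) is added as a second data point; everything else comes straight from the equation:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    return 2 * layers * kv_heads * head_dim * bytes_per_element

configs = {"Llama-3-70B": (80, 8, 128), "Llama-3-8B": (32, 8, 128)}   # (layers, kv_heads, head_dim)
precisions = {"FP16": 2, "FP8": 1, "INT4": 0.5}

for name, geometry in configs.items():
    row = [f"{prec}: {kv_bytes_per_token(*geometry, nbytes) / 1024:>5.0f} KB"
           for prec, nbytes in precisions.items()]
    print(f"{name:>12} | " + "  ".join(row))

# 70B: 320 / 160 / 80 KB per token; 8B: 128 / 64 / 32 KB.
# No GPU, engine, or batch size appears anywhere in the calculation.
```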
KV per-token bytes are an architectural fact of the model, modulated by quantisation. The KB number is the same on every GPU, every engine, every batch size.
04 Capacity is what's left after the weights. §
Free VRAM equals total VRAM minus weights minus runtime activation overhead. Whatever’s left is
the entire KV-cache budget. Compare it against kv_per_token × context × concurrent and you have either a
fitting deployment or an OOM error — there is no third option.
headroom = vram − weights − activation_overhead − (kv_per_token · context · concurrent)
The widget plots the breakdown live. Move the context slider from 8K to 32K and watch the green slice (free) collapse. Move the concurrent slider from 1 to 32 and watch the same thing happen linearly. The “max concurrent at this context” readout is the calculator’s answer to “how many streams will this fit?” The status badge is the binary you can’t escape.
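A sketch of the same accounting the widget does, with illustrative round numbers: the 70 GB of FP8 weights and the 5 GB runtime allowance are assumptions, and the function name is ours, not the calculator's:

```python
GB = 1024**3

def max_concurrent(vram_gb, weights_gb, overhead_gb, kv_per_token_bytes, context_tokens):
    """How many full-length KV caches fit in what's left after weights and overhead."""
    kv_budget = (vram_gb - weights_gb - overhead_gb) * GB
    per_user = kv_per_token_bytes * context_tokens
    return max(int(kv_budget // per_user), 0)

# H100 SXM; 70B FP8 weights taken as 70 GB, plus a 5 GB runtime allowance (both illustrative).
for context in (2_048, 8_192, 32_768):
    users = max_concurrent(vram_gb=80, weights_gb=70, overhead_gb=5,
                           kv_per_token_bytes=327_680, context_tokens=context)
    print(f"{context:>6} ctx -> max {users} concurrent "
          f"({'fits' if users else 'OOM even at concurrency 1'})")
```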
headroom = vram − weights − activation_overhead − (kv_per_token · context · concurrent). The headroom has to clear zero; if it doesn’t, that configuration serves no requests at all.
05 Concurrent × context × KV precision is one budget. §
In the capacity equation, concurrent, context, and KV precision enter as a single product. Doubling context halves the maximum concurrent users. Switching the KV cache from FP16 to FP8 doubles them; switching to INT4 quadruples them. These are the only three knobs you have at runtime.
The trade-off explains why long-context products usually serve fewer concurrent users than short-context ones at the same price tier. It explains why “we’ll just lower precision” is a real, often-correct answer to a capacity-bound deployment — KV quant tends to degrade quality later than weight quant, especially with FP8. And it explains why the right answer for an agent that uses 64K context is rarely the same hardware as the right answer for a chat product at 4K context, even when the model is identical.
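To make the shared budget concrete, here is a small sweep over the two cheap knobs. The 10 GB KV budget is an assumed round number, not a measurement:

```python
KV_BUDGET_BYTES = 10 * 1024**3          # assumed round number left over after weights + overhead
FP16_PER_TOKEN = 327_680                # Llama-3-70B GQA, from the bytes equation above

for precision, scale in (("FP16", 1.0), ("FP8", 0.5), ("INT4", 0.25)):
    per_token = FP16_PER_TOKEN * scale
    cells = [f"{ctx // 1024}K ctx: {int(KV_BUDGET_BYTES // (per_token * ctx)):>3}"
             for ctx in (4_096, 8_192, 16_384)]
    print(f"{precision:>4} KV | " + "   ".join(cells))

# Each doubling of context halves every entry; each precision step doubles the whole row:
#  FP16 KV | 4K ctx:   8   8K ctx:   4   16K ctx:   2
#   FP8 KV | 4K ctx:  16   8K ctx:   8   16K ctx:   4
#  INT4 KV | 4K ctx:  32   8K ctx:  16   16K ctx:   8
```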
Three runtime levers — concurrent, context, KV precision — share one budget. Doubling any one of them eats half the room left for the others.
06 Three levers, ranked by leverage. §
Architectural levers (GQA / MLA) live in the model and aren’t yours to pick at runtime, but they
matter enormously for which model you choose to deploy. A model with full MHA caches a K and a V vector for every query head; GQA shares each pair across a group of query heads; MLA stores only a compressed latent per layer and re-expands it at attention time.
GQA cuts KV by ~8× over MHA · MLA goes further by storing a compressed latent
The bars compare the same prompt on three KV regimes: hypothetical full MHA (Llama-3-70B head geometry × 8 head groups), real GQA (Llama-3-70B as deployed), and real MLA (DeepSeek-V3). MHA to GQA is roughly 8× — the head grouping factor in Llama-3. GQA to MLA is another ~5× — DeepSeek-V3 stores a compressed latent and re-expands at attention time. Combined, MLA is more than 40× lighter than full MHA per token; that’s why long-context production has converged on these two architectures.
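A rough reconstruction of those bars. The DeepSeek-V3 figures (61 layers, a 512-dim latent plus a 64-dim decoupled RoPE key per token per layer) come from its published config; whether you count the RoPE key moves the combined factor between roughly 37× and 42×, so treat the exact ratios as approximate:

```python
def kv_per_token_kb(layers, cached_elems_per_layer, bytes_per_element=2):
    """KV bytes per token = layers x cached elements per layer x precision (FP16 here)."""
    return layers * cached_elems_per_layer * bytes_per_element / 1024

# Hypothetical full MHA: cache K and V for all 64 query heads of Llama-3-70B.
mha = kv_per_token_kb(80, 2 * 64 * 128)
# GQA, Llama-3-70B as shipped: 8 shared KV heads instead of 64.
gqa = kv_per_token_kb(80, 2 * 8 * 128)
# MLA, DeepSeek-V3 (approximate): one 512-dim latent plus a 64-dim RoPE key per layer, no separate V.
mla = kv_per_token_kb(61, 512 + 64)

print(f"MHA {mha:.0f} KB   GQA {gqa:.0f} KB   MLA {mla:.1f} KB")
print(f"MHA/GQA ~{mha / gqa:.0f}x   GQA/MLA ~{gqa / mla:.1f}x   MHA/MLA ~{mha / mla:.0f}x")
# MHA 2560 KB, GQA 320 KB, MLA ~69 KB: roughly 8x, ~4.7x, and ~37x respectively.
```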
The runtime engine layer adds paged attention on top of whichever architecture you picked: instead of reserving one contiguous, max-context-sized KV slab per request, the engine hands out the cache in small fixed-size pages on demand, so fragmentation stops taxing the budget.
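A toy sketch of the paging idea only, not any production engine's implementation; the 16-token page size and the class name are made up for illustration:

```python
PAGE_TOKENS = 16   # assumed page size; real engines choose their own

class PagedKVAllocator:
    """Toy version of the paging idea: KV memory is handed out in fixed-size pages
    as tokens arrive, instead of reserving a contiguous max-context slab per request."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))   # physical page ids still available
        self.page_table = {}                   # request id -> list of physical page ids
        self.tokens = {}                       # request id -> tokens cached so far

    def append_token(self, req):
        """Reserve room for one more token's K/V; returns the physical page it lands in."""
        if self.tokens.get(req, 0) % PAGE_TOKENS == 0:       # current page full, or first token
            if not self.free:
                raise MemoryError("KV budget exhausted")     # the OOM the equation predicts
            self.page_table.setdefault(req, []).append(self.free.pop())
        self.tokens[req] = self.tokens.get(req, 0) + 1
        return self.page_table[req][-1]

    def release(self, req):
        """Conversation finished: its pages go straight back to the shared pool."""
        self.free.extend(self.page_table.pop(req, []))
        self.tokens.pop(req, None)

alloc = PagedKVAllocator(total_pages=4)
for _ in range(40):                            # a 40-token request occupies 3 pages, not 4
    alloc.append_token("req-A")
print(len(alloc.page_table["req-A"]), "pages used,", len(alloc.free), "page(s) still free")
```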
Architecture (GQA / MLA) sets the per-token floor; precision (FP16 / FP8 / INT4) sets the multiplier; engine paging removes the fragmentation tax. Three layers of saving, each from a different team.
07 Two slogans the equation refuses. §
The two slogans are “just buy more VRAM” and “just cut the context.” The honest version of the rule is “capacity is a budget, and you have three sliders that share it.” Reaching for context first is a habit from a world where capacity was free; it isn’t.
“More VRAM” and “less context” are the wrong slogans for a capacity-bound deployment. KV precision is usually the cheapest lever, KV architecture is the hardest, context is the most visible.
Synthesis
- KV cache stores K and V vectors per token per layer per K/V head, multiplied by precision. It’s separate from weights and from activations.
- Per-user KV footprint = kv_per_token × context_tokens. At 32K context for a 70B GQA model, that’s ~10 GB per user — orders of magnitude larger than people first guess.
- Capacity is free_vram ÷ kv_per_token ÷ context. Doubling any one of context, concurrent, or KV precision halves the room left for the others.
- The cliff is superlinear in context because more concurrent users compete for the same shrinking budget. 4× context is roughly 4× fewer concurrent streams.
- Three structural answers: GQA (the new default), MLA (an order of magnitude better at long context), paged attention (fragmentation removal at the engine layer). Production long-context deployments pair one of the first two with the third.
What's next
Stage 3 ends with the capacity equation and three architectural levers. Stage 4 zooms into the optimisation toolbox itself: quantisation, attention variants, MoE, parallelism, batching and engine choice — with the same library backing every decision.