Stage 3
Why does an 80 GB GPU run out of room before it runs out of compute?
Builds on Stage 2 (decode bottleneck). Bring the bandwidth-equation mental model with you.
An H100 SXM has 80 GB of VRAM. A 70B model’s FP8 weights alone take about 70 GB; add runtime activation
overhead and you’re left with only a few gigabytes of headroom — and that headroom is the entire concurrent-user
budget. Every long-context conversation in flight costs gigabytes of it for its KV cache.
01 The capacity cliff is real, sharp, and superlinear. §
Hold the model fixed (Llama-3-70B FP8). Hold the GPU fixed (H100 SXM, 80 GB). Hold the precision fixed (KV in FP16). Vary only context length. Read off how many concurrent users the box can host.
same model · same GPU · only context length changed · capacity drops faster than linearly
The drop from 2K to 8K is roughly 4×, which is what you’d expect from “context grew 4×”. From 8K to 32K is another 4×. From 32K to 128K is another 4×. None of that is gentle. The same hardware that comfortably hosts dozens of short-context users hosts a single-digit number at long context. Most teams discover this only when production traffic shifts from chat-style 2K exchanges to RAG-style 32K context windows and users start receiving 503s.
Concurrent capacity scales as roughly 1/context. Same model, same GPU; only the conversation length changed. The cliff is a property of the math, not of any engine choice.
02 What the KV cache actually is. §
Attention works token-by-token. Each layer projects its current token into three vectors: query (Q), key (K), and value (V). Q is consumed immediately. K and V are written into the cache and stay resident — every later token’s attention has to look back at every earlier token’s K and V.
Bytes per token equal 2 × num_layers × num_kv_heads × head_dim × bytes_per_element. The factor
of 2 is K and V; everything else is fixed by the model architecture except the precision. For a
Llama-3-70B model: 80 layers × 8 KV heads × 128 head_dim × 2 bytes (FP16) × 2 (K + V) = 327,680
bytes ≈ 320 KB per token. At 8K context per user that’s ~2.5 GB. At 128K it’s ~40 GB. That’s
per-user.
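The arithmetic is worth sanity-checking by hand. A minimal sketch in Python (the helper name is ours, not the calculator's API; the geometry is the published Llama-3-70B configuration):

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """Bytes of KV cache written per token; the leading 2 is the K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

# Llama-3-70B geometry, KV kept in FP16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_element=2)
print(f"{per_token} bytes = {per_token / 1024:.0f} KB per token")     # 327680 bytes = 320 KB

for context in (8 * 1024, 128 * 1024):
    print(f"{context:>7} tokens -> {per_token * context / 1024**3:.1f} GB of KV per user")
# ~2.5 GB at 8K, ~40 GB at 128K, per user, before a single weight is counted.
```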
KV cache stores K and V vectors for every token, every layer, every K/V head. They have to stay resident for as long as the conversation is active. The cost is fixed by architecture and grows linearly with context.
03 The bytes equation, plugged in. §
Pick a model and a precision. The widget reads the live kv_per_token_bytes from the calculator
— the same number that lands in the Memory card on /calculate. Multiply by your context length
and you have the per-user KV footprint.
bytes_per_token = 2 · num_layers · num_kv_heads · head_dim · bytes_per_element
Three things change the per-token number: the layer count, the KV head geometry (num_kv_heads × head_dim), and the precision (bytes_per_element). The first two are architecture, fixed the moment you pick a model; only the precision is a runtime choice.
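A quick sketch of how those inputs move the number. The Llama-3-8B geometry (32 layers, 8 KV heads, 128 head_dim) is added as a second data point; everything else comes straight from the equation:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_element):
    return 2 * layers * kv_heads * head_dim * bytes_per_element

configs = {"Llama-3-70B": (80, 8, 128), "Llama-3-8B": (32, 8, 128)}   # (layers, kv_heads, head_dim)
precisions = {"FP16": 2, "FP8": 1, "INT4": 0.5}

for name, geometry in configs.items():
    row = [f"{prec}: {kv_bytes_per_token(*geometry, nbytes) / 1024:>5.0f} KB"
           for prec, nbytes in precisions.items()]
    print(f"{name:>12} | " + "  ".join(row))

# 70B: 320 / 160 / 80 KB per token; 8B: 128 / 64 / 32 KB.
# No GPU, engine, or batch size appears anywhere in the calculation.
```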
KV per-token bytes are an architectural fact of the model, modulated by quantisation. The KB number is the same on every GPU, every engine, every batch size.
04 Capacity is what's left after the weights. §
Free VRAM equals total VRAM minus weights minus runtime activation overhead. Whatever’s left is
the entire KV-cache budget. Compare it against kv_per_token × context × concurrent and you have either a
fitting deployment or an OOM error — there is no third option.
headroom = vram − weights − activation_overhead − (kv_per_token · context · concurrent)
The widget plots the breakdown live. Move the context slider from 8K to 32K and watch the green slice (free) collapse. Move the concurrent slider from 1 to 32 and watch the same thing happen linearly. The “max concurrent at this context” readout is the calculator’s answer to “how many streams will this fit?” The status badge is the binary you can’t escape.
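A sketch of the same accounting the widget does, with illustrative round numbers: the 70 GB of FP8 weights and the 5 GB runtime allowance are assumptions, and the function name is ours, not the calculator's:

```python
GB = 1024**3

def max_concurrent(vram_gb, weights_gb, overhead_gb, kv_per_token_bytes, context_tokens):
    """How many full-length KV caches fit in what's left after weights and overhead."""
    kv_budget = (vram_gb - weights_gb - overhead_gb) * GB
    per_user = kv_per_token_bytes * context_tokens
    return max(int(kv_budget // per_user), 0)

# H100 SXM; 70B FP8 weights taken as 70 GB, plus a 5 GB runtime allowance (both illustrative).
for context in (2_048, 8_192, 32_768):
    users = max_concurrent(vram_gb=80, weights_gb=70, overhead_gb=5,
                           kv_per_token_bytes=327_680, context_tokens=context)
    print(f"{context:>6} ctx -> max {users} concurrent "
          f"({'fits' if users else 'OOM even at concurrency 1'})")
```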
headroom = vram − weights − activation_overhead − (kv_per_token · context · concurrent). The headroom has to clear zero; if it doesn’t, that configuration serves no requests at all.
05 Concurrent × context × KV precision is one budget. §
In the capacity equation, concurrent, context, and KV precision enter as a single product. Doubling context halves the maximum concurrent users. Switching the KV cache from FP16 to FP8 doubles them; switching to INT4 quadruples them. These are the only three knobs you have at runtime.
The trade-off explains why long-context products usually serve fewer concurrent users than short-context ones at the same price tier. It explains why “we’ll just lower precision” is a real, often-correct answer to a capacity-bound deployment — KV quant tends to degrade quality later than weight quant, especially with FP8. And it explains why the right answer for an agent that uses 64K context is rarely the same hardware as the right answer for a chat product at 4K context, even when the model is identical.
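To make the shared budget concrete, here is a small sweep over the two cheap knobs. The 10 GB KV budget is an assumed round number, not a measurement:

```python
KV_BUDGET_BYTES = 10 * 1024**3          # assumed round number left over after weights + overhead
FP16_PER_TOKEN = 327_680                # Llama-3-70B GQA, from the bytes equation above

for precision, scale in (("FP16", 1.0), ("FP8", 0.5), ("INT4", 0.25)):
    per_token = FP16_PER_TOKEN * scale
    cells = [f"{ctx // 1024}K ctx: {int(KV_BUDGET_BYTES // (per_token * ctx)):>3}"
             for ctx in (4_096, 8_192, 16_384)]
    print(f"{precision:>4} KV | " + "   ".join(cells))

# Each doubling of context halves every entry; each precision step doubles the whole row:
#  FP16 KV | 4K ctx:   8   8K ctx:   4   16K ctx:   2
#   FP8 KV | 4K ctx:  16   8K ctx:   8   16K ctx:   4
#  INT4 KV | 4K ctx:  32   8K ctx:  16   16K ctx:   8
```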
Three runtime levers — concurrent, context, KV precision — share one budget. Doubling any one of them eats half the room left for the others.
06 Three levers, ranked by leverage. §
Architectural levers (GQA / MLA) live in the model and aren’t yours to pick at runtime, but they
matter enormously for which model you choose to deploy. A model with full MHA caches a K and a V vector for every query head; GQA shares each pair across a group of query heads; MLA stores only a compressed latent per layer and re-expands it at attention time.
GQA cuts KV by ~8× over MHA · MLA goes further by storing a compressed latent
The bars compare the same prompt on three KV regimes: hypothetical full MHA (Llama-3-70B head geometry × 8 head groups), real GQA (Llama-3-70B as deployed), and real MLA (DeepSeek-V3). MHA to GQA is roughly 8× — the head grouping factor in Llama-3. GQA to MLA is another ~5× — DeepSeek-V3 stores a compressed latent and re-expands at attention time. Combined, MLA is more than 40× lighter than full MHA per token; that’s why long-context production has converged on these two architectures.
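A rough reconstruction of those bars. The DeepSeek-V3 figures (61 layers, a 512-dim latent plus a 64-dim decoupled RoPE key per token per layer) come from its published config; whether you count the RoPE key moves the combined factor between roughly 37× and 42×, so treat the exact ratios as approximate:

```python
def kv_per_token_kb(layers, cached_elems_per_layer, bytes_per_element=2):
    """KV bytes per token = layers x cached elements per layer x precision (FP16 here)."""
    return layers * cached_elems_per_layer * bytes_per_element / 1024

# Hypothetical full MHA: cache K and V for all 64 query heads of Llama-3-70B.
mha = kv_per_token_kb(80, 2 * 64 * 128)
# GQA, Llama-3-70B as shipped: 8 shared KV heads instead of 64.
gqa = kv_per_token_kb(80, 2 * 8 * 128)
# MLA, DeepSeek-V3 (approximate): one 512-dim latent plus a 64-dim RoPE key per layer, no separate V.
mla = kv_per_token_kb(61, 512 + 64)

print(f"MHA {mha:.0f} KB   GQA {gqa:.0f} KB   MLA {mla:.1f} KB")
print(f"MHA/GQA ~{mha / gqa:.0f}x   GQA/MLA ~{gqa / mla:.1f}x   MHA/MLA ~{mha / mla:.0f}x")
# MHA 2560 KB, GQA 320 KB, MLA ~69 KB: roughly 8x, ~4.7x, and ~37x respectively.
```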
The runtime engine layer adds paged attention on top of whichever architecture you picked: instead of reserving one contiguous, max-context-sized KV slab per request, the engine hands out the cache in small fixed-size pages on demand, so fragmentation stops taxing the budget.
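A toy sketch of the paging idea only, not any production engine's implementation; the 16-token page size and the class name are made up for illustration:

```python
PAGE_TOKENS = 16   # assumed page size; real engines choose their own

class PagedKVAllocator:
    """Toy version of the paging idea: KV memory is handed out in fixed-size pages
    as tokens arrive, instead of reserving a contiguous max-context slab per request."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))   # physical page ids still available
        self.page_table = {}                   # request id -> list of physical page ids
        self.tokens = {}                       # request id -> tokens cached so far

    def append_token(self, req):
        """Reserve room for one more token's K/V; returns the physical page it lands in."""
        if self.tokens.get(req, 0) % PAGE_TOKENS == 0:       # current page full, or first token
            if not self.free:
                raise MemoryError("KV budget exhausted")     # the OOM the equation predicts
            self.page_table.setdefault(req, []).append(self.free.pop())
        self.tokens[req] = self.tokens.get(req, 0) + 1
        return self.page_table[req][-1]

    def release(self, req):
        """Conversation finished: its pages go straight back to the shared pool."""
        self.free.extend(self.page_table.pop(req, []))
        self.tokens.pop(req, None)

alloc = PagedKVAllocator(total_pages=4)
for _ in range(40):                            # a 40-token request occupies 3 pages, not 4
    alloc.append_token("req-A")
print(len(alloc.page_table["req-A"]), "pages used,", len(alloc.free), "page(s) still free")
```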
Architecture (GQA / MLA) sets the per-token floor; precision (FP16 / FP8 / INT4) sets the multiplier; engine paging removes the fragmentation tax. Three layers of saving, each from a different team.
07 Two slogans the equation refuses. §
The two slogans are “just buy more VRAM” and “just cut the context.” The honest version of the rule is “capacity is a budget, and you have three sliders that share it.” Reaching for context first is a habit from a world where capacity was free; it isn’t.
“More VRAM” and “less context” are the wrong slogans for a capacity-bound deployment. KV precision is usually the cheapest lever, KV architecture is the hardest, context is the most visible.
Synthesis
- KV cache stores K and V vectors per token per layer per K/V head, multiplied by precision. It’s separate from weights and from activations.
- Per-user KV footprint = kv_per_token × context_tokens. At 32K context for a 70B GQA model, that’s ~10 GB per user — orders of magnitude larger than people first guess.
- Capacity is free_vram ÷ kv_per_token ÷ context. Doubling any one of context, concurrent, or KV precision halves the room left for the others.
- The cliff is superlinear in context because more concurrent users compete for the same shrinking budget. 4× context is roughly 4× fewer concurrent streams.
- Three structural answers: GQA (the new default), MLA (an order of magnitude better at long context), paged attention (fragmentation removal at the engine layer). Production long-context deployments pair one of the first two with the third.
What's next
Stage 3 ends with the capacity equation and three architectural levers. Stage 4 zooms into the optimisation toolbox itself: quantisation, attention variants, MoE, parallelism, batching and engine choice — with the same library backing every decision.