Stage 4.1
Memory levers — quantisation, KV compression, attention variants
Stage 4 hub. Read Stage 2 (decode equation) and Stage 3 (capacity equation) first.
Stage 2 made the case that decode is bandwidth-bound. So the highest-leverage optimisations are the ones that reduce bytes streamed per decode step. Three of them — weight quantisation, KV cache quantisation, and attention variants — together account for most of the throughput gains shipped in inference engines over the last three years.
01 Weight quantisation: halve the bytes per param. §
Weight precision sets bytes_per_param directly. FP16 → FP8 halves it; FP8 → INT4 halves it again. Decode TPS roughly doubles per halving, exactly as the decode equation predicts. The catch is quality.
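A back-of-envelope sketch of that doubling, under the Stage 2 assumption that decode streams the full weights once per token. The bandwidth and model size are illustrative picks (H100-SXM-class HBM, a 70B model), not benchmarks:

```python
# Decode TPS from the Stage 2 equation: one full weight read per token.
# Assumptions: ~3.35 TB/s HBM (H100 SXM class), 70B params, KV reads
# ignored (reasonable at low concurrency, where the weight read dominates).
HBM_BW_BYTES_PER_S = 3.35e12
PARAMS = 70e9

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_bytes = PARAMS * bytes_per_param
    tps = HBM_BW_BYTES_PER_S / weight_bytes
    print(f"{name}: ~{tps:.0f} tok/s")  # FP16 ~24, FP8 ~48, INT4 ~96
```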
Modern model families — Llama-3, Qwen-2.5, DeepSeek — survive FP8 with no measurable quality loss on standard benchmarks. INT4 starts to bite: arithmetic, reasoning chains, and code generation regress 1–3% on typical evals; subjective coherence regresses more. INT4 with state-of-the-art schemes (AWQ, GPTQ, QuaRot) recovers most of the gap; FP16 → INT4 with naive RTN does not.
The practical default for production has settled at FP8 weights. INT4 is for hardware-budget-bound deployments — single-GPU 70B-class serving, edge inference, latency-critical loops where the quality loss is acceptable.
FP8 is free for current open-weight models. INT4 is a real trade-off — meaningful in single-GPU serving and edge, marginal in multi-GPU production where TP buys the same memory back without the quality cost.
02 KV cache quantisation: halve the cache. §
Same arithmetic, applied to the KV cache: bytes per element drop from 2 (FP16) to 1 (FP8) to 0.5 (INT4), and per-token cache bytes fall with them. Capacity (max concurrent users at a given context) goes up by the same factor. Per-user TPS changes only slightly, because the KV is small relative to the weights at low concurrency and the decode equation is dominated by the weight read.
bytes_per_token = 2 · num_layers · num_kv_heads · head_dim · bytes_per_element
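Plugging in a Llama-3-70B-shaped geometry (80 layers, 8 KV heads, head_dim 128) makes the capacity factor concrete. The 40 GB of free HBM below is a hypothetical headroom figure, not a measurement:

```python
# Per-token KV bytes (formula above) and the concurrent users that fit.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = K and V

FREE_HBM = 40e9   # hypothetical headroom left for KV after weights
CONTEXT = 8192    # tokens held per user

for name, elem_bytes in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    per_tok = kv_bytes_per_token(80, 8, 128, elem_bytes)  # Llama-3-70B geometry
    per_user = per_tok * CONTEXT
    print(f"{name}: {per_tok / 1024:.0f} KiB/token, "
          f"{per_user / 1e9:.2f} GB/user, "
          f"{int(FREE_HBM // per_user)} users")  # 14 -> 29 -> 59
```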
The quality story is different from weight quant. Quantising K/V vectors directly affects attention scores; aggressive KV compression on long context can produce repetition or forgetting failure modes. FP8 KV is widely used in production and tracks FP16 closely; INT4 KV tends to require asymmetric quant or per-token scales to avoid degradation past ~8K context.
FP8 KV is the production default — it does for cache capacity what FP8 weights did for weight memory. INT4 KV is workable below 8K context; above that, watch for attention degradation in long-form outputs.
03 Attention variants: shrink the K/V geometry itself. §
Quantisation shrinks the bytes_per_element factor in the per-token formula; architecture shrinks the geometry term (num_layers · num_kv_heads · head_dim) itself. Three architectural moves matter for KV size; the sketch after the list puts numbers on them:
MHA (multi-head attention): every Q head has its own K and V heads. Original transformer geometry. Largest cache.
GQA (grouped-query attention): a small group of K/V heads serves many Q heads. 8× cache reduction in Llama-3 / Mistral.
MLA (multi-head latent attention): K and V are stored as a low-rank latent and re-expanded per head at attention time. Another ~5× smaller than GQA.
GQA cuts KV by ~8× over MHA · MLA goes further by storing a compressed latent
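A sketch of the geometry, all FP16 so only the architecture moves. The MHA/GQA head counts follow Llama-3-70B; the MLA row grafts DeepSeek-V3's published latent width (512 compressed dims plus a 64-dim decoupled-RoPE key per layer) onto the same 80 layers for comparability, so the GQA → MLA factor lands near 3.5× here rather than the ~5× quoted above; the exact ratio is model-dependent:

```python
# Per-token KV bytes for the three attention variants, FP16, 80 layers.
LAYERS, HEAD_DIM, FP16 = 80, 128, 2

mha = 2 * LAYERS * 64 * HEAD_DIM * FP16  # 64 K/V heads, one per Q head
gqa = 2 * LAYERS * 8 * HEAD_DIM * FP16   # 8 shared K/V heads
mla = LAYERS * (512 + 64) * FP16         # one shared latent + RoPE key per
                                         # layer; no per-head K/V, no 2x

for name, b in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {b / 1024:.0f} KiB/token")  # 2560, 320, 90
```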
Sliding-window attention and other forms of sparse attention attack the length multiplier rather than the per-token constant. The KV cache only needs to retain the last W tokens’ vectors, where W is the window. Mistral-7B used a 4K window; production-relevant work on long context now layers sparse attention over MLA for further wins.
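The length cap in two lines, reusing the FP16 GQA per-token figure from above and Mistral-7B's 4K window purely for shape (the two models' real geometries differ):

```python
# Sliding window: cache holds at most W tokens, whatever the context length.
W, per_tok = 4096, 327_680  # 4K window; FP16 GQA bytes/token from above
for context in (4_096, 32_768, 131_072):
    gb = min(context, W) * per_tok / 1e9
    print(f"{context:>7} ctx -> {gb:.2f} GB cache")  # flat at 1.34 GB past 4K
```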
These are model choices, not runtime ones. You don’t pick GQA at deploy time; the model was trained with it. The lesson is that long-context production has converged on MLA and GQA because the alternatives don’t hold up at 32K+ context with realistic concurrency.
Architecture sets the floor. Quantisation moves the multiplier. Pick a model with the right attention variant first; quantise second.
04 The compounded trade-off. §
The three levers compound multiplicatively. FP8 KV halves per-token cache bytes (2× capacity); MLA cuts another ~5× on top; FP8 weights free several GB more of headroom for KV — together ~10× the concurrent users at long context over an FP16 / FP16 / GQA baseline, plus ~2× decode TPS from the weight-precision halving. That’s why DeepSeek-V3 at FP8 on H100 SXM serves long-context workloads that Llama-3-70B at FP16 simply can’t host.
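Spelled out, the compounding is just multiplication of independent factors; every number below is the approximate headline figure from this page, and all of them are model-dependent:

```python
# Capacity multiplier vs an FP16-KV / GQA baseline at fixed context length.
kv_fp8   = 2.0   # FP8 KV halves bytes/token
mla      = 5.0   # ~5x smaller latent vs GQA (model-dependent)
capacity = kv_fp8 * mla   # ~10x concurrent users, before counting the extra
                          # HBM that FP8 weights free up for KV on top
decode   = 2.0            # FP8 weights halve bytes/step -> ~2x decode TPS
print(f"~{capacity:.0f}x capacity, ~{decode:.0f}x decode TPS")
```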
Diminishing returns kick in past two halvings. INT4 weights with INT4 KV with sliding-window attention is the limit of useful compounding; beyond that, quality regression compounds too. The serious teams ship FP8/FP8 by default and reach for INT4 only when constrained.
Memory levers compound up to a point. The production sweet spot in 2025–2026 is FP8 weights + FP8 KV + GQA or MLA architecture; everything else is a context-specific exception.
What's next
Memory levers move bytes per decode step. Compute levers move work per decode step — MoE, speculative decoding, and the architectural choices that decouple effective parameters from per-token cost.