Stage 4.1
Memory levers — quantisation, KV compression, attention variants
Stage 4 hub. Read Stage 2 (decode equation) and Stage 3 (capacity equation) first.
Stage 2 made the case that decode is bandwidth-bound. So the highest-leverage optimisations are the ones that reduce bytes streamed per decode step. Three of them — weight quantisation, KV cache quantisation, and attention variants — together account for most of the throughput gains shipped in inference engines over the last three years.
01 Weight quantisation: halve the bytes per param. §
Weight precision sets bytes_per_param directly. FP16 → FP8 halves it; FP8 → INT4 halves it again. Decode TPS roughly doubles per halving, exactly as the decode equation predicts. The catch is quality.
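A back-of-envelope sketch of that doubling, under the Stage 2 assumption that decode streams the full weights once per token. The bandwidth and model size are illustrative picks (H100-SXM-class HBM, a 70B model), not benchmarks:

```python
# Decode TPS from the Stage 2 equation: one full weight read per token.
# Assumptions: ~3.35 TB/s HBM (H100 SXM class), 70B params, KV reads
# ignored (reasonable at low concurrency, where the weight read dominates).
HBM_BW_BYTES_PER_S = 3.35e12
PARAMS = 70e9

for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_bytes = PARAMS * bytes_per_param
    tps = HBM_BW_BYTES_PER_S / weight_bytes
    print(f"{name}: ~{tps:.0f} tok/s")  # FP16 ~24, FP8 ~48, INT4 ~96
```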
Modern model families — Llama-3, Qwen-2.5, DeepSeek — survive FP8 with no measurable quality loss on standard benchmarks. INT4 starts to bite: arithmetic, reasoning chains, and code generation regress 1–3% on typical evals; subjective coherence regresses more. INT4 with state-of-the-art schemes (AWQ, GPTQ, QuaRot) recovers most of the gap; FP16 → INT4 with naive RTN does not.
The practical default for production has settled at FP8 weights. INT4 is for hardware-budget-bound deployments — single-GPU 70B-class serving, edge inference, latency-critical loops where the quality loss is acceptable.
FP8 is free for current open-weight models. INT4 is a real trade-off — meaningful in single-GPU serving and edge, marginal in multi-GPU production where TP buys the same memory back without the quality cost.
02 KV cache quantisation: halve the cache. §
Same arithmetic, applied to the KV cache: bytes per element drop from 2 (FP16) to 1 (FP8) to 0.5 (INT4), and per-token cache bytes fall with them. Capacity (max concurrent users at a given context) goes up by the same factor. Per-user TPS changes only slightly, because the KV is small relative to the weights at low concurrency and the decode equation is dominated by the weight read.
bytes_per_token = 2 · num_layers · num_kv_heads · head_dim · bytes_per_element
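Plugging in a Llama-3-70B-shaped geometry (80 layers, 8 KV heads, head_dim 128) makes the capacity factor concrete. The 40 GB of free HBM below is a hypothetical headroom figure, not a measurement:

```python
# Per-token KV bytes (formula above) and the concurrent users that fit.
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element  # 2 = K and V

FREE_HBM = 40e9   # hypothetical headroom left for KV after weights
CONTEXT = 8192    # tokens held per user

for name, elem_bytes in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    per_tok = kv_bytes_per_token(80, 8, 128, elem_bytes)  # Llama-3-70B geometry
    per_user = per_tok * CONTEXT
    print(f"{name}: {per_tok / 1024:.0f} KiB/token, "
          f"{per_user / 1e9:.2f} GB/user, "
          f"{int(FREE_HBM // per_user)} users")  # 14 -> 29 -> 59
```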
The quality story is different from weight quant. Quantising K/V vectors directly affects attention scores; aggressive KV compression on long context can produce repetition or forgetting failure modes. FP8 KV is widely used in production and tracks FP16 closely; INT4 KV tends to require asymmetric quant or per-token scales to avoid degradation past ~8K context.
FP8 KV is the production default — it does for cache capacity what FP8 weights did for weight memory. INT4 KV is workable below 8K context; above that, watch for attention degradation in long-form outputs.
03 Attention variants: shrink the K/V geometry itself. §
Quantisation shrinks the bytes_per_element factor in the per-token formula; architecture shrinks the geometry term (num_layers · num_kv_heads · head_dim) itself. Three architectural moves matter for KV size; the sketch after the list puts numbers on them:
MHA (multi-head attention): every Q head has its own K and V heads. Original transformer geometry. Largest cache.
GQA (grouped-query attention): a small group of K/V heads serves many Q heads. 8× cache reduction in Llama-3 / Mistral.
MLA (multi-head latent attention): K and V are stored as a low-rank latent and re-expanded per head at attention time. Another ~5× smaller than GQA.
GQA cuts KV by ~8× over MHA · MLA goes further by storing a compressed latent
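A sketch of the geometry, all FP16 so only the architecture moves. The MHA/GQA head counts follow Llama-3-70B; the MLA row grafts DeepSeek-V3's published latent width (512 compressed dims plus a 64-dim decoupled-RoPE key per layer) onto the same 80 layers for comparability, so the GQA → MLA factor lands near 3.5× here rather than the ~5× quoted above; the exact ratio is model-dependent:

```python
# Per-token KV bytes for the three attention variants, FP16, 80 layers.
LAYERS, HEAD_DIM, FP16 = 80, 128, 2

mha = 2 * LAYERS * 64 * HEAD_DIM * FP16  # 64 K/V heads, one per Q head
gqa = 2 * LAYERS * 8 * HEAD_DIM * FP16   # 8 shared K/V heads
mla = LAYERS * (512 + 64) * FP16         # one shared latent + RoPE key per
                                         # layer; no per-head K/V, no 2x

for name, b in [("MHA", mha), ("GQA", gqa), ("MLA", mla)]:
    print(f"{name}: {b / 1024:.0f} KiB/token")  # 2560, 320, 90
```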
Sliding-window attention and other forms of sparse attention attack the length multiplier rather than the per-token constant. The KV cache only needs to retain the last W tokens’ vectors, where W is the window. Mistral-7B used a 4K window; production-relevant work on long context now layers sparse attention over MLA for further wins.
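The length cap in two lines, reusing the FP16 GQA per-token figure from above and Mistral-7B's 4K window purely for shape (the two models' real geometries differ):

```python
# Sliding window: cache holds at most W tokens, whatever the context length.
W, per_tok = 4096, 327_680  # 4K window; FP16 GQA bytes/token from above
for context in (4_096, 32_768, 131_072):
    gb = min(context, W) * per_tok / 1e9
    print(f"{context:>7} ctx -> {gb:.2f} GB cache")  # flat at 1.34 GB past 4K
```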
These are model choices, not runtime ones. You don’t pick GQA at deploy time; the model was trained with it. The lesson is that long-context production has converged on MLA and GQA because the alternatives don’t hold up at 32K+ context with realistic concurrency.
Architecture sets the floor. Quantisation moves the multiplier. Pick a model with the right attention variant first; quantise second.
04 The compounded trade-off. §
The three levers compound multiplicatively. FP8 KV halves per-token cache bytes (2× capacity); MLA cuts another ~5× on top; FP8 weights free several GB more of headroom for KV — together ~10× the concurrent users at long context over an FP16 / FP16 / GQA baseline, plus ~2× decode TPS from the weight-precision halving. That’s why DeepSeek-V3 at FP8 on H100 SXM serves long-context workloads that Llama-3-70B at FP16 simply can’t host.
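Spelled out, the compounding is just multiplication of independent factors; every number below is the approximate headline figure from this page, and all of them are model-dependent:

```python
# Capacity multiplier vs an FP16-KV / GQA baseline at fixed context length.
kv_fp8   = 2.0   # FP8 KV halves bytes/token
mla      = 5.0   # ~5x smaller latent vs GQA (model-dependent)
capacity = kv_fp8 * mla   # ~10x concurrent users, before counting the extra
                          # HBM that FP8 weights free up for KV on top
decode   = 2.0            # FP8 weights halve bytes/step -> ~2x decode TPS
print(f"~{capacity:.0f}x capacity, ~{decode:.0f}x decode TPS")
```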
Diminishing returns kick in past two halvings. INT4 weights with INT4 KV with sliding-window attention is the limit of useful compounding; beyond that, quality regression compounds too. The serious teams ship FP8/FP8 by default and reach for INT4 only when constrained.
Memory levers compound up to a point. The production sweet spot in 2025–2026 is FP8 weights + FP8 KV + GQA or MLA architecture; everything else is a context-specific exception.
What's next
Memory levers move bytes per decode step. Compute levers move work per decode step — MoE, speculative decoding, and the architectural choices that decouple effective parameters from per-token cost.