Stack × autoregress

Why decode reads everything every step. The KV cache as cumulative state; the autoregressive loop as the cost engine.

Per-output-token cost = L × per-layer cost. Autoregress pays it every step.

bytes / step · layer: 855.7 MB
bytes / step · model: 69.5 GB

Every output token pays the full stack cost: in Llama-3-70B, L = 80 transformer layers, each with attention + FFN + two norms. The total per-step decode bandwidth is the sum of every per-layer phase, multiplied by L.
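As a rough sketch of that multiplication (the parameter count and byte width here are assumptions for illustration; the panel's exact 855.7 MB / 69.5 GB figures come from the real layer dimensions, which differ slightly from a naive even split):

```python
# Rough sketch: per-step decode bandwidth from weight reads alone.
# Assumed numbers (70e9 params, 1 byte/param) are illustrative, not
# the exact configuration behind the panel figures above.
L = 80                    # transformer layers in Llama-3-70B
params = 70e9             # total parameter count (approximate)
bytes_per_param = 1       # e.g. int8-quantized weights

bytes_per_step_model = params * bytes_per_param   # read once per output token
bytes_per_step_layer = bytes_per_step_model / L   # naive even split across layers

print(f"per layer: {bytes_per_step_layer / 1e6:.0f} MB")  # 875 MB
print(f"per model: {bytes_per_step_model / 1e9:.0f} GB")  # 70 GB
```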

Autoregression means *one token at a time*. To generate 500 output tokens, you run 500 full forward passes, with the KV cache growing by one row of K and one row of V per step. Prefill amortizes weight reads across the prompt's tokens; decode cannot, because every generated token re-reads every weight.
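A minimal sketch of that loop, taking the 69.5 GB/step weight read from the panel above and assuming GQA-style KV dimensions (80 layers, 8 KV heads, head_dim 128, fp16; check these against the actual model config):

```python
# Sketch: total decode memory traffic for one request, one token at a time.
# weight_bytes comes from the panel above; the KV dimensions are assumed
# (80 layers, 8 KV heads, head_dim 128, fp16), not taken from the source.
weight_bytes = 69.5e9                       # weights re-read every step
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # K and V, all layers, fp16

output_tokens = 500
total = 0.0
for step in range(output_tokens):
    cached = step  # tokens already in the KV cache (prompt ignored here)
    total += weight_bytes + cached * kv_bytes_per_token

print(f"total decode traffic: {total / 1e12:.2f} TB")  # ~34.79 TB
```

The weight term dominates: the growing KV reads add only about 0.04 TB over 500 steps, which is why per-request decode cost scales almost linearly in the number of output tokens.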

This is the synthesis of everything so far: one block teaches the per-layer math, the stack multiplier sets the model's depth, and autoregression sets how many times you pay the bill. The product of the three is the cost of a generated response.

Try it: increase output_tokens and watch the per-request decode bandwidth scale linearly.

Try it in the calculator