Stack × autoregress
Per-output-token cost = L × per-layer cost. Autoregress pays it every step.
Every output token pays the full stack cost: Llama-3-70B has L = 80 transformer layers, each with attention + FFN + two norms. The total per-step decode bandwidth is the sum of those per-layer phases, multiplied by L.
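A minimal sketch of that per-step arithmetic, assuming bf16 weights and the published Llama-3-70B shapes (d_model 8192, GQA with a 1024-wide KV projection, SwiGLU FFN of width 28672); the constant and function names are illustrative, not from any library:

```python
# Hedged sketch: per-step decode weight traffic for Llama-3-70B shapes.
D_MODEL = 8192        # hidden size
D_KV = 1024           # 8 KV heads x 128 head_dim (GQA)
D_FFN = 28672         # SwiGLU intermediate size
LAYERS = 80           # L, the stack multiplier
BYTES = 2             # bf16

def layer_weight_bytes() -> int:
    attn = 2 * D_MODEL * D_MODEL      # q_proj + o_proj
    attn += 2 * D_MODEL * D_KV        # k_proj + v_proj (narrow under GQA)
    ffn = 3 * D_MODEL * D_FFN         # gate, up, down projections
    norms = 2 * D_MODEL               # two RMSNorm weight vectors
    return (attn + ffn + norms) * BYTES

per_step = layer_weight_bytes() * LAYERS  # every layer, every output token
print(f"per layer: {layer_weight_bytes() / 1e9:.2f} GB")  # ~1.71 GB
print(f"per step:  {per_step / 1e9:.1f} GB")              # ~136.9 GB
```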
Autoregress means *one token at a time*. To generate 500 output tokens, you run 500 full forward passes, with the KV cache growing by one row of K and one row of V per layer per step. Prefill amortizes weight reads across all prompt tokens in a single pass; decode cannot: every output token re-reads every weight.
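The loop below extends the sketch above (reusing its per-layer figure) to show the cumulative traffic: weights are re-read at every step, while the KV-cache read grows by one row per layer per step. The function name and the 2000-token prompt context are illustrative assumptions:

```python
# Hedged continuation: cumulative decode traffic, one token at a time.
LAYER_WEIGHT_BYTES = 1_711_308_800   # ~1.71 GB/layer, from the sketch above
D_KV, LAYERS, BYTES = 1024, 80, 2

def decode_traffic(output_tokens: int, ctx: int = 0) -> float:
    total = 0.0
    for step in range(output_tokens):
        kv_rows = ctx + step                            # cache length this step
        kv_read = 2 * kv_rows * D_KV * BYTES * LAYERS   # re-read all K and V rows
        total += LAYER_WEIGHT_BYTES * LAYERS + kv_read  # weights never amortize
    return total

for n in (100, 500):
    print(f"{n} tokens: {decode_traffic(n, ctx=2000) / 1e12:.2f} TB")
# 500 tokens costs ~68.8 TB; ~68.5 TB of that is weight re-reads (500 x 136.9 GB).
```

Doubling output_tokens roughly doubles the total: the weight term is strictly linear in steps, and the KV term stays small at this scale.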
This is the synthesis of everything: one block teaches the per-layer math, the stack multiplier L is the model's depth, and autoregress sets how many times you pay the bill. The product is the cost of a generated response.
Try it: increase output_tokens and watch the per-request decode bandwidth scale linearly.