Stage 2
Why is a 70B model more than 8× slower per token than an 8B model?
Builds on Stage 1 (token economy). Bring the prefill-vs-decode mental model with you.
An H100 will do a quadrillion floating-point operations per second. A 70B model needs about 140 billion of them to produce one output token. The arithmetic predicts thousands of tokens per second. Real systems hit twenty-five. The gap isn’t waste — it’s a different kind of cost the GPU is paying, and it explains every inference optimisation paper of the last five years.
01 The GPU has compute to spare. §
Pick a model. The numbers below are spec-sheet only: nothing measured, no benchmarks. Just one multiplication and one division.
FLOPs/token ≈ 2 × params · naive_tps = peak_flops ÷ flops_per_token
The headline of every H100 datasheet, 989 TFLOP/s of FP16 matrix throughput, implies that any 70B model should decode at roughly seven thousand tokens per second. That isn’t what providers deliver: real Llama-3-70B deployments run in the tens of tokens per second per user. So one of two things is true: either every implementation is leaving 99% of the GPU on the table, or the “compute-only” calculation is wrong about what decode is. The rest of this stage argues the second.
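As a sanity check on that spec-sheet arithmetic, here is a minimal sketch of the compute-only estimate in Python; the peak figure and the 2-FLOPs-per-parameter rule are the ones quoted above, and the estimate is naive by construction.

```python
# Compute-only ("naive") decode estimate: ignores memory traffic entirely.
PEAK_FLOPS = 989e12  # H100 SXM FP16 tensor peak, FLOP/s (datasheet figure)

def naive_tps(params: float, peak_flops: float = PEAK_FLOPS) -> float:
    """Tokens/s if matrix math were the only cost: ~2 FLOPs per parameter per token."""
    flops_per_token = 2 * params
    return peak_flops / flops_per_token

for name, params in [("Llama-3-8B", 8e9), ("Llama-3-70B", 70e9), ("Llama-3-405B", 405e9)]:
    print(f"{name}: {naive_tps(params):,.0f} tok/s (compute-only)")
# Llama-3-70B comes out around 7,000 tok/s; real systems deliver tens.
```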
The compute-only prediction sits roughly 30× to 280× above delivered decode throughput across model sizes. If decode were bottlenecked on math, dense throughput would land in the thousands per second. It doesn’t.
02 Same hardware. Hugely different speeds. §
Hold the GPU constant. Hold quantisation constant. Vary only the model. The numbers below come
from this site’s calculator, which ships the same library the CLI uses, and which is validated
against real benchmark traces (/validate).
same GPU · same precision · only the model changed
Llama-3-8B at single-user concurrency hits a few hundred tokens per second. Llama-3-70B does about twenty-five, roughly 9× slower for 9× the parameters. Llama-3-405B doesn’t fit on a single H100 in FP16 (810 GB of weights against 80 GB of VRAM); spread across enough H100s to hold the weights, the projected per-user rate still lands in single digits, extending the same near-linear trend. Same precision, same batch, same hardware: only the parameter count changed. The slowdown is fingerprinted on something that scales linearly with weights.
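The weight-footprint arithmetic behind the “doesn’t fit” claim is two lines; a minimal sketch, assuming FP16 weights (2 bytes per parameter) and 80 GB per H100, counting weights only.

```python
# How many 80 GB H100s are needed just to hold the FP16 weights (weights only, no KV cache).
import math

H100_VRAM_GB = 80
BYTES_PER_PARAM_FP16 = 2

for name, params in [("Llama-3-8B", 8e9), ("Llama-3-70B", 70e9), ("Llama-3-405B", 405e9)]:
    weights_gb = params * BYTES_PER_PARAM_FP16 / 1e9
    gpus = math.ceil(weights_gb / H100_VRAM_GB)
    print(f"{name}: {weights_gb:,.0f} GB of weights -> at least {gpus} H100(s)")
# 405B at FP16 is 810 GB of weights: no single H100, and no four-GPU node, can hold it.
```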
Decode TPS scales as 1/N, where N is parameter count. If compute set the speed, the spec-sheet math would put all three models in the thousands of tokens per second; none of them comes close.
03 The GPU does ~0.3 ms of math and waits ~40 ms for weights. §
If compute isn’t the bottleneck, what is? Decode produces one output token by running the entire forward pass. That pass needs every weight in the model. On a GPU, weights live in HBM. Before any matrix multiply happens, the weights have to be streamed onto the chip’s compute units and out the other side. That stream takes time.
decode_step_ms ≈ weight_GB ÷ bandwidth_GB/s + compute_ms · bandwidth dominates · animation slowed 6× · ratio preserved
For Llama-3-70B at FP16 on H100 SXM: compute time is around 0.3 ms. Bandwidth time is around 42 ms. The ratio is more than two orders of magnitude. The GPU’s matrix units are sitting idle for 99.3% of every decode step, waiting for the next batch of weights to arrive. That’s where the seven-thousand vs twenty-five gap goes.
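The split is one line of arithmetic per term; here it is as a minimal sketch, using the H100 SXM figures quoted in this stage and the ideal 2-FLOPs-per-parameter compute term.

```python
# One decode step for Llama-3-70B at FP16 on H100 SXM: bytes streamed vs math done.
PARAMS = 70e9
BYTES_PER_PARAM = 2        # FP16
BANDWIDTH = 3.35e12        # HBM bandwidth, bytes/s
PEAK_FLOPS = 989e12        # FP16 tensor peak, FLOP/s

bandwidth_ms = PARAMS * BYTES_PER_PARAM / BANDWIDTH * 1e3   # time to stream every weight once
compute_ms = 2 * PARAMS / PEAK_FLOPS * 1e3                  # time for the matmul FLOPs at peak

print(f"weight streaming: {bandwidth_ms:.1f} ms")   # ~41.8 ms
print(f"matrix math:      {compute_ms:.2f} ms")     # ~0.14 ms at peak; ~0.3 ms at realistic efficiency
print(f"matrix units idle ~{100 * bandwidth_ms / (bandwidth_ms + compute_ms):.1f}% of the step")
# The ~99.3% idle figure in the text uses the realistic ~0.3 ms compute time.
```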
One decode step is ~0.3 ms of math wrapped in ~40 ms of memory traffic. The matrix units idle while the HBM works.
04 On the roofline, decode lives below the floor. §
The roofline plot is the canonical visualisation. It puts arithmetic intensity (FLOPs per byte) on the X-axis and achieved FLOP/s on the Y-axis. Two ceilings: a sloped line representing the bandwidth limit (you can do at most bandwidth × intensity FLOPs per second) and a flat line representing the compute peak. They cross at the critical intensity, the ratio of compute peak to bandwidth; kernels to the left of the crossover are bandwidth-bound, kernels to the right are compute-bound.
decode I ≈ 1.00 · 295× below the threshold · the GPU's compute ceiling is unreachable from this regime
For an H100 SXM the critical intensity is ~295 FLOPs/byte. Decode at FP16 runs at about 1 FLOP per byte (each weight is 2 bytes; each weight contributes 2 FLOPs to one token’s forward pass). So decode sits almost three hundred times below the bandwidth/compute crossover. The compute ceiling is unreachable from this regime. Even if NVIDIA doubled FP16 throughput tomorrow, decode TPS for a 70B FP16 model would be unchanged — the 989 → 1978 TFLOP/s number is irrelevant when arithmetic intensity is 1.
Switch to FP8 in the picker and the decode point doubles in intensity (same FLOPs, half the bytes streamed). It’s still ~150× below the threshold. Decode is fundamentally a bandwidth regime; quant shifts the constant, not the regime.
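The roofline arithmetic for decode fits in a few lines; a minimal sketch using the H100 SXM FP16 peak and bandwidth quoted above, with the FP16/FP8/INT4 cases side by side.

```python
# Where decode sits relative to the H100 roofline crossover, per quantisation.
PEAK_FLOPS = 989e12   # FP16 tensor peak, FLOP/s (held fixed across rows, as in the text)
BANDWIDTH = 3.35e12   # HBM bandwidth, bytes/s

critical_intensity = PEAK_FLOPS / BANDWIDTH   # ~295 FLOPs/byte: compute/bandwidth crossover

for quant, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    decode_intensity = 2 / bytes_per_param    # 2 FLOPs per param ÷ bytes streamed per param
    print(f"{quant}: decode intensity {decode_intensity:.0f} FLOP/byte, "
          f"{critical_intensity / decode_intensity:.0f}× below the crossover")
# FP16 ~295×, FP8 ~148×, INT4 ~74× below: every case stays deep in the bandwidth regime.
```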
Decode lives at ~1 FLOP/byte. Critical I on H100 is ~295. The compute peak might as well not exist for this kernel; bandwidth is the only ceiling that bites.
05 The decode equation, with no asterisks. §
Now the law, in one line. To produce one output token the model must read every active parameter once. That’s a fixed amount of bytes. The HBM bandwidth tells you how fast bytes move. Time = bytes ÷ bandwidth. Tokens per second is one over that.
step_time = (params × bytes_per_param) ÷ bandwidth — that's the whole law
Three inputs: parameter count (in the model), bytes per parameter (set by quantisation), and bandwidth (set by GPU). Multiply, divide, done. No assumptions about engine; no fudge for kernel scheduling. Vendors and engine teams have spent five years collapsing the gap between this prediction and what real systems hit, and the best implementations are now within ~20% of it. That step_time line is the law everyone is converging towards.
The widget exposes the calculation. Pick a model — 70B becomes 405B and the step time grows exactly proportionally. Pick a quantisation — FP16 to FP8 halves bytes-per-param and the step time halves with it. Pick a GPU — H100 to B200 grows bandwidth by ~2.4× and the step time falls by the same factor. Three knobs, one equation, and the entire decode story falls out.
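The widget’s calculation fits in a few lines; a minimal sketch of the law with the three knobs from the paragraph above, using the bandwidth figures quoted in this stage.

```python
# The decode law and the three knobs: model size, bytes per parameter, bandwidth.
def step_time_ms(params: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    """step_time = (params × bytes_per_param) ÷ bandwidth, in milliseconds."""
    return params * bytes_per_param / (bandwidth_tb_s * 1e12) * 1e3

def decode_tps(params: float, bytes_per_param: float, bandwidth_tb_s: float) -> float:
    return 1000 / step_time_ms(params, bytes_per_param, bandwidth_tb_s)

print(f"baseline   70B  FP16 H100: {decode_tps(70e9, 2, 3.35):5.1f} tok/s")
print(f"model knob 405B FP16 H100: {decode_tps(405e9, 2, 3.35):5.1f} tok/s")  # ~5.8x slower
print(f"quant knob 70B  FP8  H100: {decode_tps(70e9, 1, 3.35):5.1f} tok/s")   # bytes halve, speed doubles
print(f"GPU knob   70B  FP16 B200: {decode_tps(70e9, 2, 8.0):5.1f} tok/s")    # ~2.4x more bandwidth
```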
step_time = (params × bytes_per_param) ÷ bandwidth. decode_tps = 1000 ÷ step_time. That’s it.
06 Four knobs to turn. That's the whole optimisation surface. §
The decode equation has three terms. So the levers available to anyone trying to improve decode speed live in those three terms — plus one more (batch) that applies when there are concurrent users.
live · same library that backs /calculate
- Quantisation halves bytes per parameter. FP16 → FP8 is roughly a 2× decode speedup; FP8 → INT4 is another 2×, with diminishing returns from quality loss past INT4. The widget makes this exact.
- Architecture changes which parameters are active per token. Switch from dense Llama-3-70B to MoE DeepSeek-V3 and the per-user decode rate climbs because only ~37B of the 671B parameters are read per token. Memory cost stays at the full 671B, so the trade-off is “active params (decode speed) vs total params (memory + capacity).” MoE is the only way you’ll ever get a 671B model decoding at 70B speeds.
- GPU is the bandwidth knob. H100 SXM (3.35 TB/s) → H200 (4.8 TB/s) → B200 (8.0 TB/s): the H100 → H200 step is pure bandwidth at identical compute, and bandwidth keeps outpacing compute into B200. This is the lever NVIDIA has been pulling.
- Batch is more subtle. Stream weights once, decode many tokens at once. Per-user TPS is roughly fixed; aggregate TPS scales with batch up to a compute or KV-cache ceiling. We’ll come back to this in Stage 3.
Almost every inference paper of the last five years attacks one of these knobs, sometimes two. “Speculative decoding” is a fifth knob, but it’s a clever workaround on the bandwidth law, not a new lever — it skips reads when the model is highly confident and pays a verification cost. Stage 4 returns to the toolbox.
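To put numbers on the two knobs the previous sketch didn’t cover, architecture and batch, here is a minimal sketch under the same law. The ~37B active-parameter figure for DeepSeek-V3 comes from the list above; the batch model simply assumes one weight stream serves every sequence in the step, and it ignores KV-cache traffic, the compute ceiling, and the fact that 671B of weights needs multi-GPU memory.

```python
# Architecture knob: only the *active* parameters are streamed per token.
def decode_tps(active_params: float, bytes_per_param: float, bandwidth_b_s: float) -> float:
    return bandwidth_b_s / (active_params * bytes_per_param)

H100_BW = 3.35e12  # bytes/s

print(f"dense Llama-3-70B, FP16:       {decode_tps(70e9, 2, H100_BW):5.1f} tok/s per user")
print(f"MoE DeepSeek-V3 (~37B active): {decode_tps(37e9, 2, H100_BW):5.1f} tok/s per user")

# Batch knob: the weights are streamed once per step regardless of batch size,
# so per-user TPS stays roughly flat while aggregate TPS scales with the batch
# (until KV-cache traffic or the compute ceiling bites; Stage 3 covers that).
per_user = decode_tps(70e9, 2, H100_BW)
for batch in (1, 8, 32):
    print(f"batch {batch:2d}: {per_user:4.1f} tok/s per user, {per_user * batch:6.1f} tok/s aggregate")
```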
Quant halves bytes. MoE shrinks active params. Newer GPUs push bandwidth. Batching amortises one weight read across users. Every inference optimisation paper attacks one of these four levers — and they’re all in service of the bandwidth equation.
07 Folklore the equation refuses. §
The decode equation refutes a lot of received wisdom about inference. The two biggest pieces: that a “faster” GPU (more TFLOP/s) makes decode faster, and that quantisation helps because low-precision math is cheaper. Neither survives the equation. A dense FP16 model decodes no faster on a chip with double the compute and the same HBM, and quantisation’s decode win comes from halving the bytes streamed per parameter, not from cheaper arithmetic.
The two together explain why “newer hardware + lower precision” pull in the same direction: both are bandwidth wins. Compute wins exist (prefill, training), but they don’t help the kernel that spends 99% of its time waiting for HBM.
Compute headlines are easy to print. Bandwidth headlines tell the truth about decode.
Synthesis
- Decode is bandwidth-bound. The compute ceiling is inaccessible at decode-realistic arithmetic intensities. Every modern GPU sits at a critical intensity ~75–300× above what decode runs (the range maps to INT4 → FP16; FP8 sits in the middle at ~150×).
- The decode equation: step_time = (params × bytes_per_param) ÷ bandwidth; decode_tps = 1000 ÷ step_time. Three inputs; one law.
- Four levers exist: quantisation, architecture (active params), GPU (bandwidth), and batch (amortising one read across users). Every inference paper of the last five years attacks one of these.
- “Faster GPU” is misleading; “higher-bandwidth GPU” predicts decode TPS. The H100 → H200 → B200 trajectory grows bandwidth far faster than it grows compute.
- Quantisation gives ~2× per halving of bytes-per-param, with diminishing quality returns past INT4 and a smaller effect on the dense parts of MoE models.
What's next
Stage 2 ended with the four-knob model of decode TPS and a hard rule: bandwidth wins. Stage 3 picks up the second decode constraint — KV cache. It’s the reason a model that fits in VRAM can still fail to serve concurrent users, and it makes the difference between a deployment that scales and one that hits a capacity cliff at the wrong moment.