Stage 4

Which lever do you reach for?

Builds on Stages 1–3. Bring the decode equation and the capacity equation with you.

By Stage 3 you know the law: decode_tps = bandwidth ÷ (params × bytes_per_param); capacity holds while free_vram ÷ (kv_per_token × context × concurrent) stays above 1; and the binding constraint is whichever runs out first. Every published optimisation attacks one variable in one of those equations. This stage organises the toolbox by which variable each lever moves.
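As a refresher, here are both equations as runnable code. A minimal sketch: the function names and the concrete figures (3,350 GB/s of bandwidth, a 70B dense model at FP16, 160 KB of KV per token) are illustrative assumptions, not outputs of the actual calculator.

```python
# The two Stage 3 equations. All concrete numbers below
# (bandwidth, model size, KV bytes per token) are illustrative.

GB = 1e9

def decode_tps(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    """Tokens/sec per user: each decode step streams every weight once."""
    return (bandwidth_gb_s * GB) / (params_b * GB * bytes_per_param)

def capacity_holds(free_vram_gb: float, kv_bytes_per_token: float,
                   context: int, concurrent: int) -> bool:
    """Capacity holds while the KV cache for all users fits in free VRAM."""
    return kv_bytes_per_token * context * concurrent <= free_vram_gb * GB

print(f"{decode_tps(3350, 70, 2):.1f} tok/s")   # 70B dense at FP16: ~23.9
print(capacity_holds(20, 160_000, 8192, 8))     # 8 users at 8k context: True
```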

01 Three regions, three pages.

Levers fall into three groups. They correspond to the three sources of inference cost:

  • Memory levers reduce bytes-per-decode-step. Less to stream means each step finishes faster. The category is dominated by quantisation, attention variants (GQA / MLA), and KV-cache compression. Stage 3 already touched the last; the Memory levers page goes deep.
  • Compute levers reduce work per token. The two big ones are MoE (active params, not total) and speculative decoding (verify many tokens at once). Both are architectural / algorithmic; neither shows up as a free runtime switch. The Compute levers page covers them.
  • System levers raise utilisation across users. Continuous batching, paged attention, prefix caching, parallelism choice (TP / PP / EP). The same hardware can serve 2–4× the aggregate throughput depending on which engine and topology you run. The System levers page is the runtime-engineer’s page.

A naive deployment that ignores all three can leave 3–10× of effective throughput on the table. Stacked together, the levers compound, but only when the binding constraint is correctly identified first.
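To make the compounding concrete, here is a hypothetical before/after in the decode equation. A sketch only: the lever effects (INT4 weights, a MoE with 17B active parameters) are assumed figures chosen for illustration; real gains depend on the quality trade-offs each sub-page covers.

```python
# Each lever moves exactly one variable; the effects multiply.
# All figures are illustrative assumptions, not benchmarks.

def decode_tps(bandwidth_gb_s, active_params_b, bytes_per_param):
    return bandwidth_gb_s / (active_params_b * bytes_per_param)  # GB/s ÷ GB per step

naive = decode_tps(3350, 70, 2.0)   # dense FP16 baseline         ~23.9 tok/s
quant = decode_tps(3350, 70, 0.5)   # memory lever: INT4 weights  ~95.7 tok/s (4.0x)
moe   = decode_tps(3350, 17, 0.5)   # compute lever: 17B active   ~394 tok/s (16.5x)

# System levers leave per-user decode speed alone; they raise how many such
# streams the same hardware sustains at once (aggregate, not per-user, TPS).
```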

Three regions, mapped to three pages. Pick the one matching your binding constraint, or read all three for the full toolbox.

02 A short decision tree.

Before opening a sub-page, name your binding constraint. The fastest way to pick is the same diagnostic you’d run on an actual deployment.

diagnostic

Q1. Does the model fit on one GPU?

no → System levers (parallelism). Then memory levers (quant) to make it fit on fewer.

yes → keep going.

Q2. At target concurrency, does capacity hold?

no → Memory levers (KV quant, attention variants). Then system levers (paged attention, prefix caching) for fragmentation wins.

yes → keep going.

Q3. Per-user TPS too low?

bandwidth-bound (almost always) → Memory levers (weight quant to halve bytes per param) and compute levers (MoE for fewer active params per token).

no → keep going.

Q4. Aggregate TPS too low?

System levers (continuous batching, the right engine, prefix caching).
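The same tree, encoded as a function for readers who prefer it executable. A sketch: pick_lever and its boolean inputs are hypothetical names for illustration, not part of the calculator's API.

```python
# The diagnostic above as code: answer each question with a boolean,
# get the lever family back.

def pick_lever(fits_one_gpu: bool, capacity_holds: bool,
               per_user_tps_ok: bool, aggregate_tps_ok: bool) -> str:
    if not fits_one_gpu:        # Q1
        return "system (parallelism), then memory (quant) to fit on fewer GPUs"
    if not capacity_holds:      # Q2
        return "memory (KV quant, attention variants), then system (paged attention, prefix caching)"
    if not per_user_tps_ok:     # Q3, assuming bandwidth-bound
        return "memory (weight quant) + compute (MoE): fewer bytes per decode step"
    if not aggregate_tps_ok:    # Q4
        return "system (continuous batching, engine choice, prefix caching)"
    return "no binding constraint; measure before touching anything"

print(pick_lever(fits_one_gpu=True, capacity_holds=True,
                 per_user_tps_ok=False, aggregate_tps_ok=True))
```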

Identify the binding constraint first. Reaching for the wrong lever is a tax — slower deployment cycles for no measurable improvement.

03 How to read the next three pages.

Each sub-page follows the same pattern: state the lever, locate it in the equation, walk a worked example with the calculator, name the trade-off, and end with one falsify-a-claim callout. They’re independent — read them in any order, or just the one matching your binding constraint.

The widgets on each page use the same library that backs /calculate, so any number you see in prose can be reproduced (and modified) by clicking through to the calculator via the “open this scenario” link at the end of each page.

Memory, compute, system. The toolbox in three pages, with the same mental model carried forward. Stage 5 puts it all together by walking a real deployment design end-to-end.

What's next

Three sub-pages follow: Memory levers (quant, attention variants, KV compression), Compute levers (MoE, speculative decoding), and System levers (parallelism, batching, engines). Stage 5 then walks a full deployment design start to finish.
