Stage 4.2
Compute levers — MoE, speculative decoding
Part of the Stage 4 hub. Stage 4.1 (memory levers) is recommended background but not required.
Memory levers shrink the constant in front of the decode equation. Compute levers move a different term: how much work the model does per token. There are only two meaningful answers in the literature — Mixture of Experts and speculative decoding. Both ship in production; both pay rent in different currencies.
01 MoE: active params, not total. §
A dense 70B model reads all 70 billion parameters from HBM on every decode step. A Mixture-of-Experts model routes each token through only a few experts, so each step reads only the active parameters, a fraction of the total.
| Model | Total | Active | Weights (FP8) | Decode (per user) |
|---|---|---|---|---|
| Llama-3-70B (dense) | 70 B | 70 B | ~70 GB | baseline |
| Mixtral 8×22B | 141 B | 39 B | ~141 GB | ≈ baseline |
| DeepSeek-V3 | 671 B | 37 B | ~671 GB | ≈ baseline |
Mixtral and DeepSeek run at near-Llama-3-70B decode rates while serving 2× / ~10× the knowledge in memory · the price is the Total column, paid in VRAM
Read across the table: Mixtral 8×22B at 39B active decodes within a few percent of dense Llama-3-70B’s per-user TPS, but it carries 141B of total weights. DeepSeek-V3 at 37B active decodes at near-Llama-3-70B speeds while serving 671B of effective parameters from memory. The trade-off is bought in VRAM and paid back in either knowledge density (more parameters → more capability per token) or higher per-user TPS for a given knowledge budget.
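To make the decoupling concrete, here is the table’s arithmetic in a few lines of Python, under the single assumption that FP8 stores one byte per parameter. The memory bill scales with the Total column; the per-token bandwidth bill scales with the Active column:

```python
# Two separate bills: the memory bill (total bytes resident in VRAM)
# and the bandwidth bill (active bytes streamed per decoded token).
# Assumption: FP8 weights, 1 byte per parameter.
GB = 1e9

models = {  # (total params, active params), from the table above
    "Llama-3-70B (dense)": (70e9, 70e9),
    "Mixtral 8x22B":       (141e9, 39e9),
    "DeepSeek-V3":         (671e9, 37e9),
}

for name, (total, active) in models.items():
    print(f"{name:22s} VRAM ~{total / GB:4.0f} GB | "
          f"read/token ~{active / GB:3.0f} GB")
```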
The routing layer is the trick. Each token’s hidden state goes through a small router that picks the top-K experts; only those experts’ weights need to be streamed. Done well, the router barely costs anything; done poorly, routing collapses onto a few hot experts and the gain is lost to load imbalance. Open-weight MoE models all use top-2 or top-8 routing with auxiliary load-balancing losses during training.
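A minimal sketch of that routing step, with NumPy standing in for the real kernel; the names (`top_k_route`, `w_router`) and toy shapes are illustrative, not from any particular codebase:

```python
import numpy as np

def top_k_route(h, w_router, k=2):
    """Pick the top-k experts for one token's hidden state.

    Sketch of the router described above; real routers add
    load-balancing losses at training time and capacity limits at
    inference. Shapes: h is (d,), w_router is (d, n_experts).
    """
    logits = h @ w_router                 # one tiny matmul per token
    top = np.argsort(logits)[-k:][::-1]   # indices of the k largest
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                  # softmax over the chosen k
    return top, gates                     # only these experts stream

# Toy usage: 8 experts, top-2 routing, Mixtral-style. The layer's
# output would be sum(gates[i] * expert[top[i]](h)).
rng = np.random.default_rng(0)
h = rng.standard_normal(16)
w_router = rng.standard_normal((16, 8))
experts, gates = top_k_route(h, w_router, k=2)
print(experts, gates)   # e.g. [5 2] [0.7 0.3] -> stream 2 of 8 experts
```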
MoE decouples active params (the bandwidth bill) from total params (the memory bill). Production MoE deployments are bandwidth-comparable to a dense model with the same active count, and capability-comparable to a dense model with the same total count.
02 Deploying MoE: expert parallelism, expert offload. §
The catch is that a 671B model doesn’t fit on one GPU. Production MoE deployments use one of three topologies: tensor parallelism across the entire model (each GPU holds a slice of every expert; works, but wastes VRAM because expert sparsity isn’t exploited), expert parallelism (EP: each GPU owns a subset of experts, and tokens are routed between GPUs to reach them), or expert offload (experts live in CPU RAM and are streamed to the GPU on demand; slow, but it fits on consumer hardware).
EP is the default at scale because it preserves the bandwidth advantage. Each forward pass incurs an all-to-all collective for token routing and another to gather expert outputs; within an NVLink domain (8 GPUs, intra-node) the comms cost is small. Beyond that, EP gets harder fast, which is one of the reasons DeepSeek-V3 specifically targets 8-GPU H100 nodes.
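A toy sketch of the dispatch half of that first all-to-all, assuming a hypothetical sharding of 64 experts over 8 GPUs: bucket each routed token by the GPU that owns its expert, and compare a balanced router against a collapsed one.

```python
import numpy as np

def dispatch_counts(expert_ids, n_gpus, experts_per_gpu):
    """How many tokens each GPU receives in the routing all-to-all.

    Sketch of the EP dispatch step: tokens are bucketed by which GPU
    owns their routed expert. Balanced routing sends ~equal counts;
    skewed routing leaves most GPUs idle while a few do all the work.
    """
    owner = expert_ids // experts_per_gpu        # expert -> owning GPU
    return np.bincount(owner.ravel(), minlength=n_gpus)

rng = np.random.default_rng(0)
# 4096 tokens, top-2 routing over 64 experts sharded across 8 GPUs.
routed = rng.integers(0, 64, size=(4096, 2))     # well-balanced router
print(dispatch_counts(routed, n_gpus=8, experts_per_gpu=8))

skewed = rng.integers(0, 16, size=(4096, 2))     # collapses onto GPUs 0-1
print(dispatch_counts(skewed, n_gpus=8, experts_per_gpu=8))
```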
EP is the right MoE topology in NVLink fabrics, paired with FP8 weights. Expert offload is the consumer-grade fallback; TP-everything is the simple but wasteful option.
03 Speculative decoding: verify K tokens at once. §
The decode equation says the GPU does ~1 ms of math wrapped in ~40 ms of memory traffic. Speculative decoding asks: what if that 40 ms weight read produced more than one token? The mechanism: a small draft model cheaply proposes the next K tokens, and the target model verifies all K in a single forward pass, accepting the longest prefix it agrees with.
The acceptance rate matters enormously. With a well-aligned draft (same family, similar training), production speculative decoding achieves 60–80% acceptance at K=4, meaning effective decode TPS rises 2–3×. Without alignment, the draft contradicts the target often enough that the verification cost outweighs the savings.
The cost is K small draft forward passes plus one full verification pass per group of proposed tokens. The draft is small and fast; the verification is one normal forward pass. Net win: the bandwidth of one weight read amortised across every accepted token.
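A greedy-variant sketch of one draft-then-verify round; `target` and `draft` are hypothetical stand-in functions mapping a token sequence to the next token id, and a real engine would run the verification as a single batched forward pass rather than a Python loop:

```python
import numpy as np

def speculate_step(target, draft, prefix, k=4):
    """One draft-then-verify round (greedy variant)."""
    proposed = []
    seq = list(prefix)
    for _ in range(k):             # k cheap draft steps
        t = draft(seq)
        proposed.append(t)
        seq.append(t)

    accepted = []
    seq = list(prefix)
    for t in proposed:             # verify: compare position by position
        expect = target(seq)
        if t != expect:            # first mismatch: take the target's
            accepted.append(expect)  # token instead and stop
            break
        accepted.append(t)
        seq.append(t)
    else:                          # all k accepted: the (k+1)-th is free
        accepted.append(target(seq))
    return accepted                # 1 to k+1 tokens per weight read

# Toy models: the target counts up; the draft agrees ~80% of the time.
rng = np.random.default_rng(0)
target = lambda seq: (seq[-1] + 1) % 100
draft = lambda seq: (seq[-1] + 1) % 100 if rng.random() < 0.8 else 0
print(speculate_step(target, draft, prefix=[1, 2, 3], k=4))
```

Every call returns at least one token (the target’s own choice at the first mismatch), so the worst case degrades to ordinary decoding plus wasted draft compute, never below it.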
Speculative decoding turns one weight read into 1–8 tokens. The savings are bandwidth, paid in extra compute (the draft + the verify). Aligned drafts make this worth ~2–3× decode TPS in production.
04 When compute levers actually pay. §
MoE pays whenever the binding constraint is per-user TPS or knowledge density. It does not help capacity (the KV cache is still computed for every token; MoE doesn’t change that). Speculative decoding pays whenever there’s room in compute (low batch, single user, latency- sensitive workloads); at high batch the GPU is already compute-bound on the verify pass and the savings shrink.
A 32-user concurrent dense Llama-3-70B deployment is in the bandwidth regime per stream and the compute regime per node. Speculative decoding contributes very little here — the bandwidth read is already amortised across 32 users by batching. A 1-user, 70B latency-bound chat is where speculative decoding shines: 2–3× per-user TPS at no extra hardware cost.
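A toy roofline using the section’s own numbers (~40 ms weight read, ~1 ms of math per stream, both assumptions) shows why the savings shrink with batch size:

```python
# Time per decode step is max(weight-read time, matmul time): the read
# is fixed, the math grows with batch (and with K in a verify pass).
WEIGHT_READ_MS = 40.0    # assumed: every step streams the weights once
MATH_MS_PER_SEQ = 1.0    # assumed: ~1 ms of math per active stream

for batch in (1, 8, 32, 128):
    step_ms = max(WEIGHT_READ_MS, batch * MATH_MS_PER_SEQ)
    per_user_tps = 1000.0 / step_ms
    print(f"batch {batch:3d}: step {step_ms:5.1f} ms, "
          f"{per_user_tps:5.1f} tok/s per user, "
          f"{batch * per_user_tps:6.1f} tok/s aggregate")
```

At batch 32 the math (~32 ms) is already close to the weight read (~40 ms); multiplying it by K on the verify pass tips the node past the knee, which is why speculative decoding buys little there.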
Compute levers help when bandwidth is amortised by other means or when compute is sitting idle. They don’t stack with batching the way memory levers do.
05 Folklore the equation refuses. §
What's next
Memory and compute levers are mostly architectural. The third toolbox page covers system levers — what your serving engine and topology choices buy you on top of any model.