Section 4: FFN

Where the parameters live. Gate / up / down projections own ~70 % of weight bytes in a dense block; MoE makes it conditional.

hidden → intermediate (×2): W_gate, W_up

Up projection

Hidden expands ~3.5× via two parallel matmuls — gate and up.

Params: 18.79 B
FLOPs: 37.6 B
Bytes/step: 18.8 GB

SwiGLU FFN — the dominant FFN family — has two parallel up projections. W_gate produces gate(x), W_up produces up(x), both lifting hidden_dim (8192 in Llama-3-70B) up to intermediate_dim (28,672 ≈ 3.5× hidden). Older GLU-free FFNs had only one up projection of intermediate_dim = 4 × hidden.

Per-step bandwidth dwarfs anything in attention: each of W_gate and W_up is ~3.5× the size of W_Q. Together they read ~37 GB of weight bytes per token at fp8, the largest weight read of any phase in the model.

The "free" activation (silu(gate) ⊙ up) follows; it costs nothing relative to the matmuls that bracket it.

Try it: change quant from fp8 to int4 and watch these meters halve again.

silu(gate) ⊙ up

Activation (SwiGLU)

silu(gate) ⊙ up — element-wise product, no weights, ~free.

FLOPs: 0

Modern FFNs use a *gated* activation: take the two parallel outputs gate(x) and up(x), apply a non-linearity (SiLU = x · σ(x)) to gate, then multiply element-wise by up. The result feeds the down projection.

This is purely element-wise: ~3 × intermediate_dim FLOPs per token, no weight matrix, no HBM weight reads. Compared to the up and down matmuls that bracket it, it's effectively zero cost.

The architectural reason gated FFNs win: the multiplicative interaction between gate and up gives the network selective per-feature gating, which trains better than a non-gated GeLU/ReLU at the same parameter count. Cost-wise the difference is invisible — it's just two matmuls in front instead of one.
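A toy NumPy sketch of the full gated block, with dimensions shrunk so it runs instantly (the real Llama-3-70B shapes are 8192 and 28,672):

```python
import numpy as np

# Toy-sized sketch of the gated FFN activation. Names follow the text above;
# dimensions are shrunk purely for readability.
hidden_dim, intermediate_dim = 8, 28
rng = np.random.default_rng(0)

x = rng.standard_normal(hidden_dim)
W_gate = rng.standard_normal((hidden_dim, intermediate_dim)) * 0.1
W_up = rng.standard_normal((hidden_dim, intermediate_dim)) * 0.1

def silu(z):
    return z / (1.0 + np.exp(-z))   # SiLU(x) = x * sigmoid(x)

gate = x @ W_gate        # matmul: weight reads dominate the cost
up = x @ W_up            # matmul: same size, same cost
h = silu(gate) * up      # element-wise gate: no weights, ~free
# h then feeds W_down, mapping intermediate_dim back to hidden_dim.
```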

Try it: this phase will stay near zero across every scenario.

intermediate → hidden: W_down

Down projection

Wide intermediate compresses back to hidden — third FFN matmul, same size.

Params: 18.79 B
FLOPs: 37.6 B
Bytes/step: 18.8 GB

After the gated activation, the FFN sublayer needs to come back down to hidden_dim before adding into the residual stream. W_down is shape [intermediate × hidden] = [28,672 × 8,192] for Llama-3-70B — same parameter count and same FLOPs as either of the up projections.

Per-token decode bandwidth is ~18 GB at fp8, the third of three FFN matmuls. With gate, up, and down combined, the FFN sublayer reads ~55 GB per decode step — about 4× the entire attention sublayer's weight cost. This is why decode is bandwidth-bound for any modern model: the FFN dominates.
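A rough roofline sketch of that claim, assuming the same Llama-3-70B shapes, 80 layers, fp8 weights, and an illustrative ~3.35 TB/s of HBM bandwidth (the bandwidth figure is an assumption, not something the card specifies):

```python
# Rough decode roofline for the whole FFN sublayer.
hidden_dim, intermediate_dim, n_layers = 8192, 28672, 80

ffn_params = 3 * hidden_dim * intermediate_dim * n_layers   # gate + up + down
ffn_bytes_per_step = ffn_params * 1                         # fp8: 1 byte per weight

hbm_bytes_per_s = 3.35e12                                   # assumed HBM bandwidth
floor_ms = ffn_bytes_per_step / hbm_bytes_per_s * 1e3

print(f"FFN weight read per decode step: {ffn_bytes_per_step / 1e9:.0f} GB")  # ~56 GB
print(f"bandwidth floor (FFN alone):     {floor_ms:.1f} ms/token")            # ~17 ms
```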

Try it: this phase tracks the up phases exactly. Together they form the model's biggest weight read on every step.

router → top-K experts

Mixture of Experts

Router picks top-K of N experts per token — active params shrink to K/N.

Active params: 0
Inactive (VRAM): 0

Mixture-of-Experts replaces the single dense FFN with N parallel expert FFNs (typically 8 to 256) plus a small router that scores experts per token. Only the top-K experts execute for any given token (K = 2 for Mixtral; K = 8 of 256 fine-grained experts for DeepSeek-V3).

The trade: total params explode (Mixtral-8x7B has 47B total but only ~13B active per token; DeepSeek-V3 has 671B total but only ~37B active). Per-token compute and bandwidth stay sparse — *if* the router is balanced, which it isn't always at inference time.

The router itself is a tiny matmul (hidden × N), and shared experts (DeepSeek style) run on every token to provide a stable baseline. The "inactive" experts still consume VRAM though — total params bound your model size.
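A toy NumPy sketch of the routing step, assuming a Mixtral-like top-2-of-8 layout with a shrunken hidden_dim for readability:

```python
import numpy as np

# Toy router sketch. The router weight really is this small relative to the
# experts: just [hidden_dim x n_experts].
hidden_dim, n_experts, top_k = 16, 8, 2
rng = np.random.default_rng(0)

x = rng.standard_normal(hidden_dim)
W_router = rng.standard_normal((hidden_dim, n_experts))    # tiny matmul

logits = x @ W_router
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                       # softmax over experts

chosen = np.argsort(probs)[-top_k:]                        # indices of the top-K experts
weights = probs[chosen] / probs[chosen].sum()              # renormalized mixing weights

# Only these K expert FFNs execute for this token; the other N-K experts
# still sit in VRAM, which is why total params bound model size.
print("experts chosen:", chosen, "with weights", np.round(weights, 3))
```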

Try it: switch to mixtral-8x7b or deepseek-v3 to see the MoE-specific phases populate.
