Up projection
The hidden state expands ~3.5× via two parallel matmuls: gate and up.
The SwiGLU FFN, the dominant FFN family in current LLMs, has two parallel up projections: W_gate produces gate(x) and W_up produces up(x), each lifting hidden_dim (8192 in Llama-3 70B) to intermediate_dim (28,672 = 3.5× hidden). Older GLU-free FFNs had a single up projection with intermediate_dim = 4 × hidden.
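A minimal sketch of this forward pass, assuming Llama-3 70B's shapes; the weight names and random values are illustrative, and the dimensions are shrunk 1024× so the snippet runs anywhere:

```python
# Minimal SwiGLU FFN sketch. Real Llama-3 70B uses hidden_dim=8192 and
# intermediate_dim=28672; stand-in dims below keep the same 3.5x ratio.
import numpy as np

hidden_dim, intermediate_dim = 8, 28  # stand-ins for 8192 and 28672

rng = np.random.default_rng(0)
w_gate = rng.standard_normal((hidden_dim, intermediate_dim))  # first up projection
w_up = rng.standard_normal((hidden_dim, intermediate_dim))    # second up projection
w_down = rng.standard_normal((intermediate_dim, hidden_dim))  # back to hidden

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def swiglu_ffn(x):
    gate = x @ w_gate                  # (..., intermediate_dim)
    up = x @ w_up                      # (..., intermediate_dim), in parallel
    return (silu(gate) * up) @ w_down  # elementwise combine, then down project

x = rng.standard_normal((1, hidden_dim))
print(swiglu_ffn(x).shape)  # (1, 8)
```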
Per-step bandwidth dwarfs anything in attention: each of W_gate and W_up holds ~3.5× the parameters of W_Q (28,672 vs 8192 output columns). Summed over all 80 layers, the pair reads ~37 GB of weight bytes per token at fp8, the largest weight read in the model.
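These figures are easy to re-derive. A back-of-envelope check, assuming Llama-3 70B's 80 layers and 1 byte per parameter at fp8 (0.5 at int4, which also grounds the halving mentioned below):

```python
# Re-deriving the ratios, assuming Llama-3 70B: hidden=8192,
# intermediate=28672, 80 layers, 1 byte/param at fp8.
hidden, intermediate, layers = 8192, 28672, 80

up_proj_params = hidden * intermediate  # ~235M params each for W_gate and W_up
w_q_params = hidden * hidden            # ~67M params for W_Q
print(up_proj_params / w_q_params)      # 3.5x W_Q per projection

for quant, bytes_per_param in (("fp8", 1.0), ("int4", 0.5)):
    gb = 2 * up_proj_params * layers * bytes_per_param / 1e9  # gate + up, all layers
    print(f"{quant}: {gb:.1f} GB read per token")             # fp8 37.6, int4 18.8
```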
The "free" activation (silu(gate) ⊙ up) follows; it costs nothing relative to the matmuls that bracket it.
Try it: change quant from fp8 to int4 and watch these meters halve again.