Section 2

Norm

RMSNorm and LayerNorm: tiny per-token rescalings that fire twice per layer. Cheap on their own, cumulative across the stack.

x → x · γ / RMS(x)

RMSNorm

Rescale each token's hidden vector by its own RMS — element-wise, near-free.

Params: 655.4 K
FLOPs: 0

Each transformer layer pre-norms its inputs to keep activations from drifting in scale: divide by RMS(x) (= √(Σx² / d)), multiply by a learned per-feature scale γ. Llama uses RMSNorm (no mean subtraction); GPT-style models use LayerNorm (subtract mean too).

The math is element-wise: ~3d FLOPs per token (one square per element, one sum, one inverse square root, one multiply) and one HBM read of γ, which is just hidden_dim parameters. Compared to a hidden×hidden matmul, it's effectively free.
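The operation above can be sketched in a few lines of NumPy — a minimal illustration, not any particular framework's implementation. The `eps` term is a standard numerical-stability guard; the hidden dimension of 8192 is Llama-3-70B's width.

```python
import numpy as np

def rmsnorm(x: np.ndarray, gamma: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # RMS(x) = sqrt(mean(x^2)): one square per element, one sum, one
    # inverse sqrt, then an element-wise multiply by the learned scale
    # gamma -- roughly 3d FLOPs per token, all element-wise.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

# One token's hidden vector at Llama-3-70B's width.
hidden_dim = 8192
x = np.random.randn(hidden_dim).astype(np.float32)
gamma = np.ones(hidden_dim, dtype=np.float32)  # learned in a real model

y = rmsnorm(x, gamma)
```

With γ = 1, the output vector has RMS ≈ 1 regardless of the input's scale — that is the whole job: pin each token's activation magnitude before the expensive matmuls see it.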

There are two of these per transformer block (pre-attention, post-attention) plus a final norm before the LM head — Llama-3-70B has 161 norm operations per token, totaling well under a million parameters.

Try it: this phase will stay near zero across every scenario, even when you change quantization or sequence length.

Try it in the calculator