RMSNorm
Rescale each token's hidden vector by its own RMS — element-wise, near-free.
Each transformer layer pre-norms its inputs to keep activations from drifting in scale: divide by RMS(x) (= √(Σx² / d)), multiply by a learned per-feature scale γ. Llama uses RMSNorm (no mean subtraction); GPT-style models use LayerNorm (subtract mean too).
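A minimal PyTorch sketch of that operation (the class name, eps value, and shapes here are illustrative, not taken from any particular codebase):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Llama-style RMSNorm: x / RMS(x) * gamma, with no mean subtraction."""

    def __init__(self, hidden_dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(hidden_dim))  # learned per-feature scale
        self.eps = eps  # small stability constant (value is illustrative)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # RMS(x) = sqrt(mean(x^2)) over the hidden dimension, computed per token
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.gamma
        # GPT-style LayerNorm would additionally subtract the per-token mean
        # and add a learned bias.

x = torch.randn(1, 16, 4096)   # (batch, tokens, hidden_dim)
y = RMSNorm(4096)(x)           # same shape as x, rescaled per token
```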
The math is element-wise: ~4d FLOPs per token (square each element, sum, one inverse square root, then scale each element by the inverse RMS and by γ), plus one HBM read of γ, which is only hidden_dim parameters. Compared to a hidden×hidden matmul it's effectively free.
There are two of these per transformer block (one before attention, one before the MLP) plus a final norm before the LM head. Llama-3-70B runs 161 norm operations per token (80 layers × 2, plus the final norm), totaling about 1.3 million parameters, negligible next to 70 billion.
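A quick back-of-the-envelope in plain Python, assuming Llama-3-70B's published shapes (80 layers, hidden size 8192) and a 2-byte dtype for γ:

```python
hidden_dim = 8192   # Llama-3-70B hidden size (assumption for this example)
num_layers = 80     # Llama-3-70B transformer blocks

# One norm, one token: square + sum per element, an inverse sqrt,
# then a scale by the inverse RMS and by gamma  ->  ~4 * hidden_dim FLOPs.
flops_per_norm = 4 * hidden_dim             # ~33 KFLOPs
gamma_read_bytes = 2 * hidden_dim           # one HBM read of gamma in bf16

# For contrast: a single hidden x hidden matmul per token.
matmul_flops = 2 * hidden_dim * hidden_dim  # ~134 MFLOPs, roughly 4000x more

norms_per_token = 2 * num_layers + 1        # 161: two per block + final norm
norm_params = norms_per_token * hidden_dim  # ~1.3M parameters in total

print(flops_per_norm, gamma_read_bytes, matmul_flops, norms_per_token, norm_params)
```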
Try it: this phase's cost stays near zero in every scenario, even when you change quantization or sequence length.
Try it in the calculator