QKV projection
From the hidden state, three parallel projections produce Q (queries), K (keys), and V (values).
Each transformer block reads the hidden vector from the residual stream and projects it through three weight matrices, W_Q, W_K, and W_V, to produce queries Q (what each token looks for), keys K (what each token offers), and values V (what each token contributes when matched).
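To make the shapes concrete, here is a minimal PyTorch sketch of the dense case, where all three projections map d_model to d_model; the dimensions are illustrative, not tied to any particular model.

```python
import torch
import torch.nn as nn

d_model = 4096  # hidden width of the residual stream (illustrative)

# Three parallel projections read the same hidden vector.
W_Q = nn.Linear(d_model, d_model, bias=False)  # what each token looks for
W_K = nn.Linear(d_model, d_model, bias=False)  # what each token offers
W_V = nn.Linear(d_model, d_model, bias=False)  # what each token contributes

hidden = torch.randn(1, 16, d_model)           # (batch, seq_len, d_model)
q, k, v = W_Q(hidden), W_K(hidden), W_V(hidden)
print(q.shape, k.shape, v.shape)               # all (1, 16, 4096) in dense MHA
```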
In dense multi-head attention (MHA) all three matrices have the same shape. In Grouped Query Attention (GQA; used by Llama-3 and Qwen-2.5), Q keeps full width while several query heads share each K/V head: Llama-3-70B uses 8 KV heads against 64 Q heads, so W_K and W_V are 1/8 the size of W_Q. That asymmetry shrinks the KV cache by the same factor, which is what makes long-context decode tractable.
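The same sketch with the GQA asymmetry, plugging in Llama-3-70B's published head counts (d_model = 8192, 64 query heads, 8 KV heads, head_dim = 128); the KV-cache arithmetic in the comments follows directly from these shapes.

```python
import torch.nn as nn

d_model    = 8192                  # Llama-3-70B hidden width
n_q_heads  = 64                    # query heads
n_kv_heads = 8                     # shared key/value heads
head_dim   = d_model // n_q_heads  # 128

W_Q = nn.Linear(d_model, n_q_heads * head_dim, bias=False)   # 8192 -> 8192
W_K = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # 8192 -> 1024, 1/8 of W_Q
W_V = nn.Linear(d_model, n_kv_heads * head_dim, bias=False)  # 8192 -> 1024

# The KV cache stores only K and V, so it shrinks by the same 8x factor:
# 2 * n_kv_heads * head_dim = 2048 cached values per token per layer,
# instead of 2 * n_q_heads * head_dim = 16384 under dense MHA.
```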
Try it: switch to a model with a different num_kv_heads and watch the K/V meters shrink.