Section 1

Vocabulary

The model's text↔vector edge. Token IDs, embedding lookups, and the LM head — every operator that interacts with the vocab dimension.

[Diagram: text → ids · 1857 2917 3589 18549 423 9106 · vocab V = 128k · 0 FLOPs · pure dispatch]

Token IDs

Text becomes integers from a fixed vocabulary; no compute, no weight bytes.

FLOPs: 0 · Bytes/step: 0 B

The model never sees raw text. The tokenizer maps a string into integer IDs from a fixed vocabulary V (e.g. 128,256 subwords for Llama-3). Each token is a single int32 — four bytes — regardless of model size. **The work is zero**: pure dispatch into a lookup table.

But the *vocabulary size* you choose here echoes through the rest of the model. Embedding parameters scale as V × hidden_dim, and the LM head re-uses the same V at the end. A larger vocabulary means each token covers more text — fewer tokens per request, shorter sequences — at a constant per-layer cost. Pick V before architecture; everything downstream is sized against it.
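
The arithmetic fits in a back-of-envelope sketch; the constants are the Llama-3 figures quoted above, and the example IDs come from the diagram, not a real tokenizer:

```python
# Token IDs are free; the vocabulary size V is what echoes downstream.
# Constants are the Llama-3 figures quoted above (V = 128,256; hidden = 8,192).
V, hidden = 128_256, 8_192

ids = [1857, 2917, 3589, 18549, 423, 9106]  # illustrative IDs, not a real tokenization
print(f"payload: {len(ids) * 4} bytes (one int32 each), 0 FLOPs")

# V sizes both ends of the model: embedding and (untied) LM head are V x hidden.
print(f"embedding params: {V * hidden / 1e9:.2f} B")  # ~1.05 B
print(f"LM head params:   {V * hidden / 1e9:.2f} B")  # same shape when untied
```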

Try it in the calculator: change *quantization* and watch how embedding bytes scale, while Token IDs stays at zero.

[Diagram: id → embedding · id 1857 selects 1 row of the V × hidden table (1 of 128k) · ~8 KB / step]

Embedding table

A token ID picks one row from a V × hidden table — only that row is read.

Params: 1.05 B · FLOPs: 0

The embedding table is a learned matrix of shape [V × hidden_dim]. Each row is the dense vector for one token ID — Llama-3 stores 128,256 rows of 8,192 floats, roughly 1B parameters of pure lookup table.

At decode, the model reads exactly one row per token: ~8 KB at fp8, ~16 KB at fp16. That's nothing on a modern GPU: embedding is one of the cheapest decode phases, even though the table holds ~1B of the model's parameters.

The cost shows up in *VRAM*, not bandwidth: the entire table has to live in memory, and a 1B-param table at fp8 is 1 GB resident. Larger vocabularies trade VRAM for shorter sequences.
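
A minimal sketch of that decode-time arithmetic, again assuming V = 128,256 and hidden_dim = 8,192; the toy array stands in for the real table:

```python
import numpy as np

V, hidden = 128_256, 8_192

# Per-step read is one row; resident cost is the whole table.
for fmt, bytes_per_param in (("fp8", 1), ("fp16", 2)):
    row_kb = hidden * bytes_per_param / 1024
    table_gb = V * hidden * bytes_per_param / 2**30
    print(f"{fmt}: {row_kb:.0f} KB read/step, {table_gb:.2f} GB resident")

# The lookup itself is indexing, not math: token ID -> one row of [V x hidden].
toy_table = np.arange(12, dtype=np.float16).reshape(4, 3)  # tiny stand-in
print(toy_table[2])  # "token ID 2" reads exactly one row
```

At fp8 that prints ~8 KB per step against ~0.98 GB resident, which is the bandwidth-vs-VRAM trade described above.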

Try it in the calculator: change *quantization* and watch embedding bytes scale linearly with bits-per-param.

[Diagram: hidden → logits · hidden × W_lm_head (V × hidden, full read) · logits V wide, argmax · every weight read · ~1 GB / step]

LM head

Same V × hidden as embedding — but here the whole matrix is read every step.

Params: 1.05 B · FLOPs: 2.1 B · Bytes/step: 1.1 GB

The LM head is a [V × hidden] matrix that projects the final hidden state into logits — one score per vocabulary entry. Some models *tie* it to the embedding table; Llama-3 keeps them separate, which is why the model carries ~1B params for embedding *and* ~1B for LM head.

Unlike embedding (one row read per token), the LM head is a real matmul: at decode the GPU streams every weight on every token. That's ~1 GB of HBM traffic per step at fp8 — the single largest bandwidth contribution in the model's data flow, larger than any FFN or attention matmul.

The 2 × params FLOPs land here too: ~2.1 GFLOPs per token, more than any single attention or FFN matmul. Compute-bound at prefill, bandwidth-bound at decode.
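
All three card numbers fall out of a few lines; a sketch under the same assumptions (untied [V × hidden] head, V = 128,256, hidden = 8,192), with a toy matmul to show the shape of the step:

```python
import numpy as np

V, hidden = 128_256, 8_192
params = V * hidden  # ~1.05 B

print(f"FLOPs/token:          {2 * params / 1e9:.1f} G")   # multiply+add per weight: ~2.1 G
print(f"HBM read/step at fp8: {params / 1e9:.2f} GB")      # every weight streamed once

# The decode step itself, on toy shapes: one hidden state -> V-wide logits.
h = np.random.randn(8).astype(np.float32)       # stand-in final hidden state
W = np.random.randn(16, 8).astype(np.float32)   # stand-in [V x hidden] head
logits = W @ h
next_id = int(np.argmax(logits))                # greedy pick over the vocab
print(next_id)
```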

Try it: switch to a smaller-vocabulary model and watch this number drop linearly.
