Section 1

Vocabulary

The model's text↔vector edge. Token IDs, embedding lookups, and the LM head — every operator that interacts with the vocab dimension.

[Diagram: text → ids · 1857 2917 3589 18549 423 9106 · vocab V = 128k · 0 FLOPs · pure dispatch]

Token IDs

Text becomes integers from a fixed vocabulary; no compute, no weight bytes.

FLOPs: 0 · Bytes/step: 0 B

The model never sees raw text. The tokenizer maps a string into integer IDs from a fixed vocabulary V (e.g. 128,256 subwords for Llama-3). Each token is a single int32 — four bytes — regardless of model size. **The work is zero**: pure dispatch into a lookup table.

But the *vocabulary size* you choose here echoes through the rest of the model. Embedding parameters scale as V × hidden_dim, and the LM head re-uses the same V at the end. A larger vocabulary means each token covers more text — fewer tokens per request, shorter sequences — at a constant per-layer cost. Pick V before architecture; everything downstream is sized against it.
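
The arithmetic fits in a back-of-envelope sketch; the constants are the Llama-3 figures quoted above, and the example IDs come from the diagram, not a real tokenizer:

```python
# Token IDs are free; the vocabulary size V is what echoes downstream.
# Constants are the Llama-3 figures quoted above (V = 128,256; hidden = 8,192).
V, hidden = 128_256, 8_192

ids = [1857, 2917, 3589, 18549, 423, 9106]  # illustrative IDs, not a real tokenization
print(f"payload: {len(ids) * 4} bytes (one int32 each), 0 FLOPs")

# V sizes both ends of the model: embedding and (untied) LM head are V x hidden.
print(f"embedding params: {V * hidden / 1e9:.2f} B")  # ~1.05 B
print(f"LM head params:   {V * hidden / 1e9:.2f} B")  # same shape when untied
```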

Try it in the calculator: change *quantization* and watch how embedding bytes scale, while Token IDs stays at zero.

[Diagram: id → embedding · id 1857 selects 1 row of the V × hidden table (1 of 128k) · ~8 KB / step]

Embedding table

A token ID picks one row from a V × hidden table — only that row is read.

Params: 1.05 B · FLOPs: 0

The embedding table is a learned matrix of shape [V × hidden_dim]. Each row is the dense vector for one token ID — Llama-3 stores 128,256 rows of 8,192 floats, roughly 1B parameters of pure lookup table.

At decode, the model reads exactly one row per token: ~8 KB at fp8, ~16 KB at fp16. That's nothing on a modern GPU: embedding is one of the cheapest decode phases, even though the table holds ~1B of the model's parameters.

The cost shows up in *VRAM*, not bandwidth: the entire table has to live in memory, and a 1B-param table at fp8 is 1 GB resident. Larger vocabularies trade VRAM for shorter sequences.
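
A minimal sketch of that decode-time arithmetic, again assuming V = 128,256 and hidden_dim = 8,192; the toy array stands in for the real table:

```python
import numpy as np

V, hidden = 128_256, 8_192

# Per-step read is one row; resident cost is the whole table.
for fmt, bytes_per_param in (("fp8", 1), ("fp16", 2)):
    row_kb = hidden * bytes_per_param / 1024
    table_gb = V * hidden * bytes_per_param / 2**30
    print(f"{fmt}: {row_kb:.0f} KB read/step, {table_gb:.2f} GB resident")

# The lookup itself is indexing, not math: token ID -> one row of [V x hidden].
toy_table = np.arange(12, dtype=np.float16).reshape(4, 3)  # tiny stand-in
print(toy_table[2])  # "token ID 2" reads exactly one row
```

At fp8 that prints ~8 KB per step against ~0.98 GB resident, which is the bandwidth-vs-VRAM trade described above.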

Try it in the calculator: change *quantization* and watch embedding bytes scale linearly with bits-per-param.

[Diagram: hidden → logits · hidden × W_lm_head (V × hidden, full read) · logits V wide, argmax · every weight read · ~1 GB / step]

LM head

Same V × hidden as embedding — but here the whole matrix is read every step.

Params: 1.05 B · FLOPs: 2.1 B · Bytes/step: 1.1 GB

The LM head is a [V × hidden] matrix that projects the final hidden state into logits — one score per vocabulary entry. Some models *tie* it to the embedding table; Llama-3 keeps them separate, which is why the model carries ~1B params for embedding *and* ~1B for LM head.

Unlike embedding (one row read per token), the LM head is a real matmul: at decode the GPU streams every weight on every token. That's ~1 GB of HBM traffic per step at fp8 — the single largest bandwidth contribution in the model's data flow, larger than any FFN or attention matmul.

The 2 × params FLOPs land here too: ~2.1 GFLOPs per token, more than any single attention or FFN matmul. Compute-bound at prefill, bandwidth-bound at decode.
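
All three card numbers fall out of a few lines; a sketch under the same assumptions (untied [V × hidden] head, V = 128,256, hidden = 8,192), with a toy matmul to show the shape of the step:

```python
import numpy as np

V, hidden = 128_256, 8_192
params = V * hidden  # ~1.05 B

print(f"FLOPs/token:          {2 * params / 1e9:.1f} G")   # multiply+add per weight: ~2.1 G
print(f"HBM read/step at fp8: {params / 1e9:.2f} GB")      # every weight streamed once

# The decode step itself, on toy shapes: one hidden state -> V-wide logits.
h = np.random.randn(8).astype(np.float32)       # stand-in final hidden state
W = np.random.randn(16, 8).astype(np.float32)   # stand-in [V x hidden] head
logits = W @ h
next_id = int(np.argmax(logits))                # greedy pick over the vocab
print(next_id)
```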

Try it: switch to a smaller-vocabulary model and watch this number drop linearly.
