[Interactive figure: one decode step, traced end to end. "What is photosynthesis?" → 6 sub-word tokens ('photosynthesis' splits) → embedding lookup (vocab × hidden, ~free FLOPs) → 80 transformer blocks, each: pre-norm RMSNorm → attention (RoPE on Q and K; Q·Kᵀ, softmax·V, W_o; the K and V caches gain one row per decode step) → residual → post-norm RMSNorm → SwiGLU FFN (gate ⊙ up → down, ~half the FLOPs) → residual → final RMSNorm → LM head (vocab × hidden) → softmax over the vocab, top-k pick → "Plants", appended to the output; the decode loop runs ~500 output tokens, the KV cache growing one row per step. Weights stream from HBM every token: 0.9 GB/layer at 3.35 TB/s. VRAM budget (160 GB, 2× H100, fits): weights 70 GB, KV 5.2 GB, activations 8 GB, 76.8 GB free. Live counters track FLOPs/token, bytes/token, and VRAM held during decode.]
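The figure's headline numbers can be reproduced in a few lines. A minimal sketch, assuming a 70B-parameter model quantised to 8-bit weights, 80 layers, grouped-query attention with 8 KV heads of head dimension 128, an fp16 KV cache, and roughly 16k tokens of context; none of these are stated in the figure, so treat them as illustrative:

```python
# Back-of-envelope check of the figure's numbers (assumptions above).

params       = 70e9
weight_bytes = params * 1                      # 8-bit weights -> 70 GB
print(f"weights: {weight_bytes / 1e9:.0f} GB, "
      f"{weight_bytes / 80 / 1e9:.2f} GB/layer")             # ~0.88 GB/layer

kv_per_token = 2 * 80 * 8 * 128 * 2   # K+V x layers x kv_heads x head_dim x fp16
print(f"KV cache: {kv_per_token / 1e3:.0f} KB/token, "
      f"{kv_per_token * 16384 / 1e9:.1f} GB at 16k tokens")  # ~5.4 GB

hbm_bw = 3.35e12                               # H100 SXM HBM3, bytes/s
print(f"decode floor: {weight_bytes / hbm_bw * 1e3:.1f} ms/token "
      f"({hbm_bw / weight_bytes:.0f} tok/s) if one GPU streams all weights")
```

Per-layer weight size comes out at 0.875 GB, matching the figure's 0.9 GB/layer; splitting the weight stream across the two H100s roughly halves the ~21 ms floor.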
Atlas — token economics

Where every FLOP, every byte of memory traffic, and every byte of VRAM lives.

Five sections, ~17 lesson cards. Each card answers "what does this operator cost, and why?" against the scenario above.

Spine — cost forensics

Five questions, in order.

Each stage starts from one hook question and ends with a synthesis you can articulate. Walked in order, the path goes from "I think LLM costs are arbitrary" to "I can derive the price card from a hardware spec."

  1. Stage 1 · 12 min skim

    Why does GPT-4 cost $10 per million output tokens but $2.50 per million input?

    Output tokens cost more than input tokens at every LLM provider. The asymmetry isn't a markup — it's forced by how transformers run on a GPU. Derive the price from first principles (a worked sketch follows this list).

  2. Stage 2 · 14 min skim

    Why is a 70B model more than 8× slower per token than an 8B model?

    Decode runs once per token and reads every weight from HBM each step. Compute hardly enters. Derive the per-token speed straight from a spec sheet (see the second sketch after this list).

  3. Stage 3 · 14 min skim

    Why does an 80 GB GPU serve only 30 concurrent users?

    Weights fit in a third of the VRAM. The rest fills with KV cache faster than anyone expects. Derive the capacity cliff that limits every long-context deployment (see the third sketch after this list).

  4. Stage 4 · 8 min skim

    Which lever do you reach for?

    An organised catalogue of inference levers — quantisation, attention variants, MoE, parallelism, batching, engines — grouped by what they actually move in the decode equation.

  5. Stage 5 · 20 min skim

    Given my SLO and budget, what should I run?

    Take a real product spec and walk it to a defensible deployment. The decode and capacity equations from earlier stages now do the work — every choice is one line of arithmetic, not a guess.
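Stage 1's asymmetry, sketched with entirely hypothetical numbers (GPT-4's internals are not public). What matters is the shape, prefill being compute-bound and decode bandwidth-bound, not the exact dollar figures; the GPU price, model size, sustained FLOP rate, and batch size below are all assumptions:

```python
# Hypothetical numbers only: the point is the prefill/decode cost gap.

gpu_dollars_per_s = 2.50 / 3600      # a ~$2.50/hr GPU (assumption)
params = 70e9                        # stand-in model size (assumption)

# Prefill: all input tokens go through in one batched, compute-bound pass.
# ~2 FLOPs per parameter per token; assume 400 TFLOP/s sustained.
prefill_tok_s = 400e12 / (2 * params)
print(f"input:  ${gpu_dollars_per_s / prefill_tok_s * 1e6:.2f} per 1M tokens")

# Decode: one full weight read from HBM per token. Batching amortises the
# read across users; assume 8-bit weights and an effective batch of 8.
decode_tok_s = 3.35e12 / (params * 1) * 8
print(f"output: ${gpu_dollars_per_s / decode_tok_s * 1e6:.2f} per 1M tokens")
```

Even with generous batching, decode buys several times fewer tokens per GPU-second than prefill, and that gap is what the price card passes on.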
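Stage 2's derivation fits in one function. A sketch assuming fp16 weights and a single H100's 3.35 TB/s of HBM bandwidth; real engines land somewhat below this ceiling:

```python
# Decode reads every weight from HBM once per token, so the spec sheet
# alone bounds single-stream speed.

def decode_ceiling_tok_s(params: float, bytes_per_param: float = 2.0,
                         hbm_bw: float = 3.35e12) -> float:
    """Bandwidth-bound upper limit on tokens/s for batch-1 decode."""
    return hbm_bw / (params * bytes_per_param)

for n in (8e9, 70e9):
    print(f"{n / 1e9:.0f}B model: <= {decode_ceiling_tok_s(n):.0f} tok/s")
# 8B model:  <= 209 tok/s
# 70B model: <= 24 tok/s   (the 8.75x parameter ratio maps straight to speed)
```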
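And Stage 3's cliff, again with assumed numbers: a hypothetical 13B fp16 model (multi-head attention, 40 layers, hidden size 5120) on one 80 GB GPU, each user holding 2k tokens of context:

```python
# Assumed deployment: 13B fp16 model, MHA, 40 layers, hidden 5120.

vram        = 80e9
weights     = 13e9 * 2                 # fp16 -> 26 GB, about a third of VRAM
activations = 6e9                      # runtime scratch space (rough guess)
kv_budget   = vram - weights - activations

kv_per_token = 2 * 40 * 5120 * 2       # K+V x layers x hidden x fp16 bytes
kv_per_user  = kv_per_token * 2048     # ~1.7 GB per user at 2k context

print(f"KV budget {kv_budget / 1e9:.0f} GB / "
      f"{kv_per_user / 1e9:.2f} GB per user = {int(kv_budget / kv_per_user)} users")
# Doubling the context halves the head-count; 8k context leaves single digits.
```

Under these assumptions the GPU tops out around 28 concurrent users, and the KV term, not the weights, is what moves when the product asks for longer context.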