Stage 1
Why does GPT-4o cost $10 per million output tokens but $2.50 per million input?
Builds on the home page. No prior LLM internals required.
A token costs more on the way out than on the way in. GPT-4o charges $10 per million output tokens and $2.50 per million input. Claude has the same shape; so does Gemini. Every commercial LLM provider does. It looks like a markup. It isn’t. The asymmetry is forced by how transformers run on a GPU. Read this stage and you’ll be able to derive the price yourself — just hardware specs, model size, and arithmetic.
01 Output costs more than input. Always. §
OpenAI charges 4× more for output tokens than input tokens. Anthropic charges 5×. Google, 8×. DeepSeek, 4×. Meta-via-Together is the lone exception, and only because its flat per-token rate is effectively a resale of GPU time rather than a price tied to what each token costs to produce. The pattern holds across every other commercial provider.
| Provider · Model | $/M in | $/M out | Out ÷ In |
|---|---|---|---|
| OpenAI · GPT-4o | $2.50 | $10.00 | 4.0× |
| Anthropic · Claude Sonnet 4 | $3.00 | $15.00 | 5.0× |
| Google · Gemini 2.5 Pro | $1.25 | $10.00 | 8.0× |
| Meta · Llama 3.1 70B (vendor-hosted) | $0.60 | $0.60 | 1.0× |
| DeepSeek · DeepSeek V3 | $0.27 | $1.10 | 4.1× |
A reasonable first reaction is suspicion: providers raised the output price because they could. The data here only shows that they all did the same thing, which is what you’d expect from a cartel — or from a physical constraint. The rest of the page argues for the second reading.
Across providers and model families, output tokens cost 4× to 8× more than input tokens. The ratio is too consistent to be a marketing decision.
02 Prefill is parallel; decode is sequential. §
Here’s the asymmetry in one diagram. When you send 500 input tokens, the model processes them all together — the matrix multiplications are wide enough to cover the whole prompt in essentially one parallel pass. When the model produces 500 output tokens, it runs the model 500 times, one pass per token, because each token has to be sampled before the next pass can start.
This is the architectural truth behind the price. Same model, same GPU, same prompt. The number of forward passes through the network differs by hundreds. Output costs more because output is more work.
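If the prose is hard to picture, here is a toy sketch of the control flow. It is not a real inference engine; `forward`, `prefill`, and `decode` are stand-ins invented here, meant only to show where the sequential dependency lives.

```python
# Toy sketch: prefill touches the whole prompt in one call; decode must
# call the model again for every generated token, in order.
def forward(tokens):                  # stand-in for one transformer forward pass
    return sum(tokens) % 50_000       # pretend "pick the next token"

def prefill(prompt):
    return forward(prompt)            # one pass, however long the prompt is

def decode(prompt, n_new):
    context, passes = list(prompt), 0
    for _ in range(n_new):            # each new token needs a full pass, and
        nxt = forward(context)        # can't start until the previous token
        context.append(nxt)           # has been sampled and appended
        passes += 1
    return context[-n_new:], passes

_, passes = decode(list(range(500)), 500)
print(passes)                         # 500 passes for 500 output tokens
```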
500 output tokens require 500 sequential forward passes; 500 input tokens require roughly one parallel pass. The work isn’t the same.
03 Decode rate is set by HBM bandwidth, not compute. §
If the work were 500× as much, you’d expect output tokens to be 500× more expensive, not 4–8×. They aren’t, because each forward pass during decode is much cheaper than the prefill pass. To see why, look at what a single decode step actually does on a GPU.
decode_step_ms ≈ weights_bytes / hbm_bandwidth
decode_tps = 1000 / decode_step_ms
A 70B model in FP16 weighs about 140 GB. To produce one token, the GPU has to stream those weights from HBM into the on-chip compute units, and it has to stream them all again for the next token.
This is the key idea you’ll spend the next stage exploring. For now: decode tokens are cheap-per-step but slow-per-user. Many short waits, one after another.
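Plugging numbers into the formula above makes the point concrete. The figures below (140 GB of FP16 weights, roughly 3.35 TB/s of HBM bandwidth, about what an H100 SXM spec sheet quotes) are assumptions for illustration, not measurements:

```python
# Back-of-envelope decode rate from the formula above.
# Assumed figures: 70B params at 2 bytes each (FP16) and ~3.35 TB/s of HBM
# bandwidth (roughly an H100 SXM spec sheet); swap in your own numbers.
weights_gb = 70e9 * 2 / 1e9            # ≈ 140 GB streamed per decode step
hbm_bandwidth_tbps = 3.35              # TB/s

decode_step_ms = weights_gb / hbm_bandwidth_tbps   # GB ÷ (TB/s) lands in ms
decode_tps = 1000 / decode_step_ms

print(f"{decode_step_ms:.1f} ms per token  ->  {decode_tps:.0f} tok/s")
# ≈ 41.8 ms per token -> ~24 tok/s, however fast the compute units are
```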
Decode tok/s is bounded by how fast the GPU can stream the weights once per token —
decode_tps ≈ HBM bandwidth ÷ model bytes.
04 From throughput to a per-million-token price. §
Cost is rented GPU time divided by tokens produced. If you’re paying a few dollars per hour for a GPU and decoding at tens of tokens per second, every output token costs you that hourly rate divided by tokens-per-hour. Rescale to a million tokens and you get $/M output. The same arithmetic on prefill TPS gives $/M input. Two numbers, one formula.
$/M = ($/hr ÷ tokens-per-second) × (1e6 ÷ 3600) — same formula, two throughputs.
The formula is the whole pricing story. A single GPU dollar-per-hour figure plus two throughput numbers (prefill TPS and decode TPS) determine the entire price card. The $/hr is set by the cloud; the throughputs are set by hardware specs and model architecture. Nothing else enters.
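Here is that arithmetic as a sketch. The GPU rate and the two throughputs below are invented for illustration, not any provider’s real numbers:

```python
# $/M = ($/hr ÷ tokens-per-second) × (1e6 ÷ 3600), applied twice.
def dollars_per_million(gpu_dollars_per_hour, tokens_per_second):
    tokens_per_hour = tokens_per_second * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6

gpu_rate    = 4.00     # $/hr to rent one GPU        (assumed)
prefill_tps = 3000.0   # input tokens per second     (assumed)
decode_tps  = 25.0     # output tokens per second    (assumed, single stream)

print(f"input : ${dollars_per_million(gpu_rate, prefill_tps):.2f} per million")
print(f"output: ${dollars_per_million(gpu_rate, decode_tps):.2f} per million")
```

With these made-up throughputs the output side lands roughly two orders of magnitude above the input side, which is the raw gap the next section says providers compress down to 4–8×.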
Cost per million tokens equals dollars per GPU-hour divided by tokens-per-second, rescaled by 3,600 seconds per hour and scaled up to a million tokens.
05 Why output is 4–8× more expensive — exactly. §
The ratio of output cost to input cost equals the ratio of prefill TPS to decode TPS. That ratio is a property of the hardware-and-model combination, not the provider. For a 70B model on an H100 it sits around 100×–500× depending on prompt length and engine; commercial providers price the output at only 4–8× the input. They’re already absorbing most of the asymmetry into the input price.
Pick different model + GPU combinations in the widget and watch the ratio. It moves with hardware choice but stays in the same general band. The provider’s job is to find a price card customers will accept; the shape of that card — output more expensive than input — is fixed before the provider gets a vote.
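For one point in that band, worked from spec-sheet numbers rather than the widget: the H100 figures and the 40% utilisation guess below are assumptions for illustration, and real ratios move with prompt length, batching, and engine, exactly as the text says.

```python
# Rough compute-cost ratio for a 70B model on one H100 (assumed specs).
params       = 70e9
hbm_tb_per_s = 3.35     # HBM bandwidth, ≈ H100 SXM spec
bf16_tflops  = 989.0    # dense BF16 peak, ≈ H100 SXM spec
prefill_mfu  = 0.40     # assumed fraction of peak FLOPs reached during prefill

# Prefill is compute-bound: ~2 FLOPs per parameter per token.
prefill_tps = bf16_tflops * 1e12 * prefill_mfu / (2 * params)

# Decode (single stream) is bandwidth-bound: stream all FP16 weights per token.
decode_tps = hbm_tb_per_s * 1e12 / (params * 2)

print(f"prefill ≈ {prefill_tps:,.0f} tok/s, decode ≈ {decode_tps:.0f} tok/s")
print(f"cost ratio ≈ {prefill_tps / decode_tps:.0f}x")
# lands near the low end of the 100×–500× band quoted above
```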
Provider pricing reflects the prefill-versus-decode asymmetry, with the input price subsidising the output price by a factor that varies but rarely shrinks below 2×.
06 Folklore: "OpenAI marks output up because they can." §
The takeaway isn’t that providers are charity. It’s that the compute asymmetry forces the shape of the price card before any business decision happens. Output tokens are physically more expensive to produce, and every provider compresses that fact into the same general-purpose 4–8× ratio because the market won’t accept 100×.
The “output is marked up” framing has the sign backwards: relative to true compute cost, output is underpriced. The asymmetry survives because customers already think 4× is steep.
Synthesis
- Tokens are sub-word units priced per million.
- Output costs more than input because output tokens are generated sequentially while input is processed in parallel. 500 output tokens = 500 forward passes; 500 input tokens ≈ one pass.
- Decode rate is set by HBM bandwidth, not compute. That’s the next stage.
- $/M = ($/hr ÷ tokens-per-second) × (1e6 ÷ 3600). Same formula for input and output; different throughput plugged in.
- The 4–8× ratio across providers reflects physics, not markup. If anything, it understates the true cost asymmetry.
What's next
Stage 1 ended at “decode is bandwidth-bound, full stop.” Stage 2 proves it from a spec sheet: why a 70B model is more than 8× slower per token than an 8B model on the same GPU, the roofline test that names the regime, and the four knobs every inference optimisation paper turns.