Stage 1

Why does GPT-4o cost $10 per million output tokens but $2.50 per million input?

Builds on the home page. No prior LLM internals required.

A token costs more on the way out than on the way in. GPT-4o charges $10 per million output tokens and $2.50 per million input. Claude has the same shape; so does Gemini. Every commercial LLM provider does. It looks like a markup. It isn’t. The asymmetry is forced by how transformers run on a GPU. Read this stage and you’ll be able to derive the price yourself — just hardware specs, model size, and arithmetic.

01 Output costs more than input. Always. §

OpenAI charges 4× more for output tokens than input tokens. Anthropic charges 5×. Google, 8×. DeepSeek, 4×. Meta-via-Together is the lone exception, and only because they bill by GPU-hour rather than per-token. The pattern holds across every other commercial provider.

Provider    Model                            $/M in    $/M out    Out ÷ In
OpenAI      GPT-4o                           $2.50     $10.00     4.0×
Anthropic   Claude Sonnet 4                  $3.00     $15.00     5.0×
Google      Gemini 2.5 Pro                   $1.25     $10.00     8.0×
Meta        Llama 3.1 70B (vendor-hosted)    $0.60     $0.60      1.0×
DeepSeek    DeepSeek V3                      $0.27     $1.10      4.1×
Pay-as-you-go list pricing for general-purpose flagship models, accessed 2026-05. Vendor-hosted Llama 3.1 70B is the outlier — symmetric pricing because the host charges by GPU-hour rather than per-token.

A reasonable first reaction is suspicion: providers raised the output price because they could. The data here only shows that they all did the same thing, which is what you’d expect from a cartel — or from a physical constraint. The rest of the page argues for the second reading.

Across providers and model families, output tokens cost 4× to 8× more than input tokens. The ratio is too consistent to be a marketing decision.
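The ratios in the table can be verified with a few lines of arithmetic (prices hard-coded from the table above; a sketch, not a live price feed):

```python
# Output-to-input price ratios from the table above (list prices, accessed 2026-05).
prices = {  # provider/model: ($ per M input tokens, $ per M output tokens)
    "OpenAI GPT-4o": (2.50, 10.00),
    "Anthropic Claude Sonnet 4": (3.00, 15.00),
    "Google Gemini 2.5 Pro": (1.25, 10.00),
    "DeepSeek V3": (0.27, 1.10),
}
for name, (p_in, p_out) in prices.items():
    print(f"{name}: {p_out / p_in:.1f}x")  # 4.0x, 5.0x, 8.0x, 4.1x
```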

02 Prefill is parallel; decode is sequential. §

Here’s the asymmetry in one diagram. When you send 500 input tokens, the model processes them all together — the matrix multiplications are wide enough to cover the whole prompt in essentially one parallel pass. When the model produces 500 output tokens, it runs the model 500 times, one pass per token, because each token has to be sampled before the next pass can start.

[Diagram: prefill vs decode. All 8 input tokens go through the model together in one forward pass (~10s of ms); 6 output tokens require 6 sequential forward passes, 6× the work.]
The prefill batch fits in one matmul; decode requires N matmuls in sequence, each one waiting on the previous token to be sampled. Same model, very different time.
[Interactive widget: choose a model, GPU, and playback speed. Bar lengths render to wall-clock scale, comparing prefill’s single parallel pass against per-token decode for the same 24-token payload.]

This is the architectural truth behind the price. Same model, same GPU, same prompt. The number of forward passes through the network differs by hundreds. Output costs more because output is more work.

500 output tokens require 500 sequential forward passes; 500 input tokens require roughly one parallel pass. The work isn’t the same.
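A toy timing model makes the pass-count claim concrete. The numbers below are assumptions for illustration (real engines overlap and batch work), but the sequential dependency is the point:

```python
# Toy wall-clock model: one batched prefill pass for the whole prompt,
# then one sequential forward pass per generated output token.
PREFILL_PASS_MS = 50   # assumption: 500-token prompt in one ~50 ms parallel pass
DECODE_STEP_MS = 42    # assumption: ~42 ms per token (see the bandwidth arithmetic below)

def wall_clock_ms(n_output: int) -> float:
    return PREFILL_PASS_MS + n_output * DECODE_STEP_MS

print(f"{wall_clock_ms(500):.0f} ms total")  # 21050 ms, almost all of it decode
```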

03 Decode rate is set by HBM bandwidth, not compute. §

If the work were 500× as much, you’d expect output tokens to be 500× more expensive, not 4–8×. They aren’t, because each forward pass during decode is much cheaper than the prefill pass. To see why, look at what a single decode step actually does on a GPU.

[Interactive widget: pick a model and GPU to see the weights streamed per step, the per-step time, and the resulting decode rate.]

decode_step_ms ≈ weights_bytes / hbm_bandwidth
decode_tps = 1000 / decode_step_ms

A 70B model in FP16 weighs about 140 GB. To produce one token the GPU has to stream those weights from HBM through the matrix multipliers and back. On an H100 SXM that takes ~42 ms per step — about 24 tok/s for a single user. Switch GPUs in the widget; the step time tracks the bandwidth column on each spec sheet, almost exactly. Compute (TFLOPs) barely moves the number. Bandwidth does.
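The same arithmetic as a sketch. The 3.35 TB/s figure is the H100 SXM HBM3 spec; the 140 GB figure is 70B parameters at 2 bytes each:

```python
# Bandwidth-bound decode: each token requires streaming all weights from HBM once.
def decode_rate(weights_gb: float, hbm_tb_per_s: float) -> tuple[float, float]:
    step_s = weights_gb / (hbm_tb_per_s * 1000)  # GB ÷ (GB/s) = seconds per token
    return step_s * 1000, 1.0 / step_s           # (ms per step, tokens per second)

step_ms, tps = decode_rate(weights_gb=140, hbm_tb_per_s=3.35)  # 70B in FP16, H100 SXM
print(f"{step_ms:.0f} ms/step, {tps:.0f} tok/s")  # ~42 ms/step, ~24 tok/s
```

Swap in another GPU’s bandwidth and the step time moves with it; swap in more TFLOPs at the same bandwidth and it barely moves.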

This is the key idea you’ll spend the next stage exploring. For now: decode tokens are cheap per step but slow per user. Many short waits, one after another.

Decode tok/s is bounded by how fast the GPU can stream the weights once per token — decode_tps ≈ HBM bandwidth ÷ model bytes.

04 From throughput to a per-million-token price. §

Cost is rented GPU time divided by tokens produced. If you’re paying a few dollars per hour for a GPU and decoding at tens of tokens per second, every output token costs you that hourly rate divided by tokens-per-hour. Rescale to a million tokens and you get $/M output. The same arithmetic on prefill TPS gives $/M input. Two numbers, one formula.

[Interactive widget: pick a model and GPU. $/hr ÷ decode tok/s × 1e6 ÷ 3600 gives $/M out; the same arithmetic with prefill tok/s gives $/M in.]

$/M = ($/hr ÷ tokens-per-second) × (1e6 ÷ 3600) — same formula, two throughputs.

The formula is the whole pricing story. A single GPU dollar-per-hour figure plus two throughput numbers (prefill TPS and decode TPS) determine the entire price card. The $/hr is set by the cloud; the throughputs are set by hardware specs and model architecture. Nothing else enters.
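Plugging illustrative numbers into the formula. The $/hr, prefill, and decode figures below are assumptions for a single H100 stream, not any provider’s actuals; production batching pushes effective throughput, and therefore the achievable price, much lower per token:

```python
# $/M = ($/hr ÷ tokens-per-second) × (1e6 ÷ 3600)
def usd_per_million(usd_per_hour: float, tokens_per_second: float) -> float:
    return (usd_per_hour / tokens_per_second) * (1e6 / 3600)

GPU_USD_PER_HOUR = 2.50   # assumption: rented H100
DECODE_TPS = 24           # assumption: single-stream, bandwidth-bound decode
PREFILL_TPS = 5000        # assumption: batched prompt processing

print(f"${usd_per_million(GPU_USD_PER_HOUR, DECODE_TPS):.2f} /M out")   # $28.94
print(f"${usd_per_million(GPU_USD_PER_HOUR, PREFILL_TPS):.2f} /M in")   # $0.14
```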

Cost per million tokens equals dollars per GPU-hour divided by tokens-per-second, rescaled by 3,600 seconds per hour and a million-tokens-per-million.

05 Why output is 4–8× more expensive — exactly. §

The ratio of output cost to input cost equals the ratio of prefill TPS to decode TPS. That ratio is a property of the hardware-and-model combination, not the provider. For a 70B model on an H100 it sits around 100×–500× depending on prompt length and engine; commercial providers price the output at only 4–8× the input. They’re already absorbing most of the asymmetry into the input price.
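The ratio falls out directly, because the $/hr cancels. The throughputs below are assumed single-stream figures for a 70B model on an H100:

```python
# Output ÷ input cost ratio = prefill TPS ÷ decode TPS; the GPU rental rate cancels.
prefill_tps = 5000.0   # assumption: batched prompt pass
decode_tps = 24.0      # assumption: single-user, bandwidth-bound decode
print(f"raw compute ratio: {prefill_tps / decode_tps:.0f}x")  # ~208x
# Providers sell this at 4–8x, absorbing the rest into the input price.
```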

[Interactive widget: pick a model and GPU to see the derived $/M input, $/M output, and the output ÷ input ratio.]

Pick different model + GPU combinations in the widget and watch the ratio. It moves with hardware choice but stays in the same general band. The provider’s job is to find a price card customers will accept; the shape of that card — output more expensive than input — is fixed before the provider gets a vote.

Provider pricing reflects the prefill-versus-decode asymmetry, with the input price subsidising the output price by a factor that varies but rarely shrinks below 2×.

06 Folklore: "OpenAI marks output up because they can." §

The takeaway isn’t that providers are charity. It’s that the compute asymmetry forces the shape of the price card before any business decision happens. Output tokens are physically more expensive to produce, and every provider compresses that fact into the same general-purpose 4–8× ratio because the market won’t accept 100×.

The “output is marked up” framing has the sign backwards: relative to true compute cost, output is underpriced. The asymmetry survives because customers already think 4× is steep.

Synthesis

  • Tokens are sub-word units priced per million.
  • Output costs more than input because output tokens are generated sequentially while input is processed in parallel. 500 output tokens = 500 forward passes; 500 input tokens ≈ one pass.
  • Decode rate is set by HBM bandwidth, not compute. That’s the next stage.
  • $/M = ($/hr ÷ tokens-per-second) × (1e6 ÷ 3600). Same formula for input and output; different throughput plugged in.
  • The 4–8× ratio across providers reflects physics, not markup. If anything, it understates the true cost asymmetry.
Open this stage's primary scenario in the calculator ↗

What's next

Stage 1 ended at “decode is bandwidth-bound, full stop.” Stage 2 proves it from a spec sheet: why a 70B model is more than 8× slower per token than an 8B model on the same GPU, the roofline test that names the regime, and the four knobs every inference optimisation paper turns.
