Validation

Predicted vs measured throughput on public benchmarks. Sources are clickable on every row; known failure modes are named below.

Coverage at a glance

[Coverage summary table, generated from the benchmark data.]

Confidence levels

Every confidence badge on /calculate and /compare links here. The level now reflects direct benchmark support, same-regime support, or lack of support, not generic hand-wavy heuristics.

Level | Configuration profile | Target | Failure modes
High | Dense model · NVIDIA datacenter GPU · FP16/BF16/FP8/INT8 weights · vLLM or TRT-LLM · batch ≥ 8 | ±30% | None known in this regime
Medium | GGUF on llama.cpp · Apple Silicon · MoE batch ≥ 16 · hybrid attention · TP=8 · context ≤ 8K | ±50% | MFU calibration · expert load imbalance
Low | MoE batch < 8 · GGUF on TRT-LLM · context > 8K · MFU/bw_eff overridden far from preset | None | See known failure modes
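
A minimal sketch of how these profiles could be turned into a badge. The field names and rules below are a simplified reading of the table, not tokenomy's actual code:

def confidence(dense: bool, nvidia_dc_gpu: bool, gguf: bool, engine: str,
               batch: int, context: int, preset_overridden: bool) -> str:
    """Map a configuration to a confidence level per the table above (simplified)."""
    low = (
        (not dense and batch < 8)           # MoE at small batch
        or (gguf and engine == "trt-llm")   # GGUF quant on a datacenter engine
        or context > 8_000                  # beyond the 8K context cap
        or preset_overridden                # MFU/bw_eff far from preset
    )
    if low:
        return "low"
    if dense and nvidia_dc_gpu and not gguf and engine in {"vllm", "trt-llm"} and batch >= 8:
        return "high"    # ±30% target
    return "medium"      # ±50% target

print(confidence(dense=True, nvidia_dc_gpu=True, gguf=False, engine="vllm",
                 batch=32, context=4_096, preset_overridden=False))  # high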

Benchmarks

[Benchmark table: predicted vs. measured throughput, one row per public benchmark, with sources.]

Known failure modes

Where the math breaks, named. Each card explains the regime, the numerical impact, and how to read predictions in that zone. Out-of-target ⚠ icons in the table above link directly to the matching card.

MoE small-batch decode

Partially mitigated

Our bandwidth-formula prediction overestimates throughput at low concurrency for MoE models because expert-routing dispatch and kernel-launch overhead aren't modelled. The simulator assumes weights stream at full HBM bandwidth on every step; in practice MoE engines spend much of every step bouncing tokens between experts.

Numerical impact

  • DeepSeek-V3 single-user: simulator predicts ~46k tok/s aggregate, measured ~620 tok/s — about 75× overshoot.
  • PRD v2's two-component decode formula reduces the gap (single-user: v1 37× off → v2 ~11× off).

When to trust

  • Treat the prediction as an upper bound until concurrency reaches the engine's batch-fill threshold (vLLM: ~16; TRT-LLM: ~8).
  • At batch ≥ 16 the formula converges back into the ±50% medium-confidence band.
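
A minimal sketch of that upper-bound reading. The thresholds come from the bullets above; the function is illustrative, not the tokenomy API:

BATCH_FILL = {"vllm": 16, "trt-llm": 8}  # engine batch-fill thresholds from above

def interpret_moe_decode(predicted_tps: float, engine: str, batch: int) -> str:
    """Label a bandwidth-formula MoE decode prediction at a given concurrency."""
    if batch < BATCH_FILL.get(engine, 16):
        # Expert-routing dispatch and kernel-launch overhead are unmodelled here,
        # so the number is a ceiling, not an estimate.
        return f"<= {predicted_tps:,.0f} tok/s (upper bound only)"
    return f"~{predicted_tps:,.0f} tok/s (within the ±50% medium-confidence band)"

print(interpret_moe_decode(46_000, "vllm", batch=1))   # upper bound only
print(interpret_moe_decode(46_000, "vllm", batch=32))  # converged regime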

Underspecified vendor benchmarks

Known

Some vendor blogs publish a tok/s number without ISL, OSL, or concurrency, leaving the prediction-vs-measured comparison ambiguous. We estimate the missing fields from context, then flag the row's confidence as low. Treat these as sanity bounds, not precise targets.

Numerical impact

  • vllm-llama3.1-8b-bf16-h100-chat: ISL/OSL guessed from 'chat workload' wording; predicted/measured gap ~45%.

When to trust

  • Compare at the order-of-magnitude level only. If our prediction is within 2× of the published number, the benchmark is informative; if not, treat it as a known unknown (a sketch of this check follows below).
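
A sketch of that 2× check, with hypothetical names and illustrative numbers (not the tokenomy API):

def is_informative(predicted_tps: float, published_tps: float) -> bool:
    """True when prediction and published number agree within a factor of 2 either way."""
    ratio = max(predicted_tps, published_tps) / min(predicted_tps, published_tps)
    return ratio <= 2.0

# Illustrative: a ~45% gap, as in the chat-workload row above, is well inside 2x.
print(is_informative(predicted_tps=14_500, published_tps=10_000))  # True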

Long-context (> 8K) accuracy degradation

Roadmap

Beyond 8K context the KV-budget cap kicks in, the activation-memory heuristic underestimates real overheads, and the attention regime can flip from bandwidth- to compute-bound. PRD v2 acknowledges this as a parking-lot item.

Numerical impact

  • Predictions for context > 8K may underestimate latency by 30–60% depending on attention type.
  • Memory-fit estimates for long contexts are conservative; the calculator may report 'doesn't fit' for workloads that engines actually serve via paged attention.

When to trust

  • Stay within 8K context for high-confidence predictions.
  • For long-context workloads, treat decode TPS as an upper bound and TTFT as a lower bound.

Apple Silicon MFU calibration

Known

Metal-MLX and Metal-llama.cpp presets are educated guesses pieced together from community measurements, not vendor datasheets. Apple doesn't publish kernel-level efficiency for the M-series the way NVIDIA does for H100/H200, so the MFU and bandwidth-efficiency figures in our presets carry meaningful uncertainty.

Numerical impact

  • Expect ±25–50% accuracy on M-series predictions, larger than the high-confidence ±30% target.
  • The error is approximately symmetric — sometimes faster, sometimes slower than predicted.

When to trust

  • Use the prediction to compare two M-series configurations relatively (M3 Max vs M4 Max). Absolute numbers should be cross-checked against r/LocalLLaMA threads.
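
A sketch of why the relative comparison holds up: both configurations share the same uncertain bandwidth-efficiency preset, so that factor cancels in the ratio. The bandwidth figures are approximate published peaks and the streamed-bytes value is illustrative:

def decode_tps(memory_bandwidth_gbs: float, bandwidth_eff: float, streamed_gb_per_step: float) -> float:
    """Decode TPS via the bandwidth formula from 'Methodology in 60 seconds'."""
    return memory_bandwidth_gbs * bandwidth_eff / streamed_gb_per_step

bw_eff = 0.7  # the uncertain M-series calibration factor, shared by both configs
m3_max = decode_tps(400, bw_eff, streamed_gb_per_step=4.5)
m4_max = decode_tps(546, bw_eff, streamed_gb_per_step=4.5)

print(m4_max / m3_max)  # ~1.37, and unchanged whether bw_eff is 0.5 or 0.8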

GGUF formats on datacenter engines

Known

GGUF quantisation formats (Q4_K_M, Q5_K_S, etc.) are llama.cpp-shaped — designed for CPU+CUDA mixed execution with specific block layouts. Datacenter engines (vLLM, TRT-LLM) use native FP8/INT8/INT4 kernels with completely different memory access patterns. Pairing a GGUF quant with a datacenter engine produces a prediction that has no real-world counterpart.

Numerical impact

  • Predictions with mismatched quant/engine pairings are not validated; we publish a low-confidence badge and recommend swapping to a native quant.

When to trust

  • Don't. If the calculator surfaces this combination, switch the engine to llama.cpp/MLX or switch the quant to a datacenter-native format (FP8, INT8, INT4 native).
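
A sketch of the mismatch check this card describes; the sets below are illustrative, not tokenomy's preset lists:

GGUF_QUANTS = {"Q4_K_M", "Q5_K_S", "Q8_0"}   # llama.cpp-shaped formats
DATACENTER_ENGINES = {"vllm", "trt-llm"}      # engines with native FP8/INT8/INT4 kernels

def quant_engine_mismatch(quant: str, engine: str) -> bool:
    """True when a GGUF quant is paired with a datacenter engine."""
    return quant in GGUF_QUANTS and engine in DATACENTER_ENGINES

if quant_engine_mismatch("Q4_K_M", "trt-llm"):
    print("Low confidence: switch to llama.cpp/MLX or to a native FP8/INT8/INT4 quant.")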

Expert load imbalance assumption

Known

For MoE models we assume worst-case expert routing — top-k experts always active, every step. Real production deployments use load-balanced routing (token distillation, expert-parallel scheduling) that's more efficient than worst-case. Our prediction is therefore pessimistic for well-tuned MoE deployments.

Numerical impact

  • MoE decode TPS predictions can underestimate throughput by 10–30% for production deployments with mature load balancing.

When to trust

  • Treat MoE decode TPS as a lower bound. Real deployments with kernel-level expert parallelism see better numbers.
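
One simplified way to read the worst-case assumption, expressed through the decode formula's denominator: if top-k experts fire on every layer at every step, the streamed bytes per step (and therefore the predicted TPS) sit at the pessimistic extreme. All names and numbers below are illustrative:

def moe_streamed_gb_per_step(shared_gb: float, per_expert_gb: float,
                             n_layers: int, top_k: int) -> float:
    """GB streamed per decode step under worst-case routing: top-k experts on every layer."""
    return shared_gb + n_layers * top_k * per_expert_gb

worst_case = moe_streamed_gb_per_step(shared_gb=20.0, per_expert_gb=0.05, n_layers=60, top_k=8)
# A well-balanced router streams less than this, so real decode TPS can land 10-30% higher.
print(f"{worst_case:.1f} GB/step -> the predicted decode TPS is a lower bound")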

Cache-pressure penalty under TP/PP

Roadmap

The cache-pressure-penalty formula applies a single cluster-wide utilisation factor uniformly across shards. Per-shard utilisation may differ from the cluster average, especially under uneven TP/PP partitioning. The current implementation is conservative; it errs toward predicting more pressure than real deployments see.

Numerical impact

  • TP=8 / PP=2 deployments may see 5–15% better real throughput than the calculator predicts when the partition is balanced.

When to trust

  • For balanced TP topologies (TP ∈ {2, 4, 8} on uniform GPUs), treat the prediction as conservative.
  • For imbalanced setups (e.g. TP=3, PP=2), the penalty model is harder to reason about — flagged for revisit.

Methodology in 60 seconds

The four formulas the entire site is built from. Each line links to the corresponding source function in the library.

Memory
weights + KV·max_seq·concurrent + activation_overhead ≤ vram
Prefill TPS
(peak_tflops · mfu_prefill) / flops_per_token
Decode TPS
bandwidth_eff · hbm_bandwidth / streamed_bytes_per_step
Cost
(in/prefill_tps + out/decode_tps_aggregate) · gpu_$/s
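
A direct transcription of the four formulas into Python. Function and parameter names are illustrative; the linked source functions in the library are the authoritative versions:

def fits_in_memory(weights_gb, kv_gb_per_token, max_seq, concurrent,
                   activation_overhead_gb, vram_gb):
    """Memory: weights + KV·max_seq·concurrent + activation_overhead ≤ vram."""
    return weights_gb + kv_gb_per_token * max_seq * concurrent + activation_overhead_gb <= vram_gb

def prefill_tps(peak_tflops, mfu_prefill, flops_per_token):
    """Prefill TPS: (peak_tflops · mfu_prefill) / flops_per_token, with TFLOPs converted to FLOPs."""
    return peak_tflops * 1e12 * mfu_prefill / flops_per_token

def decode_tps(bandwidth_eff, hbm_bandwidth_gbs, streamed_gb_per_step):
    """Decode TPS: bandwidth_eff · hbm_bandwidth / streamed_bytes_per_step."""
    return bandwidth_eff * hbm_bandwidth_gbs / streamed_gb_per_step

def cost_per_request(in_tokens, out_tokens, prefill_tps, decode_tps_aggregate, gpu_dollars_per_s):
    """Cost: (in/prefill_tps + out/decode_tps_aggregate) · gpu_$/s."""
    return (in_tokens / prefill_tps + out_tokens / decode_tps_aggregate) * gpu_dollars_per_s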

Reproduce this yourself

Three commands regenerate the table from scratch against your local environment. Most calculators don't ship their validation harness; this one does.

01  git clone https://github.com/agenticloops/tokenomy
02  cd tokenomy && pip install -e ".[dev]"
03  python scripts/validate.py

Want to enforce ±30% in CI? Run pytest -m slow.

The data lives in lib/src/tokenomy/data/benchmarks/public_benchmarks.json; it's editable, and PRs are welcome.

What this isn't, and where to go for it

The sources below are what we treat as ground truth when adding new entries to the benchmark table.

Live benchmark measurements on real endpoints

When you need the empirical number, not the prediction.

Vendor-measured optimal configurations

Vendor-tuned numbers; treat as upper-bound goalposts.

Community measurements on consumer GPUs

Where to go before vendor data exists: r/LocalLLaMA and Hardware Corner.

Found a discrepancy? Tell us.

A row that's wildly off, a missing hardware preset, a benchmark we should track. PRs and issues are welcome.