Validation

Predicted vs measured throughput on public benchmarks. Sources are clickable on every row; known failure modes are named below.

Coverage at a glance

[Coverage summary table, generated from the benchmark data.]

Confidence levels

Every confidence badge on /calculate and /compare links here. The level now reflects direct benchmark support, same-regime support, or lack of support, not generic hand-wavy heuristics.

Level | Configuration profile | Target | Failure modes
High | Dense model · NVIDIA datacenter GPU · FP16/BF16/FP8/INT8 weights · vLLM or TRT-LLM · batch ≥ 8 | ±30% | None known in this regime
Medium | GGUF on llama.cpp · Apple Silicon · MoE batch ≥ 16 · hybrid attention · TP=8 · context ≤ 8K | ±50% | MFU calibration · expert load imbalance
Low | MoE batch < 8 · GGUF on TRT-LLM · context > 8K · MFU/bw_eff overridden far from preset | None | See known failure modes
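
A minimal sketch of how these profiles could be turned into a badge. The field names and rules below are a simplified reading of the table, not tokenomy's actual code:

def confidence(dense: bool, nvidia_dc_gpu: bool, gguf: bool, engine: str,
               batch: int, context: int, preset_overridden: bool) -> str:
    """Map a configuration to a confidence level per the table above (simplified)."""
    low = (
        (not dense and batch < 8)           # MoE at small batch
        or (gguf and engine == "trt-llm")   # GGUF quant on a datacenter engine
        or context > 8_000                  # beyond the 8K context cap
        or preset_overridden                # MFU/bw_eff far from preset
    )
    if low:
        return "low"
    if dense and nvidia_dc_gpu and not gguf and engine in {"vllm", "trt-llm"} and batch >= 8:
        return "high"    # ±30% target
    return "medium"      # ±50% target

print(confidence(dense=True, nvidia_dc_gpu=True, gguf=False, engine="vllm",
                 batch=32, context=4_096, preset_overridden=False))  # high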

Benchmarks

[Benchmark table: predicted vs. measured throughput, one row per public benchmark, with sources.]

Known failure modes

Where the math breaks, named. Each card explains the regime, the numerical impact, and how to read predictions in that zone. Out-of-target ⚠ icons in the table above link directly to the matching card.

MoE small-batch decode

Partially mitigated

Our bandwidth-formula prediction overestimates throughput at low concurrency for MoE models because expert-routing dispatch and kernel-launch overhead aren't modelled. The simulator assumes weights stream at full HBM bandwidth on every step; in practice MoE engines spend much of every step bouncing tokens between experts.

Numerical impact

  • DeepSeek-V3 single-user: simulator predicts ~46k tok/s aggregate, measured ~620 tok/s — about 75× overshoot.
  • PRD v2's two-component decode formula reduces the gap (single-user: v1 37× off → v2 ~11× off).

When to trust

  • Treat the prediction as an upper bound until concurrency reaches the engine's batch-fill threshold (vLLM: ~16; TRT-LLM: ~8).
  • At batch ≥ 16 the formula converges back into the ±50% medium-confidence band.
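
A minimal sketch of that upper-bound reading. The thresholds come from the bullets above; the function is illustrative, not the tokenomy API:

BATCH_FILL = {"vllm": 16, "trt-llm": 8}  # engine batch-fill thresholds from above

def interpret_moe_decode(predicted_tps: float, engine: str, batch: int) -> str:
    """Label a bandwidth-formula MoE decode prediction at a given concurrency."""
    if batch < BATCH_FILL.get(engine, 16):
        # Expert-routing dispatch and kernel-launch overhead are unmodelled here,
        # so the number is a ceiling, not an estimate.
        return f"<= {predicted_tps:,.0f} tok/s (upper bound only)"
    return f"~{predicted_tps:,.0f} tok/s (within the ±50% medium-confidence band)"

print(interpret_moe_decode(46_000, "vllm", batch=1))   # upper bound only
print(interpret_moe_decode(46_000, "vllm", batch=32))  # converged regime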

Underspecified vendor benchmarks

Known

Some vendor blogs publish a tok/s number without ISL, OSL, or concurrency, leaving the prediction-vs-measured comparison ambiguous. We estimate the missing fields from context, then flag the row's confidence as low. Treat these as sanity bounds, not precise targets.

Numerical impact

  • vllm-llama3.1-8b-bf16-h100-chat: ISL/OSL guessed from 'chat workload' wording; predicted/measured gap ~45%.

When to trust

  • Compare at the order-of-magnitude level only. If our prediction is within 2× of the published number, the benchmark is informative; if not, treat it as a known unknown (a sketch of this check follows below).
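
A sketch of that 2× check, with hypothetical names and illustrative numbers (not the tokenomy API):

def is_informative(predicted_tps: float, published_tps: float) -> bool:
    """True when prediction and published number agree within a factor of 2 either way."""
    ratio = max(predicted_tps, published_tps) / min(predicted_tps, published_tps)
    return ratio <= 2.0

# Illustrative: a ~45% gap, as in the chat-workload row above, is well inside 2x.
print(is_informative(predicted_tps=14_500, published_tps=10_000))  # True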

Long-context (> 8K) accuracy degradation

Roadmap

Beyond 8K context the KV-budget cap kicks in, the activation-memory heuristic underestimates real overheads, and the attention regime can flip from bandwidth- to compute-bound. PRD v2 acknowledges this as a parking-lot item.

Numerical impact

  • Predictions for context > 8K may underestimate latency by 30–60% depending on attention type.
  • Memory-fit estimates for long contexts are conservative; the calculator may report 'doesn't fit' for workloads that engines actually serve via paged attention.

When to trust

  • Stay within 8K context for high-confidence predictions.
  • For long-context workloads, treat decode TPS as an upper bound and TTFT as a lower bound.

Apple Silicon MFU calibration

Known

Metal-MLX and Metal-llama.cpp presets are educated guesses pieced together from community measurements, not vendor datasheets. Apple doesn't publish kernel-level efficiency for the M-series the way NVIDIA does for H100/H200, so the MFU and bandwidth-efficiency figures in our presets carry meaningful uncertainty.

Numerical impact

  • Expect ±25–50% accuracy on M-series predictions, larger than the high-confidence ±30% target.
  • The error is approximately symmetric — sometimes faster, sometimes slower than predicted.

When to trust

  • Use the prediction to compare two M-series configurations relatively (M3 Max vs M4 Max). Absolute numbers should be cross-checked against r/LocalLLaMA threads.
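
A sketch of why the relative comparison holds up: both configurations share the same uncertain bandwidth-efficiency preset, so that factor cancels in the ratio. The bandwidth figures are approximate published peaks and the streamed-bytes value is illustrative:

def decode_tps(memory_bandwidth_gbs: float, bandwidth_eff: float, streamed_gb_per_step: float) -> float:
    """Decode TPS via the bandwidth formula from 'Methodology in 60 seconds'."""
    return memory_bandwidth_gbs * bandwidth_eff / streamed_gb_per_step

bw_eff = 0.7  # the uncertain M-series calibration factor, shared by both configs
m3_max = decode_tps(400, bw_eff, streamed_gb_per_step=4.5)
m4_max = decode_tps(546, bw_eff, streamed_gb_per_step=4.5)

print(m4_max / m3_max)  # ~1.37, and unchanged whether bw_eff is 0.5 or 0.8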

GGUF formats on datacenter engines

Known

GGUF quantisation formats (Q4_K_M, Q5_K_S, etc.) are llama.cpp-shaped — designed for CPU+CUDA mixed execution with specific block layouts. Datacenter engines (vLLM, TRT-LLM) use native FP8/INT8/INT4 kernels with completely different memory access patterns. Pairing a GGUF quant with a datacenter engine produces a prediction that has no real-world counterpart.

Numerical impact

  • Predictions with mismatched quant/engine pairings are not validated; we publish a low-confidence badge and recommend swapping to a native quant.

When to trust

  • Don't. If the calculator surfaces this combination, switch the engine to llama.cpp/MLX or switch the quant to a datacenter-native format (FP8, INT8, INT4 native).
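
A sketch of the mismatch check this card describes; the sets below are illustrative, not tokenomy's preset lists:

GGUF_QUANTS = {"Q4_K_M", "Q5_K_S", "Q8_0"}   # llama.cpp-shaped formats
DATACENTER_ENGINES = {"vllm", "trt-llm"}      # engines with native FP8/INT8/INT4 kernels

def quant_engine_mismatch(quant: str, engine: str) -> bool:
    """True when a GGUF quant is paired with a datacenter engine."""
    return quant in GGUF_QUANTS and engine in DATACENTER_ENGINES

if quant_engine_mismatch("Q4_K_M", "trt-llm"):
    print("Low confidence: switch to llama.cpp/MLX or to a native FP8/INT8/INT4 quant.")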

Expert load imbalance assumption

Known

For MoE models we assume worst-case expert routing — top-k experts always active, every step. Real production deployments use load-balanced routing (token distillation, expert-parallel scheduling) that's more efficient than worst-case. Our prediction is therefore pessimistic for well-tuned MoE deployments.

Numerical impact

  • MoE decode TPS predictions can underestimate throughput by 10–30% for production deployments with mature load balancing.

When to trust

  • Treat MoE decode TPS as a lower bound. Real deployments with kernel-level expert parallelism see better numbers.
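
One simplified way to read the worst-case assumption, expressed through the decode formula's denominator: if top-k experts fire on every layer at every step, the streamed bytes per step (and therefore the predicted TPS) sit at the pessimistic extreme. All names and numbers below are illustrative:

def moe_streamed_gb_per_step(shared_gb: float, per_expert_gb: float,
                             n_layers: int, top_k: int) -> float:
    """GB streamed per decode step under worst-case routing: top-k experts on every layer."""
    return shared_gb + n_layers * top_k * per_expert_gb

worst_case = moe_streamed_gb_per_step(shared_gb=20.0, per_expert_gb=0.05, n_layers=60, top_k=8)
# A well-balanced router streams less than this, so real decode TPS can land 10-30% higher.
print(f"{worst_case:.1f} GB/step -> the predicted decode TPS is a lower bound")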

Cache-pressure penalty under TP/PP

Roadmap

The cache-pressure-penalty formula applies a single cluster-wide utilisation factor uniformly across shards. Per-shard utilisation may differ from the cluster average, especially under uneven TP/PP partitioning. The current implementation is conservative; it errs toward predicting more pressure than real deployments see.

Numerical impact

  • TP=8 / PP=2 deployments may see 5–15% better real throughput than the calculator predicts when the partition is balanced.

When to trust

  • For balanced TP topologies (TP ∈ {2, 4, 8} on uniform GPUs), treat the prediction as conservative.
  • For imbalanced setups (e.g. TP=3, PP=2), the penalty model is harder to reason about — flagged for revisit.

Methodology in 60 seconds

The four formulas the entire site is built from. Each line links to the corresponding source function in the library.

Memory
weights + KV·max_seq·concurrent + activation_overhead ≤ vram
Prefill TPS
(peak_tflops · mfu_prefill) / flops_per_token
Decode TPS
bandwidth_eff · hbm_bandwidth / streamed_bytes_per_step
Cost
(in/prefill_tps + out/decode_tps_aggregate) · gpu_$/s
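
A direct transcription of the four formulas into Python. Function and parameter names are illustrative; the linked source functions in the library are the authoritative versions:

def fits_in_memory(weights_gb, kv_gb_per_token, max_seq, concurrent,
                   activation_overhead_gb, vram_gb):
    """Memory: weights + KV·max_seq·concurrent + activation_overhead ≤ vram."""
    return weights_gb + kv_gb_per_token * max_seq * concurrent + activation_overhead_gb <= vram_gb

def prefill_tps(peak_tflops, mfu_prefill, flops_per_token):
    """Prefill TPS: (peak_tflops · mfu_prefill) / flops_per_token, with TFLOPs converted to FLOPs."""
    return peak_tflops * 1e12 * mfu_prefill / flops_per_token

def decode_tps(bandwidth_eff, hbm_bandwidth_gbs, streamed_gb_per_step):
    """Decode TPS: bandwidth_eff · hbm_bandwidth / streamed_bytes_per_step."""
    return bandwidth_eff * hbm_bandwidth_gbs / streamed_gb_per_step

def cost_per_request(in_tokens, out_tokens, prefill_tps, decode_tps_aggregate, gpu_dollars_per_s):
    """Cost: (in/prefill_tps + out/decode_tps_aggregate) · gpu_$/s."""
    return (in_tokens / prefill_tps + out_tokens / decode_tps_aggregate) * gpu_dollars_per_s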

Reproduce this yourself

Three commands regenerate the table from scratch against your local environment. Most calculators don't ship their validation harness; this one does.

01  git clone https://github.com/agenticloops/tokenomy
02  cd tokenomy && pip install -e ".[dev]"
03  python scripts/validate.py

Want to enforce ±30% in CI? Run pytest -m slow.

The data lives in lib/src/tokenomy/data/benchmarks/public_benchmarks.json; it's editable, and PRs are welcome.

What this isn't, and where to go for it

The sources below are what we treat as ground truth when adding new entries to the benchmark table.

Live benchmark measurements on real endpoints

When you need the empirical number, not the prediction.

Vendor-measured optimal configurations

Vendor-tuned numbers; treat as upper-bound goalposts.

Community measurements on consumer GPUs

Where to go before vendor data exists: r/LocalLLaMA and Hardware Corner.

Found a discrepancy? Tell us.

A row that's wildly off, a missing hardware preset, a benchmark we should track. PRs and issues are welcome.