MoE small-batch decode
Partially mitigated.
Our bandwidth-formula prediction overestimates throughput at low concurrency for MoE models because expert-routing dispatch and kernel-launch overhead aren't modelled. The simulator assumes weights stream at full HBM bandwidth on every step; in practice, MoE engines spend much of each step bouncing tokens between experts.
Numerical impact
- DeepSeek-V3 single-user: simulator predicts ~46k tok/s aggregate, measured ~620 tok/s — about 75× overshoot.
- PRD v2's two-component decode formula reduces the gap (single-user: v1 37× off → v2 ~11× off).
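The shape of the v1 → v2 correction can be sketched with a toy model. Every constant below is illustrative, chosen only to make the arithmetic easy — none of them are the PRD's actual coefficients or DeepSeek-V3 measurements:

```python
# Toy two-component decode model. All constants are illustrative,
# not the PRD's coefficients or DeepSeek-V3 measurements.
FULL_STREAM = 1e-3       # s/step to stream all active weights from HBM (assumed)
PER_TOK_COMPUTE = 2e-3   # s/step per token once compute-bound (assumed)
OVERHEAD = 30e-3         # s/step of expert-routing dispatch + kernel launches (assumed)

def true_step_time(batch):
    # "Ground truth" in this toy: bandwidth/compute term plus a fixed
    # per-step overhead that never disappears.
    return max(FULL_STREAM, batch * PER_TOK_COMPUTE) + OVERHEAD

def v1_tok_per_s(batch):
    # v1: bandwidth/compute term only; the overhead component is missing.
    return batch / max(FULL_STREAM, batch * PER_TOK_COMPUTE)

def v2_tok_per_s(batch):
    # v2: two components -- the same base term plus per-step overhead.
    return batch / true_step_time(batch)

for b in (1, 16):
    truth = b / true_step_time(b)
    print(f"batch={b}: v1 overshoots {v1_tok_per_s(b) / truth:.1f}x")
# batch=1: v1 overshoots 16.0x
# batch=16: v1 overshoots 1.9x
```

The single-user overshoot is large because the fixed overhead dominates the tiny step time; at batch 16 the compute term dominates instead and the missing overhead matters far less. In this toy, v2 recovers the truth exactly because the overhead constant is known; in practice it has to be estimated, which is why v2 still lands ~11× off rather than on the measurement.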
When to trust
- Treat the prediction as an upper bound until concurrency reaches the engine's batch-fill threshold (vLLM: ~16; TRT-LLM: ~8).
- At or above the engine's batch-fill threshold, the formula converges back into the ±50% medium-confidence band.
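These rules can be folded into a small guard. A minimal sketch using the thresholds above — the function name and API are assumptions, not part of the simulator:

```python
# Batch-fill thresholds from the section above; below them the
# simulator's decode prediction should be read as an upper bound only.
BATCH_FILL = {"vllm": 16, "trt-llm": 8}

def decode_confidence(engine: str, batch: int) -> str:
    """Label a decode-throughput prediction for an engine/batch pair.

    Returns "medium" (the ±50% band) at or above the engine's
    batch-fill threshold, "upper-bound-only" below it.
    (Hypothetical helper, not the simulator's actual API.)
    """
    return "medium" if batch >= BATCH_FILL[engine] else "upper-bound-only"

print(decode_confidence("vllm", 16))    # medium
print(decode_confidence("trt-llm", 4))  # upper-bound-only
```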