Stage 4.3
System levers — batching, parallelism, engines
Part of the Stage 4 hub; Stages 4.1 and 4.2 are recommended reading first.
Memory and compute levers live in the model. System levers live in the serving stack. They raise the utilisation of the model you’ve already chosen — the same Llama-3-70B FP8 weights served by vLLM, TRT-LLM, or a naive PyTorch loop will deliver 2–4× different aggregate TPS at the same per-user latency. Three system choices matter: scheduling (batching), memory management (paged attention, prefix caching), and parallelism topology.
01 Continuous batching: amortise weights across users. §
Stage 1 explained that decode reads the full weight set for every token. Batching helps because that one read serves multiple users simultaneously: stream the weights once, decode N tokens across N concurrent streams. Aggregate TPS scales with batch up to a compute or capacity ceiling; per-user TPS stays roughly flat.
aggregate TPS rises with batch · per-user TPS stays roughly flat in the bandwidth regime · the cliff is when KV cache fills VRAM
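To put numbers on “stream the weights once”, a back-of-envelope sketch in Python. The constants are illustrative assumptions (roughly 70 GB of FP8 weights for Llama-3-70B, roughly 3.3 TB/s of HBM bandwidth on an H100), not measurements:

```python
# Back-of-envelope: decode in the bandwidth regime.
weight_bytes = 70e9      # FP8 weights, ~1 byte per parameter (assumed)
hbm_bandwidth = 3.3e12   # bytes/s, H100-class HBM (assumed)

step_time = weight_bytes / hbm_bandwidth  # one full weight read per decode step
per_user_tps = 1 / step_time              # ~47 tok/s, independent of batch size

for batch in (1, 8, 32, 128):
    # Until a compute or KV-capacity ceiling, the same weight read
    # serves every stream in the batch.
    print(f"batch {batch:>3}: aggregate ≈ {batch * per_user_tps:,.0f} tok/s")
```

Below the ceiling, aggregate TPS is close to linear in batch size; the cliff comes when the batch’s KV cache fills VRAM or its arithmetic saturates compute.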
Static batching used to mean “wait for K requests, batch them together, run until they all finish, return”. Modern engines do continuous batching instead: the batch is rebuilt at every decode step, so a finished stream’s slot is refilled from the queue immediately rather than idling until the slowest request in the batch drains. vLLM’s Orca-style scheduler made this the de-facto default in 2023, and everyone copied it. If you find a serving stack that doesn’t do continuous batching today, that’s the highest-impact upgrade to make.
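A minimal sketch of iteration-level scheduling, the idea behind continuous batching; this is not vLLM’s actual scheduler, and the `engine` interface here is a stand-in:

```python
from collections import deque

def continuous_batching_loop(queue: deque, engine, max_batch: int):
    """Illustrative iteration-level scheduler: the batch is rebuilt
    every step, so a finished stream's slot is refilled immediately."""
    running = []
    while queue or running:
        # Admit waiting requests into free slots at step granularity.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One decode step for the whole batch: one weight read, N tokens.
        finished = engine.step(running)
        # Retire completed streams; their slots free up this step,
        # not when the whole batch drains (the static-batching failure mode).
        running = [r for r in running if r not in finished]
```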
Continuous batching is the largest free win available: 2–4× aggregate TPS over static batching at no quality or memory cost. It’s the floor; an engine without it is missing a baseline feature, not making a tradeoff.
02 Paged attention and prefix caching. §
The KV cache is the second source of utilisation loss. Naive allocation gives each request a contiguous KV slab sized for max_context_length, even if the actual conversation is much shorter. The fragmentation tax is large at scale: the PagedAttention paper reports 60–80% of KV memory wasted in pre-paging serving systems.
Paged attention removes fragmentation; prefix caching reuses repeated prefills. Both are engine features; both are essentially free if you’re already on vLLM, TRT-LLM, SGLang, or similar.
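A minimal sketch of the paged-KV bookkeeping, assuming a fixed 16-token block size (vLLM’s default); real engines add copy-on-write and prefix sharing on top:

```python
class PagedKVAllocator:
    """Illustrative paged-KV bookkeeping: each request holds a list of
    fixed-size block IDs instead of one slab sized for max_context_length."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical KV blocks in VRAM
        self.block_tables = {}                      # request id -> [block ids]
        self.lengths = {}                           # request id -> tokens held

    def append_token(self, req_id):
        """Grow a request's KV cache by one token, allocating a new
        block only when the current one fills up."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block full, or first token
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

alloc = PagedKVAllocator(num_blocks=1024)
for _ in range(40):                 # a 40-token stream occupies
    alloc.append_token("req-0")     # ceil(40 / 16) = 3 blocks, not a max-length slab
```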
03 Parallelism: how to host a model that doesn’t fit. §
When the model doesn’t fit on one GPU, there are four classical answers: tensor parallelism (TP), pipeline parallelism (PP), data parallelism (DP), and expert parallelism (EP).
| Topology | Fits | Decode (per user) | Max concurrent | $/M output |
|---|---|---|---|---|
| 1× H100 (single) | weights ≤ 80 GB | — | — | — |
| 2× H100 (TP) | split per layer | — | — | — |
| 4× H100 (TP) | split per layer | — | — | — |
| 8× H100 (TP) | split per layer | — | — | — |
TP fits a 405B model on multiple GPUs · decode TPS does not scale linearly because comms steals from compute · concurrent capacity rises because there's more total VRAM
For dense inference, TP is the only one that matters in practice. TP splits each weight matrix across GPUs so every layer runs in parallel, with an all-reduce after the matmul. Per-GPU memory drops by a factor of N. Decode TPS does not rise by N because all-reduce comms steal cycles from compute — a typical TP-2 / TP-4 / TP-8 run lands at roughly 1.6× / 2.5× / 4× the TP-1 decode TPS, not 2× / 4× / 8×.
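To see where those sub-linear numbers come from, a toy model: per-step time is the weight read divided by N, plus an all-reduce cost that does not shrink with N. The overhead constant is an assumption tuned to roughly reproduce the 1.6× / 2.5× / 4× shape above, not a measured value:

```python
def tp_decode_speedup(n_gpus: int, comm_overhead: float = 0.125) -> float:
    """Toy TP scaling model: compute time shrinks by N, but each decode
    step pays an all-reduce cost (as a fraction of TP-1 compute time)
    that stays roughly flat once two or more GPUs are talking."""
    compute = 1.0 / n_gpus
    comms = comm_overhead if n_gpus > 1 else 0.0
    return 1.0 / (compute + comms)

for n in (1, 2, 4, 8):
    print(f"TP-{n}: {tp_decode_speedup(n):.1f}x decode TPS vs TP-1")
    # prints roughly 1.0x / 1.6x / 2.7x / 4.0x
```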
PP exists but is rarely a first choice for inference: the pipeline bubble adds latency, and the critical path of one decode step still traverses the full model. Inference deployments use PP only when TP would over-shard the hidden dimension (e.g., for very deep, very thin models). DP is mostly a training concept; for inference, “DP” is usually realised as multiple replicas of the deployment behind a load balancer, not a parallelism strategy inside one engine.
EP is MoE-specific (covered in Stage 4.2). The takeaway: for dense models, TP is the answer — at the smallest N that fits the model, leaving DP-style replication outside the engine for horizontal scale.
TP at the minimum N that fits, replicas for horizontal scale. PP only in rare model geometries. EP only for MoE in NVLink fabrics. The decision tree is short on purpose.
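That tree, written out as a sketch; the thresholds are illustrative and ignore KV-cache headroom, which real sizing must add on top of the weights:

```python
def pick_parallelism(weight_gb: float, gpu_vram_gb: float = 80.0,
                     is_moe: bool = False) -> str:
    """Illustrative decision tree from this section. Real sizing must
    also leave VRAM headroom for KV cache and activations."""
    if is_moe:
        return "EP (plus TP as needed) inside an NVLink fabric; see Stage 4.2"
    # TP at the smallest N that fits; scale out with replicas, not bigger TP.
    for tp in (1, 2, 4, 8):
        if weight_gb <= tp * gpu_vram_gb:
            return f"TP-{tp}, with whole replicas behind a load balancer for scale"
    return "PP or multi-node TP; rare model geometries only"

print(pick_parallelism(70))   # Llama-3-70B FP8 -> TP-1
print(pick_parallelism(405))  # 405B FP8       -> TP-8
```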
04 The serving engine landscape. §
Three engines hold most of the production market in 2026: vLLM, NVIDIA TRT-LLM, and SGLang.
- vLLM is the open-source default: paged attention, continuous batching, prefix caching, speculative decoding, full Hugging Face integration. Best general-purpose choice; widest hardware support; biggest community.
- TRT-LLM is NVIDIA’s stack: peak per-GPU performance on H100/H200/B200, FP8 native, excellent for production where the SLO demands the last 10–20% of the bandwidth equation. NVIDIA-only, with the tightest hardware integration of the three.
- SGLang is a research-driven engine focused on structured generation, agent workloads, and very high throughput for prefix-heavy traffic. Aggressive prefix caching, RadixAttention for shared-prefix groups; the right pick when you’re running agent/RAG workloads.
The choice is rarely about raw decode TPS — all three are within ~15–25% of each other at peak. It’s about (a) hardware coverage, (b) workload fit (agent vs chat vs RAG), and (c) operational tradeoffs (open-source ecosystem vs vendor support).
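For concreteness, a minimal vLLM example of the knobs this page has covered (TP degree, prefix caching, FP8). Argument names match vLLM’s Python API at the time of writing and may shift between versions:

```python
from vllm import LLM, SamplingParams

# Continuous batching and paged attention are on by default in vLLM;
# TP degree and prefix caching are explicit knobs.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # TP at the smallest N that fits
    enable_prefix_caching=True,   # reuse repeated prefills (agent/RAG traffic)
    quantization="fp8",           # dynamic FP8, mirroring the intro's example
)
outputs = llm.generate(
    ["Summarise continuous batching in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```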
Three engines, all production-ready. Pick by workload shape and operational fit, not by benchmark headlines. The headlines are usually within noise.
05 Folklore the equation refuses. §
What's next
Three toolbox sub-pages done; one stage left. Stage 5 walks the full design loop: pick the binding constraint, pick the model, pick the hardware, pick the engine, validate the prediction, ship.