Stage 4.3
System levers — batching, parallelism, engines
Part of the Stage 4 hub; Stages 4.1 and 4.2 are recommended reading first.
Memory and compute levers live in the model. System levers live in the serving stack. They raise the utilisation of the model you’ve already chosen — the same Llama-3-70B FP8 weights served by vLLM, TRT-LLM, or a naive PyTorch loop will deliver 2–4× different aggregate TPS at the same per-user latency. Three system choices matter: scheduling (batching), memory management (paged attention, prefix caching), and parallelism topology.
01 Continuous batching: amortise weights across users. §
Stage 1 explained that decode reads the full weight set for every token. Batching helps because that one read serves multiple users simultaneously: stream the weights once, decode N tokens across N concurrent streams. Aggregate TPS scales with batch up to a compute or capacity ceiling; per-user TPS stays roughly flat.
aggregate TPS rises with batch · per-user TPS stays roughly flat in the bandwidth regime · the cliff is when KV cache fills VRAM
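To put numbers on “stream the weights once”, a back-of-envelope sketch in Python. The constants are illustrative assumptions (roughly 70 GB of FP8 weights for Llama-3-70B, roughly 3.3 TB/s of HBM bandwidth on an H100), not measurements:

```python
# Back-of-envelope: decode in the bandwidth regime.
weight_bytes = 70e9      # FP8 weights, ~1 byte per parameter (assumed)
hbm_bandwidth = 3.3e12   # bytes/s, H100-class HBM (assumed)

step_time = weight_bytes / hbm_bandwidth  # one full weight read per decode step
per_user_tps = 1 / step_time              # ~47 tok/s, independent of batch size

for batch in (1, 8, 32, 128):
    # Until a compute or KV-capacity ceiling, the same weight read
    # serves every stream in the batch.
    print(f"batch {batch:>3}: aggregate ≈ {batch * per_user_tps:,.0f} tok/s")
```

Below the ceiling, aggregate TPS is close to linear in batch size; the cliff comes when the batch’s KV cache fills VRAM or its arithmetic saturates compute.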
Static batching used to mean “wait for K requests, batch them together, run until they all finish, return”. Modern engines do continuous batching instead: the batch is rebuilt at every decode step, so a finished stream’s slot is refilled from the queue immediately rather than idling until the slowest request in the batch drains. vLLM’s Orca-style scheduler made this the de-facto default in 2023, and everyone copied it. If you find a serving stack that doesn’t do continuous batching today, that’s the highest-impact upgrade to make.
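A minimal sketch of iteration-level scheduling, the idea behind continuous batching; this is not vLLM’s actual scheduler, and the `engine` interface here is a stand-in:

```python
from collections import deque

def continuous_batching_loop(queue: deque, engine, max_batch: int):
    """Illustrative iteration-level scheduler: the batch is rebuilt
    every step, so a finished stream's slot is refilled immediately."""
    running = []
    while queue or running:
        # Admit waiting requests into free slots at step granularity.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        # One decode step for the whole batch: one weight read, N tokens.
        finished = engine.step(running)
        # Retire completed streams; their slots free up this step,
        # not when the whole batch drains (the static-batching failure mode).
        running = [r for r in running if r not in finished]
```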
Continuous batching is the largest free win available: 2–4× aggregate TPS over static batching at no quality or memory cost. It’s the floor; an engine without it is missing a baseline feature, not making a tradeoff.
02 Paged attention and prefix caching. §
The KV cache is the second source of utilisation loss. Naive allocation gives each request a contiguous KV slab sized for max_context_length, even if the actual conversation is much shorter. The fragmentation tax is large at scale: the PagedAttention paper reports 60–80% of KV memory wasted in pre-paging serving systems.
Paged attention removes fragmentation; prefix caching reuses repeated prefills. Both are engine features; both are essentially free if you’re already on vLLM, TRT-LLM, SGLang, or similar.
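A minimal sketch of the paged-KV bookkeeping, assuming a fixed 16-token block size (vLLM’s default); real engines add copy-on-write and prefix sharing on top:

```python
class PagedKVAllocator:
    """Illustrative paged-KV bookkeeping: each request holds a list of
    fixed-size block IDs instead of one slab sized for max_context_length."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical KV blocks in VRAM
        self.block_tables = {}                      # request id -> [block ids]
        self.lengths = {}                           # request id -> tokens held

    def append_token(self, req_id):
        """Grow a request's KV cache by one token, allocating a new
        block only when the current one fills up."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # current block full, or first token
            self.block_tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1

    def release(self, req_id):
        """Return a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.lengths.pop(req_id, None)

alloc = PagedKVAllocator(num_blocks=1024)
for _ in range(40):                 # a 40-token stream occupies
    alloc.append_token("req-0")     # ceil(40 / 16) = 3 blocks, not a max-length slab
```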
03 Parallelism: how to host a model that doesn’t fit. §
When the model doesn’t fit on one GPU, there are four classical answers: tensor parallelism (TP), pipeline parallelism (PP), data parallelism (DP), and expert parallelism (EP).
| Topology | Fits | Decode (per user) | Max concurrent | $/M output |
|---|---|---|---|---|
| 1× H100 (single) | weights ≤ 80 GB | — | — | — |
| 2× H100 (TP) | split per layer | — | — | — |
| 4× H100 (TP) | split per layer | — | — | — |
| 8× H100 (TP) | split per layer | — | — | — |
TP fits a 405B model on multiple GPUs · decode TPS does not scale linearly because comms steals from compute · concurrent capacity rises because there's more total VRAM
For dense inference, TP is the only one that matters in practice. TP splits each weight matrix across GPUs so every layer runs in parallel, with an all-reduce after the matmul. Per-GPU memory drops by a factor of N. Decode TPS does not rise by N because all-reduce comms steal cycles from compute — a typical TP-2 / TP-4 / TP-8 run lands at roughly 1.6× / 2.5× / 4× the TP-1 decode TPS, not 2× / 4× / 8×.
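To see where those sub-linear numbers come from, a toy model: per-step time is the weight read divided by N, plus an all-reduce cost that does not shrink with N. The overhead constant is an assumption tuned to roughly reproduce the 1.6× / 2.5× / 4× shape above, not a measured value:

```python
def tp_decode_speedup(n_gpus: int, comm_overhead: float = 0.125) -> float:
    """Toy TP scaling model: compute time shrinks by N, but each decode
    step pays an all-reduce cost (as a fraction of TP-1 compute time)
    that stays roughly flat once two or more GPUs are talking."""
    compute = 1.0 / n_gpus
    comms = comm_overhead if n_gpus > 1 else 0.0
    return 1.0 / (compute + comms)

for n in (1, 2, 4, 8):
    print(f"TP-{n}: {tp_decode_speedup(n):.1f}x decode TPS vs TP-1")
    # prints roughly 1.0x / 1.6x / 2.7x / 4.0x
```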
PP exists but is rarely a first choice for inference: the pipeline bubble adds latency, and the critical path of one decode step still traverses the full model. Inference deployments use PP only when TP would over-shard the hidden dimension (e.g., for very deep, very thin models). DP is mostly a training concept; for inference, “DP” is usually realised as multiple replicas of the deployment behind a load balancer, not a parallelism strategy inside one engine.
EP is MoE-specific (covered in Stage 4.2). The takeaway: for dense models, TP is the answer — at the smallest N that fits the model, leaving DP-style replication outside the engine for horizontal scale.
TP at the minimum N that fits, replicas for horizontal scale. PP only in rare model geometries. EP only for MoE in NVLink fabrics. The decision tree is short on purpose.
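That tree, written out as a sketch; the thresholds are illustrative and ignore KV-cache headroom, which real sizing must add on top of the weights:

```python
def pick_parallelism(weight_gb: float, gpu_vram_gb: float = 80.0,
                     is_moe: bool = False) -> str:
    """Illustrative decision tree from this section. Real sizing must
    also leave VRAM headroom for KV cache and activations."""
    if is_moe:
        return "EP (plus TP as needed) inside an NVLink fabric; see Stage 4.2"
    # TP at the smallest N that fits; scale out with replicas, not bigger TP.
    for tp in (1, 2, 4, 8):
        if weight_gb <= tp * gpu_vram_gb:
            return f"TP-{tp}, with whole replicas behind a load balancer for scale"
    return "PP or multi-node TP; rare model geometries only"

print(pick_parallelism(70))   # Llama-3-70B FP8 -> TP-1
print(pick_parallelism(405))  # 405B FP8       -> TP-8
```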
04 The serving engine landscape. §
Three engines hold most of the production market in 2026: vLLM, NVIDIA TRT-LLM, and SGLang.
- vLLM is the open-source default: paged attention, continuous batching, prefix caching, speculative decoding, full Hugging Face integration. Best general-purpose choice; widest hardware support; biggest community.
- TRT-LLM is NVIDIA’s stack: peak per-GPU performance on H100/H200/B200, FP8 native, excellent for production where the SLO demands the last 10–20% of the bandwidth equation. NVIDIA-only, with the tightest hardware integration of the three.
- SGLang is a research-driven engine focused on structured generation, agent workloads, and very high throughput for prefix-heavy traffic. Aggressive prefix caching, RadixAttention for shared-prefix groups; the right pick when you’re running agent/RAG workloads.
The choice is rarely about raw decode TPS — all three are within ~15–25% of each other at peak. It’s about (a) hardware coverage, (b) workload fit (agent vs chat vs RAG), and (c) operational tradeoffs (open-source ecosystem vs vendor support).
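For concreteness, a minimal vLLM example of the knobs this page has covered (TP degree, prefix caching, FP8). Argument names match vLLM’s Python API at the time of writing and may shift between versions:

```python
from vllm import LLM, SamplingParams

# Continuous batching and paged attention are on by default in vLLM;
# TP degree and prefix caching are explicit knobs.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,       # TP at the smallest N that fits
    enable_prefix_caching=True,   # reuse repeated prefills (agent/RAG traffic)
    quantization="fp8",           # dynamic FP8, mirroring the intro's example
)
outputs = llm.generate(
    ["Summarise continuous batching in one sentence."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```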
Three engines, all production-ready. Pick by workload shape and operational fit, not by benchmark headlines. The headlines are usually within noise.
05 Folklore the equation refuses. §
What's next
Three toolbox sub-pages done; one stage left. Stage 5 walks the full design loop: pick the binding constraint, pick the model, pick the hardware, pick the engine, validate the prediction, ship.