Stage 5

Given my SLO and budget, what should I run?

Builds on Stages 1–4. Bring the bandwidth and capacity equations and the toolbox decision tree.

Earlier stages built the law. This stage uses it. We’ll pick a real-shape product spec — the kind that lands in a planning doc — and walk to a defensible deployment with the same library that powers the calculator. Every decision is one line of arithmetic. None of it requires measuring the system in production first.

01 Designing is naming the binding constraint. §

Every inference deployment has three resources that can run out: bandwidth (per-user latency), capacity (concurrent users at a given context), and budget (dollars per million tokens). The deployment that ships is the one where all three constraints are met simultaneously. The problem is that the three pull against each other. A larger model raises capability and bandwidth cost; a longer context raises capacity cost; a tighter budget caps both.

The mental move is to pick the constraint that’s most likely to bind first — the one most sensitive to the workload — and design backwards from there. For chat workloads it’s usually bandwidth (latency). For agents and RAG it’s usually capacity (long contexts, many users). For batch inference it’s almost always cost ($/M).

A rule that has held up well: if the SLO is p95 first-token-time, the binding constraint is prefill compute. If it’s p95 inter-token-latency, it’s decode bandwidth. If it’s concurrent users at long context, it’s capacity. If it’s $/M output token, it’s all three at once and you need the calculator.
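Compressed into code, the rule is a lookup, not a model. A minimal sketch — the SLO labels and function name are illustrative, not part of the calculator library:

```python
def binding_constraint(slo: str) -> str:
    """Map the SLO a spec leads with to the constraint that usually binds first."""
    rules = {
        "p95_first_token_time":        "prefill compute",
        "p95_inter_token_latency":     "decode bandwidth",
        "concurrency_at_long_context": "capacity (KV-cache VRAM)",
        "cost_per_million_output":     "all three at once -- run the calculator",
    }
    return rules.get(slo, "unnamed SLO -- name it before designing")

print(binding_constraint("p95_inter_token_latency"))   # -> decode bandwidth
```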

Name the binding constraint first. Every later choice — model, precision, GPU, topology, engine — is a vote about that constraint, not an independent decision.

02 A worked spec: a coding agent product. §

Concrete scenario.

“We’re shipping a coding agent. Each user session has a 6K-token system prompt (tools, code conventions, recent context), a 1K-token user request, and produces 1K tokens of output on average. We need to serve 16 concurrent sessions per GPU node at peak. The product needs to match GPT-4o on quality. Inter-token latency p95 should be under 50ms. Hardware budget is ~$30/hour per node.”

Workload shape: 7K input + 1K output ≈ 8K total context, agent-mode (system-prompt-heavy). Concurrency: 16 streams. Latency: ITL ≤ 50ms means decode ≥ 20 tok/s per user. Quality: “GPT-4o-equivalent” — i.e., a 70B-class dense model or a strong MoE like DeepSeek-V3.

The latency budget is the first cut: 20 tok/s per user at 70B FP16 demands an H100-class GPU at minimum. The capacity budget is the second cut: 16 streams × 8K context × FP16 KV is around 40 GB just for KV cache, on top of weights. The budget is the third cut: at $30/hr that's a single H100 SXM (which fits the model at FP8) or two H100s (which fit it at FP16 with TP, but eat most of the hourly budget).
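Each cut is one line of arithmetic. A sketch of the numbers above, assuming a 70B-class GQA layout (80 layers, 8 KV heads, head dim 128) for the per-token KV figure — the library reads these from the model config:

```python
# First-cut arithmetic for the coding-agent spec.
itl_p95_s = 0.050                                     # spec: p95 inter-token latency
decode_floor = 1 / itl_p95_s                          # -> 20 tok/s per user

layers, kv_heads, head_dim = 80, 8, 128               # 70B-class GQA layout (assumed)
kv_per_token = layers * kv_heads * head_dim * 2 * 2   # K and V, 2 bytes at FP16 ≈ 320 KB

concurrent, context = 16, 8_192
kv_cache_gb = concurrent * context * kv_per_token / 1e9

print(f"decode floor    : {decode_floor:.0f} tok/s per user")
print(f"KV per token    : {kv_per_token / 1024:.0f} KB at FP16")
print(f"KV cache at peak: {kv_cache_gb:.0f} GB on top of weights")   # ~43 GB
```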

That’s three constraints binding hard. Now the toolbox.

Workload shape: 8K context, 16 concurrent, 20+ tok/s decode, ~$30/hr. Three constraints binding hard; choices are scarce.

03 Pick the model: capability, then math. §

Quality match comes first. Independent benchmarks place Llama-3-70B and DeepSeek-V3 within a few percentage points of GPT-4o on most public evals; Qwen-2.5-72B is in the same band. Below that — Mixtral-8×22B, Mistral-Large, Llama-3-8B — you start to see meaningful regressions on reasoning tasks.

Two real candidates: Llama-3-70B (dense, well-understood) and DeepSeek-V3 (MoE, more efficient per active param). Llama-3-70B is the reference dense answer. DeepSeek-V3 has stronger reasoning on the public benchmarks and per-active-param decode efficiency, but it's an 8-GPU deployment at minimum — well outside the $30/hr node budget here.

For the spec as written, Llama-3-70B is the lead candidate. The MoE alternative is worth considering only if the budget loosens to two nodes (reasonable for an 8× H100 SXM at $25/hr spot pricing).

Pick the smallest model that meets the quality bar, not the largest you can fit. The capability ceiling and the deployment budget rarely point at the same answer; quality wins the tie because customers can feel quality more than they can feel TPS.

04 Pick the precision and topology. §

With Llama-3-70B chosen, the next question is: FP16 or FP8 weights? FP16 needs 140 GB of weight memory — that’s two H100 SXM with TP-2 just to fit. FP8 halves the weights to ~70 GB, which fits on a single H100 SXM with ~5 GB of headroom for KV cache and activation overhead.

The decode equation says FP8 doubles per-user TPS over FP16 at the same hardware (because bytes-per-param halves). The quality regression for FP8 weights on Llama-3-70B is statistically zero on any public benchmark. So FP8 weights are the right starting point. The remaining question is whether one H100 is enough or you need TP-2.
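The doubling falls straight out of Stage 2's bandwidth equation: at batch 1, every generated token streams the weights once, so per-user TPS is bounded by bandwidth over weight bytes. A rough sketch using the H100 SXM spec-sheet bandwidth (~3.35 TB/s) and ignoring KV-cache reads, which shave a little off both numbers:

```python
params = 70e9
hbm_bandwidth_gb_s = 3_350          # H100 SXM HBM3, spec sheet

for label, bytes_per_param in (("FP16", 2), ("FP8", 1)):
    weight_gb = params * bytes_per_param / 1e9
    tps_ceiling = hbm_bandwidth_gb_s / weight_gb
    print(f"{label}: {weight_gb:.0f} GB weights -> ≤ {tps_ceiling:.0f} tok/s per user")
# FP16: 140 GB -> ~24 tok/s · FP8: 70 GB -> ~48 tok/s —
# FP8 doubles the ceiling and clears the 20 tok/s floor with margin.
```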

Interactive capacity widget — inputs: Model · GPU · Context · Concurrent; outputs: Total VRAM · Max concurrent at this context · Fit status.

free_vram = vram − weights − overhead − (kv_per_token · context · concurrent)

The widget is loaded with this scenario. Move the concurrent slider to 16 and the context to 8K — you’ll see ~5 GB of free VRAM after weights and overhead, and a “max concurrent at this context” that lands at ~3, not 16. Single-GPU FP8 fits the model, but capacity binds well before the spec’s concurrency target.

FP8 weights are the precision default — free quality, double the per-user TPS. Whether one H100 is enough depends on capacity, not bandwidth, and Beat 5 below shows the math.

05 Validate the capacity headroom. §

The spec asks for 16 concurrent at 8K context. The capacity equation says max concurrent is roughly free_vram / (kv_per_token × context). At 8K context with FP8 KV on Llama-3-70B (kv_per_token ≈ 160 KB), one user takes ~1.3 GB of KV. With ~5 GB of free VRAM after weights and overhead on a single H100, that’s only ~3 users — capacity binds well before the spec’s 16 streams. Bandwidth is fine; capacity is the wall.
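The same equation, swept over context length, reproduces the cliff the widget below draws. A sketch assuming the ~5 GB of free VRAM and ~160 KB/token figures from the text; the library derives both from the model config and engine overhead:

```python
free_vram = 5e9            # ~5 GB left on 1× H100 after FP8 weights + overhead (assumed)
kv_per_token = 160e3       # ~160 KB/token, Llama-3-70B with FP8 KV (assumed)

for context in (2_000, 8_000, 32_000, 128_000):
    users = int(free_vram // (kv_per_token * context))
    print(f"{context:>7,} ctx -> {users:>2} concurrent users")
# ~15 at 2K, ~3 at 8K, 0 beyond 32K — and 8K is roughly 5× short of the 16-stream spec.
```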

Interactive chart — Llama-3-70B · FP8 weights · FP8 KV · H100 SXM 80 GB · max concurrent users by context length, at 2,000 / 8,000 / 32,000 / 128,000 ctx.

same model · same GPU · only context length changed · capacity drops faster than linearly

Read the cliff at FP8 KV: 2K context fits a healthy double-digit number; 8K drops to a few users; 32K and above are off the cliff entirely on a single H100. Even if production traffic only ever hit 8K context, 1× H100 is short of the spec’s 16-stream target by 5×.

The fix is one of: (a) move to TP-2 (two H100s, weights split per layer; both VRAM and headroom roughly double per node, hourly doubles), (b) move to 1× H200 SXM (141 GB VRAM — ~76% bigger than H100, all in one card), or (c) truncate context lower than the spec asks (quality regression, last resort).
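Both fixes can be sanity-checked with the same line of arithmetic. A sketch with hedged assumptions — ~5 GB of non-weight overhead per GPU, FP8 weights split evenly under TP-2, FP8 KV at ~160 KB/token; the calculator carries more careful engine-overhead numbers:

```python
kv_per_user = 160e3 * 8_000        # ≈ 1.3 GB of KV per 8K-context session

def max_users(total_vram_gb, weight_gb, overhead_gb):
    free = (total_vram_gb - weight_gb - overhead_gb) * 1e9
    return int(free // kv_per_user)

print("(a) 2× H100 TP-2 :", max_users(160, 70, 10), "sessions")   # ~60
print("(b) 1× H200 SXM  :", max_users(141, 70, 5), "sessions")    # ~50
# Either option clears the 16-stream spec with real headroom; (c) never has to happen.
```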

Capacity headroom is non-negotiable. The spec’s 16 streams at 8K does not fit on 1× H100, even at FP8/FP8. TP-2 or H200 are the two honest fixes; both keep FP8 throughout.

06 Pick the engine: workload shape decides. §

Three candidates: vLLM, TRT-LLM, SGLang. The workload here is agent-style: long shared system prompt, short per-request content, many concurrent sessions. The shared system prompt is the single biggest engine-level optimisation lever for this product, because the prefill cost on a 6K system prompt is non-trivial — several hundred milliseconds of prefill compute per request if it isn't cached.
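A rough estimate of that uncached cost treats prefill as compute-bound: FLOPs ≈ 2 × params × prompt tokens, divided by the GPU's dense FP8 peak times a realistic MFU. The peak and MFU figures below are assumptions, not measurements:

```python
params = 70e9
prompt_tokens = 6_000
dense_fp8_flops = 1.98e15      # H100 SXM dense FP8 tensor-core peak (spec sheet)
mfu = 0.5                      # assumed utilisation for a long prefill

prefill_s = (2 * params * prompt_tokens) / (dense_fp8_flops * mfu)
print(f"uncached system-prompt prefill: ~{prefill_s * 1000:.0f} ms per request")
# ~850 ms on one GPU, roughly half that under TP-2 — paid once with prefix caching,
# paid on every single request without it.
```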

SGLang’s RadixAttention prefix caching is purpose-built for this case: a 6K system prompt that’s identical across all sessions becomes a one-time prefill cost rather than a per-request one. vLLM has prefix caching too (added in 2024); the SGLang implementation is more aggressive and handles the deeper prefix-tree case (per-tool-call prefixes, multi-turn agent state).

For a coding-agent product specifically, SGLang is the lead choice. For chat where prefixes aren't shared, vLLM is the strong default. For raw NVIDIA-hardware peak performance, TRT-LLM edges out both by 10–20% on H100 with FP8.

Engine choice is workload-driven. Agents → SGLang (prefix caching at scale). Chat → vLLM (best-rounded). NVIDIA-peak → TRT-LLM. Throughput differences without prefix caching are modest (~15–25%); throughput differences with prefix caching are 2–3× when the workload has shared prefixes.

07 Read the scorecard side by side. §

Three candidates: A (1× H100 FP8/FP8 — the budget option), B (2× H100 TP FP8/FP8 — the spec-fitting answer), C (8× H200 DeepSeek-V3 FP8/FP8 — the MoE upgrade). The scorecard shows the live calculator output for each.

Interactive scorecard — same workload (16 users × 8K context, agent-style traffic) · three deployments:
A — Llama-3-70B FP8 / 1× H100 / FP8 KV · B — Llama-3-70B FP8 / 2× H100 TP / FP8 KV · C — DeepSeek-V3 FP8 / 8× H200 TP / FP8 KV
Rows: Fits · Decode (per user) · Max concurrent · $/M output · Binding constraint.

Reading across the rows: A fits the model but caps at ~3 concurrent — capacity-bound, and because most of the GPU sits idle on KV instead of decoding for many users, $/M output is higher than B despite the cheaper hourly. B spends 2× the hourly to split weights across two GPUs and recover the KV headroom, which lifts max concurrent into the dozens and drops $/M output below A’s because the GPU is now actually utilised. C swaps to a MoE model on H200; per-token economics are similar to B’s, with much more knowledge density and far higher ceiling on concurrent streams — the right call only if the product needs DeepSeek-V3’s capability or anticipates significant traffic.
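The utilisation point is one formula: $/M output is the hourly rate divided by the millions of tokens the node actually produces per hour. A sketch with deliberately round, hypothetical hourly rates and throughputs — the scorecard itself uses the calculator's numbers:

```python
def cost_per_million(hourly_usd: float, concurrent: int, per_user_tps: float) -> float:
    tokens_per_hour = concurrent * per_user_tps * 3_600
    return hourly_usd / (tokens_per_hour / 1e6)

# Hypothetical illustration: A is capacity-capped at ~3 users; B pays 2× the hourly
# but serves the full 16 streams.
print(f"A (1× H100): ${cost_per_million(3.0, 3, 40):.2f} / M output tokens")    # ~$6.9
print(f"B (2× H100): ${cost_per_million(6.0, 16, 30):.2f} / M output tokens")   # ~$3.5
# Doubling the hourly while 5×-ing served streams lowers $/M: utilisation wins.
```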

The “binding constraint” row is the key signal. A is capacity-bound at the spec; B’s constraint shifts to bandwidth or budget, which is the regime you want to be in; C has headroom on every axis at higher hardware commitment.

The scorecard format is the deliverable: three columns, five rows of metrics, one line of binding-constraint commentary. The lesson the rows tell here: cheaper hourly does not mean cheaper per token when capacity is bound — utilisation wins.

08 When the prediction lies. §

The library and the calculator agree on the math. They predict to within 20–30% of measured throughput on the validated benchmark set (/validate). The 30% gap is real; it comes from five recurring sources, each documented as a failure mode (a checklist sketch follows the list):

  • MoE at small batch — efficiency loss from imbalanced expert routing; the calculator assumes even routing.
  • Underspecified vendor configurations — published “70B model on H100” numbers don’t always specify quant, batch size, or context, so head-to-head comparisons against vendor TPS often disagree by 30%+.
  • Long context regimes — attention compute grows quadratically with context and starts to dominate at long sequence lengths; the calculator's linear approximation under-predicts step time at >32K context.
  • Apple Silicon MFU — the unified memory architecture makes the bandwidth equation less directly predictive.
  • GGUF on data-center cards — the GGUF format trades bandwidth for compute in ways the calculator doesn’t model precisely.
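Screening a spec against the five modes is mechanical. A minimal sketch — the field names and thresholds are illustrative, not the library's schema:

```python
def applicable_failure_modes(spec: dict) -> list[str]:
    """Return the named failure modes that apply to a deployment spec."""
    hits = []
    if spec.get("moe") and spec.get("batch", 1) < 8:            # threshold illustrative
        hits.append("MoE at small batch")
    if spec.get("comparing_vendor_numbers") and not spec.get("vendor_config_known"):
        hits.append("underspecified vendor configuration")
    if spec.get("context", 0) > 32_000:
        hits.append("long context (quadratic attention)")
    if spec.get("hardware") == "apple_silicon":
        hits.append("Apple Silicon MFU")
    if spec.get("format") == "gguf" and spec.get("datacenter_gpu"):
        hits.append("GGUF on data-center cards")
    return hits

worked_spec = {"moe": False, "batch": 16, "context": 8_000,
               "format": "fp8", "hardware": "h100", "datacenter_gpu": True}
print(applicable_failure_modes(worked_spec))   # [] — none of the five apply
```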

For the worked spec landed in Beat 5 — Llama-3-70B FP8 / 2× H100 TP / FP8 KV at 16 concurrent × 8K context — none of those five failure modes apply. The calculator’s prediction should land within ~20% of measured production TPS. If it doesn’t, the first thing to check is whether continuous batching is on in the engine and whether the engine is actually using FP8 (older vLLM versions silently fell back to FP16 if the GPU lacked FP8 hardware).

Trust the math at ±20% in the validated regimes; trust it less in the named failure-mode regimes; never trust it without sanity-checking against an actual measurement at deploy time. The calculator is a spec-sheet upper bound; reality lands inside it.

09 Two slogans the design loop refuses. §

10 What you've built. §

If you’ve worked through all five stages, you can:

  • Predict per-user decode TPS from a model + GPU + quant spec sheet, with no benchmarks required (Stage 2).
  • Predict the maximum concurrent users at a given context for any GPU + model combination, before deploying (Stage 3).
  • Pick the right lever — quant / KV quant / attention variant / MoE / batching / parallelism / engine — for any given binding constraint (Stage 4).
  • Walk a real product spec to a defensible deployment in five steps and produce a scorecard your platform team will accept (Stage 5).
  • Sanity-check the prediction against /validate and know which named failure modes apply to your deployment (cross-cutting, every stage).

The site keeps three pages live where this knowledge becomes operational:

  • /calculate — the cockpit. Plug in any spec; get the four numbers + confidence.
  • /compare — the laboratory. Sweep one variable; see the curve; share the URL.
  • /validate — the trust play. See where the predictions hold and where they don’t, named.

Together, they’re an interactive textbook that pays back in real planning artifacts.

The Learn-page spine ends at a deployment scorecard. The tools survive the read; come back with a real spec, plug it in, share the URL with your team. That’s the deliverable.

Open the worked-spec deployment in the calculator ↗

What's next

The Learn-page spine ends here. Open the calculator with a real product spec; share the URL with your platform team; come back to /validate when the prediction needs defending.
