"The maximum flow through a network equals the capacity of its minimum cut."
Ford & Fulkerson, 1956
| Component | Hyperion | Warship |
|---|---|---|
| CPU | i9-13900K — 24 cores | TR PRO 9975WX — 32c / 128 lanes |
| RAM | 192 GB DDR5 — 90 GB/s (2-ch) | 256 GB DDR5 ECC — 358 GB/s (8-ch) |
| GPUs | RTX 5090 — 32 GB GDDR7 | 2× RTX PRO 6000 96 GB + 5090 32 GB |
| VRAM | 32 GB — 1,792 GB/s | 224 GB total — ~1,536 GB/s per PRO 6000 |
| PCIe | 5.0 ×16 — 64 GB/s | 5.0 ×16 per GPU — 64 GB/s ea. |
| SSD | Gen 4 — 7 GB/s | Gen 5 — 14 GB/s |
| Model | Weights | Hyperion | Warship | Bottleneck (Hyperion → Warship) |
|---|---|---|---|---|
| Llama 8B FP16 | 16 GB | 112 t/s | 112 t/s | VRAM bandwidth (both) |
| Llama 70B Q4 | 35 GB | 15 t/s | 44 t/s | PCIe offload → all-VRAM |
| Llama 70B FP16 | 140 GB | 0.6 t/s | 22 t/s | Massive offload → 2-GPU TP |
| Llama 405B Q4 | 203 GB | 0.4 t/s | 8 t/s | Crawling offload → 3-GPU PP |
*Theoretical maximums: tok/s = bandwidth / model_bytes. Real-world throughput runs ~20-40% lower (KV cache, attention, overhead); the ratios hold.*
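The arithmetic behind the table fits in a few lines. A minimal sketch, assuming only the spec-table numbers (VRAM bandwidth, 64 GB/s PCIe 5.0 ×16) and ignoring the real-world overheads noted above; `tokens_per_sec` is this sketch's helper, not a library call:

```python
# Minimal sketch of the table's max-flow arithmetic: each generated token
# reads every weight once, so per-token time is VRAM-resident bytes over
# VRAM bandwidth plus any spilled bytes over PCIe.

def tokens_per_sec(model_gb: float, vram_gb: float,
                   vram_bw: float, pcie_bw: float = 64.0) -> float:
    """Theoretical max tok/s for one GPU with RAM offload over PCIe."""
    in_vram = min(model_gb, vram_gb)        # portion resident in VRAM
    spilled = max(model_gb - vram_gb, 0.0)  # overflow streamed over PCIe
    return 1.0 / (in_vram / vram_bw + spilled / pcie_bw)

# Hyperion (RTX 5090: 32 GB at 1,792 GB/s):
print(tokens_per_sec(16, 32, 1792))   # Llama 8B FP16 -> ~112 t/s
print(tokens_per_sec(35, 32, 1792))   # 70B Q4        -> ~15 t/s
print(tokens_per_sec(140, 32, 1792))  # 70B FP16      -> ~0.6 t/s

# Warship, 70B FP16 on 2x PRO 6000 tensor parallel: shards are read in
# parallel, so effective bandwidth is the sum (2 x ~1,536 GB/s).
print(tokens_per_sec(140, 192, 2 * 1536))  # -> ~22 t/s
```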
In Ford-Fulkerson, you increase flow by finding augmenting paths. SSD (Speculative Speculative Decoding, Tri Dao / ICLR 2026) adds a parallel path:
Without the draft path:

- Target model generates one token at a time.
- 70B FP16 → 22 t/s

With SSD:

- Draft (8B on 5090) proposes N tokens.
- Target (70B on 2× PRO 6000) verifies the batch.
- Accept ~3-5 tokens per draft step.
- 70B quality → 50-70 t/s (est.)
The 5090 stays in the Warship as the draft GPU. It adds an augmenting path through the flow network: cheap candidates verified in bulk.
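A back-of-envelope sketch of what that path buys, assuming a memory-bound target that verifies a short draft batch for roughly the cost of one ordinary forward pass; the draft length and acceptance counts below are assumptions, not measurements from the SSD paper:

```python
# Effective throughput of draft-and-verify decoding: each round costs
# `draft_len` draft tokens plus ~one target forward pass, and yields
# `accepted` tokens of target-quality output on average.

def speculative_tps(draft_tps: float, target_tps: float,
                    draft_len: int, accepted: float) -> float:
    round_time = draft_len / draft_tps + 1.0 / target_tps
    return accepted / round_time

# Draft: 8B FP16 on the 5090 (~112 t/s).
# Target: 70B FP16 on 2x PRO 6000 (~22 t/s).
for acc in (3, 4, 5):
    print(acc, round(speculative_tps(112, 22, draft_len=4, accepted=acc), 1))
# -> ~37 / ~49 / ~62 t/s, the ballpark of the 50-70 t/s estimate
```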
Training needs more than weights. Adam's optimizer state runs ≈ 3× the model in FP32 (master copy plus two moment buffers); gradients add ≈ 1× more.
| Task | Memory | Hyperion | Warship |
|---|---|---|---|
| 8B full fine-tune | ~80 GB | Offload optimizer to RAM | 1 GPU |
| 70B QLoRA | ~40 GB | Tight (32 GB VRAM) | 1 PRO 6000 |
| 8B PPO (agent RL) | ~160 GB | — | 2× PRO 6000 TP |
| 70B full fine-tune | ~560 GB | — | Cloud only |
Agent training (PPO/GRPO) holds policy, reference, value, and reward models in memory simultaneously: roughly four models' worth of weights, plus optimizer state and rollout buffers on the policy. For 8B agents that's ~160 GB. Warship fits it. Hyperion can't.
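The byte accounting behind those rows, sketched with assumed bytes-per-parameter; the table's figures imply leaner optimizers than the textbook 16 B/param of mixed-precision Adam (2 weights + 2 grads + 12 FP32 state), so the constants below are back-fitted assumptions:

```python
# Billions of params x bytes/param = GB, ignoring activations and KV cache.

def train_gb(params_b: float, bytes_per_param: float) -> float:
    return params_b * bytes_per_param

def ppo_gb(params_b: float, bytes_per_param: float) -> float:
    # Policy carries the full training footprint; reference, value, and
    # reward models sit frozen at FP16 (2 bytes/param each).
    return train_gb(params_b, bytes_per_param) + 3 * train_gb(params_b, 2)

print(train_gb(8, 10))   # 8B full fine-tune  -> ~80 GB (e.g. 8-bit Adam)
print(train_gb(70, 8))   # 70B full fine-tune -> ~560 GB (all-BF16 states)
print(ppo_gb(8, 10))     # 8B PPO -> ~128 GB; rollouts/KV push it to ~160
```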
Model exceeds VRAM. Weights spill to RAM, read through PCIe at 64 GB/s. That's 28× slower than VRAM. Every GB past the cliff costs dearly.
Model fits in VRAM. All weights read at 1,500+ GB/s. The bottleneck shifts from capacity to bandwidth—a better problem. Adding a second GPU doubles throughput via tensor parallelism.
The staircase chart is the whole story. Hyperion's cliff is at 32 GB. Warship's is at 224 GB. Every model between 32 and 224 GB goes from PCIe-bound to VRAM-bound—an average 20× speedup.
As you scale further: Tier 3 (4-GPU, $55K) pushes the cliff to 416 GB. Tier 4 (DGX, $300K) reaches 640 GB with NVSwitch mesh. The bottleneck migrates: VRAM capacity → VRAM bandwidth → inter-GPU sync → money. MAXFLOW finds the cut at every tier.
You have a budget. You have a workload. Build the machine.
The staircase updates as you pick components. Every choice shifts the cliff—the point where your model overflows VRAM and performance drops 28×. Push the cliff right. Keep the staircase high. The max-flow determines your score.
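The staircase data itself falls out of the same model. A sketch reusing `tokens_per_sec` from the earlier estimator, swept over the model sizes from the inference table:

```python
# Sweep model size for a single-5090 build: the cliff appears where the
# model crosses 32 GB and the PCIe term starts to dominate per-token time.
for gb in (16, 35, 70, 140, 203):
    print(f"{gb:>4} GB -> {tokens_per_sec(gb, 32, 1792):6.1f} t/s")
# 16 -> 112.0, 35 -> 15.4, 70 -> 1.6, 140 -> 0.6, 203 -> 0.4
```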
- Largest model at ≥20 tok/s. 8B = entry. 70B-Q4 = competitive. 70B-FP16 = strong. 405B = elite.
- Your bottleneck component, i.e. what to upgrade next: VRAM capacity, VRAM bandwidth, PCIe bandwidth, lane count, or budget.
- Performance per dollar. Leaderboard brackets: Budget (<$3K), Prosumer ($3-10K), Workstation ($10-40K), Research ($40K+).
- Draft GPU + target GPU = speculative decoding: an augmenting-path multiplier on effective tok/s.
Pick from these cards. Each shifts the staircase differently.
| Card | VRAM | Bandwidth | TDP | Price |
|---|---|---|---|---|
| RTX 4060 Ti | 16 GB | 288 GB/s | 165W | $400 |
| RTX 4090 | 24 GB | 1,008 GB/s | 450W | $1,600 |
| RTX 5090 | 32 GB | 1,792 GB/s | 575W | $2,000 |
| RTX PRO 6000 | 96 GB | ~1,536 GB/s | 300-600W | $9,449 |
| H100 SXM | 80 GB | 3,350 GB/s | 700W | $25,000 |
| CPU | Cores | PCIe Lanes | RAM Ch. | Price |
|---|---|---|---|---|
| i9-14900K | 24 | 20 (5.0) | 2 | $580 |
| Ryzen 9 9950X | 16 | 28 (5.0) | 2 | $550 |
| TR PRO 9975WX | 32 | 128 (5.0) | 8 | $3,924 |
| TR PRO 9995WX | 96 | 128 (5.0) | 8 | $7,469 |
| Config | Capacity | Bandwidth | Price |
|---|---|---|---|
| DDR5-5600 2-ch | 192 GB | 90 GB/s | $600 |
| DDR5-5600 ECC 8-ch | 256 GB | 358 GB/s | $5,600 |
| DDR5-5600 ECC 8-ch | 512 GB | 358 GB/s | $11,200 |
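Putting the catalog to work: a sketch of the scoring loop under the same assumptions, with stats copied from the tables above. Multi-GPU is modeled here as pipeline parallelism (shards read sequentially for a single stream); matched GPUs running tensor parallel would sum bandwidth instead, as in the earlier estimator. `pipeline_tps` and the `GPUS` dict are this sketch's inventions:

```python
GPUS = {  # name: (vram_gb, bandwidth_gb_s, price_usd), from the catalog
    "RTX 4060 Ti": (16, 288, 400),
    "RTX 4090": (24, 1008, 1600),
    "RTX 5090": (32, 1792, 2000),
    "RTX PRO 6000": (96, 1536, 9449),
    "H100 SXM": (80, 3350, 25000),
}

def pipeline_tps(model_gb: float, build: list[str],
                 pcie_bw: float = 64.0) -> float:
    """Greedy shard placement; per-token time = sum of shard read times."""
    seconds, left = 0.0, model_gb
    for name in build:
        vram, bw, _ = GPUS[name]
        shard = min(left, vram)
        seconds += shard / bw
        left -= shard
    return 1.0 / (seconds + left / pcie_bw)  # leftover spills over PCIe

warship = ["RTX PRO 6000", "RTX PRO 6000", "RTX 5090"]
print(round(pipeline_tps(203, warship), 1))  # 405B Q4 -> ~7.6 t/s ("~8")
print(sum(GPUS[g][0] for g in warship), "GB cliff,",
      f"${sum(GPUS[g][2] for g in warship):,} in GPUs")  # 224 GB, $20,898
```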
1. Run 70B Q4 at >10 tok/s. You have $3K. Go (sanity check sketched after this list).
2. Run 70B FP16 inference + QLoRA 70B training. Two missions, one machine.
3. Run 405B Q4 at >15 tok/s AND train 8B agents with PPO.
4. Run 405B FP16 at >20 tok/s. Minimize cost. This is the final boss.
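One hedged sanity check for the first mission, reusing `tokens_per_sec` from the earlier sketch; a possible line, not the intended solution:

```python
# RTX 5090 ($2,000) + Ryzen 9 9950X ($550) leaves ~$450 for board, RAM,
# and PSU: tight, but the arithmetic clears the bar. 70B Q4 (35 GB)
# spills 3 GB past the 32 GB cliff.
print(tokens_per_sec(35, 32, 1792))  # -> ~15 t/s, over the >10 t/s target
```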
Every component shifts the staircase. The game is: make it tall and wide within budget.
| Action | Effect on Staircase |
|---|---|
| Add GPU | Extends top step right (more VRAM → cliff moves) |
| Upgrade GPU | Raises top step (faster bandwidth per GPU) |
| Add RAM | Extends middle step right (more offload capacity) |
| Upgrade CPU | More PCIe lanes = feed more GPUs at full speed |
| Better SSD | Barely moves anything (startup speed only) |
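For instance, one "Add GPU" move, reusing `pipeline_tps` and the catalog from the build-scoring sketch:

```python
# Adding a second 5090 moves the cliff from 32 GB to 64 GB, flipping
# 70B Q4 (35 GB) from PCIe-bound to fully VRAM-resident.
print(round(pipeline_tps(35, ["RTX 5090"]), 1))              # ~15.4 t/s
print(round(pipeline_tps(35, ["RTX 5090", "RTX 5090"]), 1))  # ~51.2 t/s
```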
- ~12 GB VRAM. Any modern GPU.
- ~40 GB. Needs a PRO 6000 or 2× consumer GPUs.
- ~160 GB. Policy + ref + value + reward. 2× PRO 6000.
- ~560 GB. Nobody does this locally. Yet.