MAXFLOW

"The maximum flow through a network equals the capacity of its minimum cut."

Ford & Fulkerson, 1956

tok/s ≈ bandwidth / model_bytes
Autoregressive decode streams every weight through memory once per generated token, so finding the max flow IS finding the bottleneck. GPU marketing says TFLOPS. Physics says bandwidth.
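In code, the whole formula is one line. A minimal sketch (Python; the function name is mine, and it ignores KV-cache traffic and compute time):

    # Decode throughput ceiling: each generated token streams every
    # weight byte through memory once, so tok/s ~= bandwidth / bytes.
    def tok_s_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    print(tok_s_ceiling(1792, 16))  # 5090 serving Llama 8B FP16 -> 112.0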

HYPERION

CPU   i9-13900K — 24 cores
RAM   192 GB DDR5 — 90 GB/s (2-ch)
GPU   RTX 5090 — 32 GB GDDR7
VRAM  1,792 GB/s
PCIe  5.0 ×16 — 64 GB/s
SSD   Gen 4 — 7 GB/s

WARSHIP (target build)

CPU   TR PRO 9975WX — 32c / 128 lanes
RAM   256 GB DDR5 ECC — 358 GB/s (8-ch)
GPUs  2× RTX PRO 6000 96 GB + 5090 32 GB
VRAM  224 GB total — ~1,536 GB/s ea.
PCIe  5.0 ×16 per GPU — 64 GB/s ea.
SSD   Gen 5 — 14 GB/s
[Chart: the bandwidth staircase. X axis: model size (GB, log scale); Y axis: effective read bandwidth (GB/s). Hyperion holds 1,792 GB/s out to 32 GB, then drops off a 28× bandwidth cliff to PCIe (64 GB/s) and SSD (7 GB/s). Warship holds ~1,536 GB/s out to 224 GB before falling to PCIe and SSD (14 GB/s). Markers: 8B (16 GB) at 112 t/s on both; 70B Q4 (35 GB) at 15 vs 44 t/s (2.9×); 70B FP16 (140 GB) at 0.6 vs 22 t/s (37×); 405B Q4 (203 GB) at 0.4 vs 8 t/s. The 32-224 GB span is the upgrade zone.]

Inference: Tokens Per Second

Model            Weights   Hyperion   Warship   Bottleneck
Llama 8B FP16    16 GB     112 t/s    112 t/s   VRAM bandwidth (both)
Llama 70B Q4     35 GB     15 t/s     44 t/s    PCIe offload → all-VRAM
Llama 70B FP16   140 GB    0.6 t/s    22 t/s    Massive offload → 2-GPU TP
Llama 405B Q4    203 GB    0.4 t/s    8 t/s     Crawling offload → 3-GPU PP
Theoretical maximums from tok/s = bandwidth / model_bytes. Real-world throughput runs ~20-40% lower (KV-cache reads, attention compute, overhead); the ratios between machines hold.
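The table's numbers fall out of a two-term cost model: weights resident in VRAM stream at VRAM speed, and anything past the cliff is re-read over PCIe every token. Extending the earlier ceiling sketch (again mine; it assumes transfers and compute don't overlap):

    def tok_s(model_gb: float, vram_gb: float,
              vram_bw: float = 1792.0, pcie_bw: float = 64.0) -> float:
        in_vram = min(model_gb, vram_gb)      # streams at VRAM speed
        spill = max(model_gb - vram_gb, 0.0)  # re-read over PCIe per token
        return 1.0 / (in_vram / vram_bw + spill / pcie_bw)

    print(round(tok_s(16, 32)))      # Llama 8B FP16, Hyperion  -> 112
    print(round(tok_s(35, 32)))      # Llama 70B Q4, Hyperion   -> 15
    print(round(tok_s(140, 32), 1))  # Llama 70B FP16, Hyperion -> 0.6
    # Warship: 70B Q4 fully resident on one PRO 6000 (slower VRAM, no spill)
    print(round(tok_s(35, 96, vram_bw=1536.0)))  # -> 44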

Speculative Decoding = Augmenting Path

In Ford-Fulkerson, you increase flow by finding augmenting paths. SSD (Speculative Speculative Decoding, Tri Dao / ICLR 2026) adds a parallel path:

Without SSD

Target model generates one token at a time.

70B FP16 → 22 t/s

With SSD

Draft (8B on 5090) proposes N tokens.
Target (70B on 2× PRO 6000) verifies batch.
Accept ~3-5 tokens per draft step.

70B quality → 50-70 t/s (est.)

The 5090 stays in the Warship as the draft GPU. It adds an augmenting path through the flow network: cheap candidates verified in bulk.
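A hedged estimate of what the augmenting path buys (the ~5 GB Q4 draft size and 4-token average acceptance are my assumptions; the verify pass is costed as one full read of the target's weights):

    # Each round: the draft proposes n tokens autoregressively, then the
    # target verifies the whole batch in a single forward pass. 'accepted'
    # is the average number of tokens kept per round.
    def spec_tok_s(draft_tok_s: float, target_tok_s: float,
                   n_draft: int, accepted: float) -> float:
        round_time = n_draft / draft_tok_s + 1.0 / target_tok_s
        return accepted / round_time

    # Draft: 8B Q4 (~5 GB) on the 5090 -> ~358 tok/s (1792 / 5).
    # Target: 70B FP16 on 2x PRO 6000 -> 22 tok/s.
    print(round(spec_tok_s(358, 22, n_draft=5, accepted=4)))  # -> 67

That lands inside the 50-70 t/s estimate above; the acceptance rate, not the hardware, becomes the sensitive variable.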

Agent Training: The Memory Multiplier

Training needs more than weights. Adam's optimizer state runs ≈ 3× the model's bytes, held in FP32; gradients add ≈ 1× more. A full fine-tune therefore needs roughly 5× the model in memory.

Task                 Memory    Hyperion                   Warship
8B full fine-tune    ~80 GB    Offload optimizer to RAM   1 GPU
70B QLoRA            ~40 GB    Tight (32 GB VRAM)         1 PRO 6000
8B PPO (agent RL)    ~160 GB   Won't fit                  2× PRO 6000 TP
70B full fine-tune   ~560 GB   Cloud only                 Cloud only

Agent training (PPO/GRPO) needs policy + reference + value + reward models resident simultaneously: four models at once. For 8B agents that's ~160 GB. Warship fits it. Hyperion can't.
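A rough estimator in the same accounting (the 5× rule is from the paragraph above; real frameworks shave these numbers with sharding, paged optimizers, and LoRA heads):

    # A trainable model costs weights (1x) + gradients (1x) + Adam state
    # (~3x) ~= 5x its FP16 bytes; a frozen model costs its weights only.
    def model_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
        return params_b * bytes_per_param

    def train_gb(params_b: float, trainable: bool = True) -> float:
        return model_gb(params_b) * (5.0 if trainable else 1.0)

    print(train_gb(8))  # 8B full fine-tune -> 80.0 GB (matches the table)
    # 8B PPO: trainable policy + value, frozen reference + reward.
    # This crude count gives 192 GB, the ballpark of the table's ~160 GB.
    print(train_gb(8) + train_gb(8) + train_gb(8, False) + train_gb(8, False))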

The Min-Cut Migrates

Below the cliff

Model exceeds VRAM. Weights spill to RAM, read through PCIe at 64 GB/s. That's 28× slower than VRAM. Every GB past the cliff costs dearly.

Above the cliff

Model fits in VRAM. All weights read at 1,500+ GB/s. The bottleneck shifts from capacity to bandwidth—a better problem. Adding a second GPU doubles throughput via tensor parallelism.

The staircase chart is the whole story. Hyperion's cliff is at 32 GB. Warship's is at 224 GB. Every model between 32 and 224 GB goes from PCIe-bound to VRAM-bound—an average 20× speedup.

As you scale further: Tier 3 (4-GPU, $55K) pushes the cliff to 416 GB. Tier 4 (DGX, $300K) reaches 640 GB with NVSwitch mesh. The bottleneck migrates: VRAM capacity → VRAM bandwidth → inter-GPU sync → money. MAXFLOW finds the cut at every tier.
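To make the metaphor literal: a minimal Edmonds-Karp sketch that treats the memory system as a flow network (the topology and single "gpu" sink are illustrative, with capacities taken from the Warship spec card):

    from collections import deque
    import copy

    # Edmonds-Karp max-flow over a dict-of-dicts capacity graph (GB/s).
    def max_flow(cap, s, t):
        total = 0
        while True:
            parent = {s: None}
            q = deque([s])
            while q and t not in parent:       # BFS: shortest augmenting path
                u = q.popleft()
                for v, c in cap[u].items():
                    if c > 0 and v not in parent:
                        parent[v] = u
                        q.append(v)
            if t not in parent:
                return total                   # no augmenting path left
            path, v = [], t
            while parent[v] is not None:
                path.append((parent[v], v))
                v = parent[v]
            push = min(cap[u][v] for u, v in path)   # path bottleneck
            for u, v in path:
                cap[u][v] -= push
                cap[v][u] = cap[v].get(u, 0) + push  # residual edge
            total += push

    # Warship: SSD -> RAM -> three PCIe 5.0 x16 links -> GPUs.
    rig = {"ssd": {"ram": 14},
           "ram": {"p0": 64, "p1": 64, "p2": 64},
           "p0": {"gpu": 1536}, "p1": {"gpu": 1536}, "p2": {"gpu": 1792},
           "gpu": {}}
    print(max_flow(copy.deepcopy(rig), "ssd", "gpu"))  # 14: cold load, SSD is the cut
    print(max_flow(copy.deepcopy(rig), "ram", "gpu"))  # 192: offload, PCIe is the cut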

THE GAME: Build Your Inference Rig

[Interactive: MAXFLOW PC Builder — bandwidth staircase with component cards]

You have a budget. You have a workload. Build the machine.

The staircase updates as you pick components. Every choice shifts the cliff—the point where your model overflows VRAM and performance drops 28×. Push the cliff right. Keep the staircase high. The max-flow determines your score.

Scoring Rubric

Flow Score (primary): Largest model you can run at ≥20 tok/s. 8B = entry. 70B-Q4 = competitive. 70B-FP16 = strong. 405B = elite.

Min-Cut (diagnostic): Your bottleneck component, i.e. what to upgrade next: VRAM capacity, VRAM bandwidth, PCIe, CPU lanes, or budget.

Efficiency (tok/s per $1K): Performance per dollar. Leaderboard: Budget (<$3K), Prosumer ($3-10K), Workstation ($10-40K), Research ($40K+).

SSD Bonus (×1.0–3.0): Draft GPU + target GPU = speculative decoding. An augmenting-path multiplier on effective tok/s.
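Wired together, scoring might look like this sketch (the leaderboard brackets come from the cards above; the function shape and the ~$30K example cost are mine):

    def score(tok_s: float, cost_usd: float, ssd_multiplier: float = 1.0):
        effective = tok_s * ssd_multiplier       # SSD bonus: x1.0-3.0
        per_1k = effective / (cost_usd / 1_000)  # efficiency: tok/s per $1K
        bracket = ("Budget" if cost_usd < 3_000 else
                   "Prosumer" if cost_usd < 10_000 else
                   "Workstation" if cost_usd < 40_000 else "Research")
        return effective, round(per_1k, 2), bracket

    # Warship-class build at 22 tok/s (70B FP16) with a 2.5x SSD bonus:
    print(score(22, 30_000, ssd_multiplier=2.5))  # (55.0, 1.83, 'Workstation')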

Component Deck

Pick from these cards. Each shifts the staircase differently.

GPUs — the main pipe

Card           VRAM    Bandwidth     TDP        Price
RTX 4060 Ti    16 GB   288 GB/s      165W       $400
RTX 4090       24 GB   1,008 GB/s    450W       $1,600
RTX 5090       32 GB   1,792 GB/s    575W       $2,000
RTX PRO 6000   96 GB   ~1,536 GB/s   300-600W   $9,449
H100 SXM       80 GB   3,350 GB/s    700W       $25,000

CPUs — the lanes

Card            Cores   PCIe Lanes   RAM Ch.   Price
i9-14900K       24      20 (5.0)     2         $580
Ryzen 9 9950X   16      28 (5.0)     2         $550
TR PRO 9975WX   32      128 (5.0)    8         $3,924
TR PRO 9995WX   96      128 (5.0)    8         $7,469

RAM — the overflow pool

Config               Capacity   Bandwidth   Price
DDR5-5600 2-ch       192 GB     90 GB/s     $600
DDR5-5600 ECC 8-ch   256 GB     358 GB/s    $5,600
DDR5-5600 ECC 8-ch   512 GB     358 GB/s    $11,200

Challenge Scenarios

The Freelancer

Budget: $3,000

Run 70B Q4 at >10 tok/s. You have $3K. Go.

The Quant

Budget: $15,000

Run 70B FP16 inference + QLoRA 70B training. Two missions, one machine.

The Lab

Budget: $50,000

Run 405B Q4 at >15 tok/s AND train 8B agents with PPO.

The Endgame

Budget: unlimited

Run 405B FP16 at >20 tok/s. Minimize cost. This is the final boss.

The Staircase Mechanic

Every component shifts the staircase. The game is: make it tall and wide within budget.

Action        Effect on Staircase
Add GPU       Extends top step right (more VRAM → cliff moves)
Upgrade GPU   Raises top step (faster bandwidth per GPU)
Add RAM       Extends middle step right (more offload capacity)
Upgrade CPU   More PCIe lanes = feed more GPUs at full speed
Better SSD    Barely moves anything (startup speed only)
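The staircase itself reduces to a few lines (component numbers from the deck above; the representation and the slowest-card pacing assumption are mine):

    # Pooled VRAM reads at VRAM speed; overflow into RAM is read over
    # PCIe; past RAM you page from SSD (Gen 5 assumed: 14 GB/s).
    def staircase(gpus, ram_gb, pcie_bw=64.0, ssd_bw=14.0):
        vram_gb = sum(cap for cap, _ in gpus)
        vram_bw = min(bw for _, bw in gpus)   # slowest card paces the pool
        return [("VRAM", vram_gb, vram_bw),
                ("PCIe/RAM", vram_gb + ram_gb, pcie_bw),
                ("SSD", float("inf"), ssd_bw)]

    def step_for(model_gb, steps):
        for name, capacity_gb, bw in steps:
            if model_gb <= capacity_gb:
                return name, bw

    warship = staircase([(96, 1536), (96, 1536), (32, 1792)], ram_gb=256)
    print(step_for(140, warship))  # ('VRAM', 1536): above the cliff
    print(step_for(300, warship))  # ('PCIe/RAM', 64.0): below it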

Training Badges

QLoRA Novice (QLoRA 8B): ~12 GB VRAM. Any modern GPU.

QLoRA Pro (QLoRA 70B): ~40 GB. Needs a PRO 6000 or 2× consumer GPUs.

Agent Smith (8B PPO): ~160 GB. Policy + ref + value + reward. 2× PRO 6000.

Cloud Cutter (70B full FT): ~560 GB. Nobody does this locally. Yet.