MAXFLOW

"The maximum flow through a network equals the capacity of its minimum cut."

Ford & Fulkerson, 1956

tok/s ≈ bandwidth / model_bytes
Autoregressive decode streams every weight through memory once per generated token, so finding the max flow IS finding the bottleneck. GPU marketing says TFLOPS. Physics says bandwidth.
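In code, the whole formula is one line. A minimal sketch (Python; the function name is mine, and it ignores KV-cache traffic and compute time):

    # Decode throughput ceiling: each generated token streams every
    # weight byte through memory once, so tok/s ~= bandwidth / bytes.
    def tok_s_ceiling(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    print(tok_s_ceiling(1792, 16))  # 5090 serving Llama 8B FP16 -> 112.0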

HYPERION

CPU   i9-13900K — 24 cores
RAM   192 GB DDR5 — 90 GB/s (2-ch)
GPU   RTX 5090 — 32 GB GDDR7
VRAM  1,792 GB/s
PCIe  5.0 ×16 — 64 GB/s
SSD   Gen 4 — 7 GB/s

WARSHIP (target build)

CPU   TR PRO 9975WX — 32c / 128 lanes
RAM   256 GB DDR5 ECC — 358 GB/s (8-ch)
GPUs  2× RTX PRO 6000 96 GB + 5090 32 GB
VRAM  224 GB total — ~1,536 GB/s ea.
PCIe  5.0 ×16 per GPU — 64 GB/s ea.
SSD   Gen 5 — 14 GB/s
[Chart: the bandwidth staircase. X axis: model size (GB, log scale); Y axis: effective read bandwidth (GB/s). Hyperion holds 1,792 GB/s out to 32 GB, then drops off a 28× bandwidth cliff to PCIe (64 GB/s) and SSD (7 GB/s). Warship holds ~1,536 GB/s out to 224 GB before falling to PCIe and SSD (14 GB/s). Markers: 8B (16 GB) at 112 t/s on both; 70B Q4 (35 GB) at 15 vs 44 t/s (2.9×); 70B FP16 (140 GB) at 0.6 vs 22 t/s (37×); 405B Q4 (203 GB) at 0.4 vs 8 t/s. The 32-224 GB span is the upgrade zone.]

Inference: Tokens Per Second

Model            Weights   Hyperion   Warship   Bottleneck
Llama 8B FP16    16 GB     112 t/s    112 t/s   VRAM bandwidth (both)
Llama 70B Q4     35 GB     15 t/s     44 t/s    PCIe offload → all-VRAM
Llama 70B FP16   140 GB    0.6 t/s    22 t/s    Massive offload → 2-GPU TP
Llama 405B Q4    203 GB    0.4 t/s    8 t/s     Crawling offload → 3-GPU PP
Theoretical maximums from tok/s = bandwidth / model_bytes. Real-world throughput runs ~20-40% lower (KV-cache reads, attention compute, overhead); the ratios between machines hold.
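The table's numbers fall out of a two-term cost model: weights resident in VRAM stream at VRAM speed, and anything past the cliff is re-read over PCIe every token. Extending the earlier ceiling sketch (again mine; it assumes transfers and compute don't overlap):

    def tok_s(model_gb: float, vram_gb: float,
              vram_bw: float = 1792.0, pcie_bw: float = 64.0) -> float:
        in_vram = min(model_gb, vram_gb)      # streams at VRAM speed
        spill = max(model_gb - vram_gb, 0.0)  # re-read over PCIe per token
        return 1.0 / (in_vram / vram_bw + spill / pcie_bw)

    print(round(tok_s(16, 32)))      # Llama 8B FP16, Hyperion  -> 112
    print(round(tok_s(35, 32)))      # Llama 70B Q4, Hyperion   -> 15
    print(round(tok_s(140, 32), 1))  # Llama 70B FP16, Hyperion -> 0.6
    # Warship: 70B Q4 fully resident on one PRO 6000 (slower VRAM, no spill)
    print(round(tok_s(35, 96, vram_bw=1536.0)))  # -> 44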

Speculative Decoding = Augmenting Path

In Ford-Fulkerson, you increase flow by finding augmenting paths. SSD (Speculative Speculative Decoding, Tri Dao / ICLR 2026) adds a parallel path:

Without SSD

Target model generates one token at a time.

70B FP16 → 22 t/s

With SSD

Draft (8B on 5090) proposes N tokens.
Target (70B on 2× PRO 6000) verifies batch.
Accept ~3-5 tokens per draft step.

70B quality → 50-70 t/s (est.)

The 5090 stays in the Warship as the draft GPU. It adds an augmenting path through the flow network: cheap candidates verified in bulk.
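A hedged estimate of what the augmenting path buys (the ~5 GB Q4 draft size and 4-token average acceptance are my assumptions; the verify pass is costed as one full read of the target's weights):

    # Each round: the draft proposes n tokens autoregressively, then the
    # target verifies the whole batch in a single forward pass. 'accepted'
    # is the average number of tokens kept per round.
    def spec_tok_s(draft_tok_s: float, target_tok_s: float,
                   n_draft: int, accepted: float) -> float:
        round_time = n_draft / draft_tok_s + 1.0 / target_tok_s
        return accepted / round_time

    # Draft: 8B Q4 (~5 GB) on the 5090 -> ~358 tok/s (1792 / 5).
    # Target: 70B FP16 on 2x PRO 6000 -> 22 tok/s.
    print(round(spec_tok_s(358, 22, n_draft=5, accepted=4)))  # -> 67

That lands inside the 50-70 t/s estimate above; the acceptance rate, not the hardware, becomes the sensitive variable.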

Agent Training: The Memory Multiplier

Training needs more than weights. Adam's optimizer state runs ≈ 3× the model's bytes, held in FP32; gradients add ≈ 1× more. A full fine-tune therefore needs roughly 5× the model in memory.

Task                 Memory    Hyperion                   Warship
8B full fine-tune    ~80 GB    Offload optimizer to RAM   1 GPU
70B QLoRA            ~40 GB    Tight (32 GB VRAM)         1 PRO 6000
8B PPO (agent RL)    ~160 GB   Won't fit                  2× PRO 6000 TP
70B full fine-tune   ~560 GB   Cloud only                 Cloud only

Agent training (PPO/GRPO) needs policy + reference + value + reward models resident simultaneously: four models at once. For 8B agents that's ~160 GB. Warship fits it. Hyperion can't.
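A rough estimator in the same accounting (the 5× rule is from the paragraph above; real frameworks shave these numbers with sharding, paged optimizers, and LoRA heads):

    # A trainable model costs weights (1x) + gradients (1x) + Adam state
    # (~3x) ~= 5x its FP16 bytes; a frozen model costs its weights only.
    def model_gb(params_b: float, bytes_per_param: float = 2.0) -> float:
        return params_b * bytes_per_param

    def train_gb(params_b: float, trainable: bool = True) -> float:
        return model_gb(params_b) * (5.0 if trainable else 1.0)

    print(train_gb(8))  # 8B full fine-tune -> 80.0 GB (matches the table)
    # 8B PPO: trainable policy + value, frozen reference + reward.
    # This crude count gives 192 GB, the ballpark of the table's ~160 GB.
    print(train_gb(8) + train_gb(8) + train_gb(8, False) + train_gb(8, False))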

The Min-Cut Migrates

Below the cliff

Model exceeds VRAM. Weights spill to RAM, read through PCIe at 64 GB/s. That's 28× slower than VRAM. Every GB past the cliff costs dearly.

Above the cliff

Model fits in VRAM. All weights read at 1,500+ GB/s. The bottleneck shifts from capacity to bandwidth—a better problem. Adding a second GPU doubles throughput via tensor parallelism.

The staircase chart is the whole story. Hyperion's cliff is at 32 GB. Warship's is at 224 GB. Every model between 32 and 224 GB goes from PCIe-bound to VRAM-bound—an average 20× speedup.

As you scale further: Tier 3 (4-GPU, $55K) pushes the cliff to 416 GB. Tier 4 (DGX, $300K) reaches 640 GB with NVSwitch mesh. The bottleneck migrates: VRAM capacity → VRAM bandwidth → inter-GPU sync → money. MAXFLOW finds the cut at every tier.
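To make the metaphor literal: a minimal Edmonds-Karp sketch that treats the memory system as a flow network (the topology and single "gpu" sink are illustrative, with capacities taken from the Warship spec card):

    from collections import deque
    import copy

    # Edmonds-Karp max-flow over a dict-of-dicts capacity graph (GB/s).
    def max_flow(cap, s, t):
        total = 0
        while True:
            parent = {s: None}
            q = deque([s])
            while q and t not in parent:       # BFS: shortest augmenting path
                u = q.popleft()
                for v, c in cap[u].items():
                    if c > 0 and v not in parent:
                        parent[v] = u
                        q.append(v)
            if t not in parent:
                return total                   # no augmenting path left
            path, v = [], t
            while parent[v] is not None:
                path.append((parent[v], v))
                v = parent[v]
            push = min(cap[u][v] for u, v in path)   # path bottleneck
            for u, v in path:
                cap[u][v] -= push
                cap[v][u] = cap[v].get(u, 0) + push  # residual edge
            total += push

    # Warship: SSD -> RAM -> three PCIe 5.0 x16 links -> GPUs.
    rig = {"ssd": {"ram": 14},
           "ram": {"p0": 64, "p1": 64, "p2": 64},
           "p0": {"gpu": 1536}, "p1": {"gpu": 1536}, "p2": {"gpu": 1792},
           "gpu": {}}
    print(max_flow(copy.deepcopy(rig), "ssd", "gpu"))  # 14: cold load, SSD is the cut
    print(max_flow(copy.deepcopy(rig), "ram", "gpu"))  # 192: offload, PCIe is the cut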

THE GAME: Build Your Inference Rig

[Interactive: MAXFLOW PC Builder — bandwidth staircase with component cards]

You have a budget. You have a workload. Build the machine.

The staircase updates as you pick components. Every choice shifts the cliff—the point where your model overflows VRAM and performance drops 28×. Push the cliff right. Keep the staircase high. The max-flow determines your score.

Scoring Rubric

Flow Score (primary): Largest model you can run at ≥20 tok/s. 8B = entry. 70B-Q4 = competitive. 70B-FP16 = strong. 405B = elite.

Min-Cut (diagnostic): Your bottleneck component, i.e. what to upgrade next: VRAM capacity, VRAM bandwidth, PCIe, CPU lanes, or budget.

Efficiency (tok/s per $1K): Performance per dollar. Leaderboard: Budget (<$3K), Prosumer ($3-10K), Workstation ($10-40K), Research ($40K+).

SSD Bonus (×1.0–3.0): Draft GPU + target GPU = speculative decoding. An augmenting-path multiplier on effective tok/s.
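Wired together, scoring might look like this sketch (the leaderboard brackets come from the cards above; the function shape and the ~$30K example cost are mine):

    def score(tok_s: float, cost_usd: float, ssd_multiplier: float = 1.0):
        effective = tok_s * ssd_multiplier       # SSD bonus: x1.0-3.0
        per_1k = effective / (cost_usd / 1_000)  # efficiency: tok/s per $1K
        bracket = ("Budget" if cost_usd < 3_000 else
                   "Prosumer" if cost_usd < 10_000 else
                   "Workstation" if cost_usd < 40_000 else "Research")
        return effective, round(per_1k, 2), bracket

    # Warship-class build at 22 tok/s (70B FP16) with a 2.5x SSD bonus:
    print(score(22, 30_000, ssd_multiplier=2.5))  # (55.0, 1.83, 'Workstation')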

Component Deck

Pick from these cards. Each shifts the staircase differently.

GPUs — the main pipe

Card           VRAM    Bandwidth     TDP        Price
RTX 4060 Ti    16 GB   288 GB/s      165W       $400
RTX 4090       24 GB   1,008 GB/s    450W       $1,600
RTX 5090       32 GB   1,792 GB/s    575W       $2,000
RTX PRO 6000   96 GB   ~1,536 GB/s   300-600W   $9,449
H100 SXM       80 GB   3,350 GB/s    700W       $25,000

CPUs — the lanes

Card            Cores   PCIe Lanes   RAM Ch.   Price
i9-14900K       24      20 (5.0)     2         $580
Ryzen 9 9950X   16      28 (5.0)     2         $550
TR PRO 9975WX   32      128 (5.0)    8         $3,924
TR PRO 9995WX   96      128 (5.0)    8         $7,469

RAM — the overflow pool

Config               Capacity   Bandwidth   Price
DDR5-5600 2-ch       192 GB     90 GB/s     $600
DDR5-5600 ECC 8-ch   256 GB     358 GB/s    $5,600
DDR5-5600 ECC 8-ch   512 GB     358 GB/s    $11,200

Challenge Scenarios

The Freelancer

Budget: $3,000

Run 70B Q4 at >10 tok/s. You have $3K. Go.

The Quant

Budget: $15,000

Run 70B FP16 inference + QLoRA 70B training. Two missions, one machine.

The Lab

Budget: $50,000

Run 405B Q4 at >15 tok/s AND train 8B agents with PPO.

The Endgame

Budget: unlimited

Run 405B FP16 at >20 tok/s. Minimize cost. This is the final boss.

The Staircase Mechanic

Every component shifts the staircase. The game is: make it tall and wide within budget.

Action        Effect on Staircase
Add GPU       Extends top step right (more VRAM → cliff moves)
Upgrade GPU   Raises top step (faster bandwidth per GPU)
Add RAM       Extends middle step right (more offload capacity)
Upgrade CPU   More PCIe lanes = feed more GPUs at full speed
Better SSD    Barely moves anything (startup speed only)
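The staircase itself reduces to a few lines (component numbers from the deck above; the representation and the slowest-card pacing assumption are mine):

    # Pooled VRAM reads at VRAM speed; overflow into RAM is read over
    # PCIe; past RAM you page from SSD (Gen 5 assumed: 14 GB/s).
    def staircase(gpus, ram_gb, pcie_bw=64.0, ssd_bw=14.0):
        vram_gb = sum(cap for cap, _ in gpus)
        vram_bw = min(bw for _, bw in gpus)   # slowest card paces the pool
        return [("VRAM", vram_gb, vram_bw),
                ("PCIe/RAM", vram_gb + ram_gb, pcie_bw),
                ("SSD", float("inf"), ssd_bw)]

    def step_for(model_gb, steps):
        for name, capacity_gb, bw in steps:
            if model_gb <= capacity_gb:
                return name, bw

    warship = staircase([(96, 1536), (96, 1536), (32, 1792)], ram_gb=256)
    print(step_for(140, warship))  # ('VRAM', 1536): above the cliff
    print(step_for(300, warship))  # ('PCIe/RAM', 64.0): below it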

Training Badges

QLoRA Novice (QLoRA 8B): ~12 GB VRAM. Any modern GPU.

QLoRA Pro (QLoRA 70B): ~40 GB. Needs a PRO 6000 or 2× consumer GPUs.

Agent Smith (8B PPO): ~160 GB. Policy + ref + value + reward. 2× PRO 6000.

Cloud Cutter (70B full FT): ~560 GB. Nobody does this locally. Yet.