Open reference implementation · Apache-2.0 · v1.0.0
Twenty years ago DNS-based failover kept websites alive. Does the same trick work when the thing dying is a GPU? Half of it does — and the boundary between the halves is precise enough to draw in code.
A streaming completion holds a connection open for 5–90 seconds, so DNS is out of the loop the moment the socket opens. And "which worker has the lowest time-to-first-token right now" changes second to second — a decision no TTL can carry. So the design splits cleanly: DNS owns what it's unbeatable at — coarse pool evacuation across regions and clouds — and an L7 hop owns the per-request decision: which worker serves this request, and what happens when it dies mid-stream.
The load-bearing idea: both tiers are answering the same liveness question at different resolutions, so both are driven by a single content-free frame through the same ingest code. No worker ever asserts it's healthy — it emits raw numbers and stops when it dies. Silence = death.
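A minimal sketch of "silence = death", assuming nothing about the real ingest code: the class name, its fields, and the sequence check are hypothetical. The table stores only when each sender was last heard. There is no "healthy" flag to set, so a sender that stops emitting simply ages out.

```python
from dataclasses import dataclass, field


@dataclass
class LivenessTable:
    """Tracks when each sender was last heard. Hypothetical helper, not the repo's API."""
    window_s: float                                # freshness window for this tier
    last_seen: dict = field(default_factory=dict)  # sender id -> arrival time
    last_seq: dict = field(default_factory=dict)   # sender id -> highest sequence seen

    def ingest(self, sender_id: str, seq: int, now: float) -> bool:
        """Record one frame. Duplicates and out-of-order frames are ignored.

        Assumption: sequence numbers only grow. A sender that restarts from
        zero would need extra handling that this sketch leaves out.
        """
        if seq <= self.last_seq.get(sender_id, -1):
            return False
        self.last_seq[sender_id] = seq
        self.last_seen[sender_id] = now
        return True

    def alive(self, now: float) -> set:
        """Senders heard within the window. Everything else counts as dead."""
        return {s for s, t in self.last_seen.items() if now - t <= self.window_s}
```

The same class can serve both resolutions: `LivenessTable(window_s=10.0)` for pools at the DNS tier and `LivenessTable(window_s=3.0)` for nodes at the L7 tier.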
Picks a pool. Evacuates a region or cloud whose telemetry goes dark, so the answer never contains a dead pool. When every pool is silent, it fails over to the "fail-whale": a graceful, protocol-correct degraded response instead of a connection error.
fresh pools in the answer • 10s pool eviction window • fail-open to a known-safe set
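The DNS-tier decision can be sketched as a pure function. The function name and arguments are hypothetical, and the sketch assumes that the known-safe set is where the fail-whale responder lives; the repo may wire those two pieces differently.

```python
def pools_for_answer(last_seen, now, known_safe, window_s=10.0):
    """Return the pool names that belong in the DNS answer.

    last_seen  : dict of pool name -> time its telemetry last arrived
    known_safe : configured fallback set (assumed here to front the fail-whale)
    window_s   : pool eviction window, 10 s per the text above
    """
    fresh = sorted(p for p, t in last_seen.items() if now - t <= window_s)
    if fresh:
        return fresh
    # Every pool is silent: fail open to the fallback instead of returning nothing.
    return list(known_safe)
```

With `last_seen = {"gcp": 100.0, "lambda": 95.0, "runpod": 80.0}` and `now = 104.0`, the answer holds `gcp` and `lambda`; `runpod` has been silent for 24 seconds and is evacuated.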
One OpenAI-compatible endpoint. Scores live workers on time-to-first-token and VRAM headroom, and honors the first-token contract: failure before the first token is silently retried; after it, a clean truncation — never a counterfeit resume.
silent pre-token failover • 3s node freshness gate • circuit breaker + anti-herding
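A sketch of the first-token contract, under stated assumptions: `Worker`, `rank`, `stream_with_contract`, and the `TRUNCATED` marker are illustrative names, and ordering by time-to-first-token with VRAM headroom as the tie-break is one possible scoring rule, not necessarily the one the repo uses. The circuit breaker and anti-herding are left out.

```python
from dataclasses import dataclass

TRUNCATED = "<truncated>"  # hypothetical end-of-stream marker for a clean cut


class NoWorkerAvailable(Exception):
    """Raised when no fresh worker produced a first token."""


@dataclass
class Worker:
    name: str
    ttft_ms: float         # most recent time-to-first-token
    vram_free_frac: float  # VRAM headroom, 0.0 to 1.0
    last_seen: float       # arrival time of the last telemetry frame


def rank(workers, now, gate_s=3.0):
    """Drop workers that fail the freshness gate, then order the rest."""
    fresh = [w for w in workers if now - w.last_seen <= gate_s]
    return sorted(fresh, key=lambda w: (w.ttft_ms, -w.vram_free_frac))


def stream_with_contract(ranked, open_stream):
    """Yield tokens from the first worker that gets past its first token.

    open_stream(worker) returns an iterator of tokens and may raise
    ConnectionError at any point.
    """
    for worker in ranked:
        sent_any = False
        try:
            for token in open_stream(worker):
                sent_any = True
                yield token
            return  # stream finished normally
        except ConnectionError:
            if sent_any:
                # The client has already seen output: end cleanly, never resume.
                yield TRUNCATED
                return
            # Nothing reached the client yet: retry silently on the next worker.
    raise NoWorkerAvailable("no fresh worker produced a first token")
```

The `sent_any` flag is the whole contract: before it flips, a failure is invisible to the client; after it flips, the only honest move is to stop.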
d1–d6 live in the repo.)
A tiny sidecar scrapes an unmodified vLLM's /metrics and emits UDP frames. The expensive box stays pristine; all the logic lives on cheap CPU boxes.
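A rough sketch of what such a sidecar does, not the repo's implementation. The metric names in `WANTED` are examples that may differ between vLLM versions, the JSON payload is a stand-in chosen for brevity rather than the real wire format, and every function name here is hypothetical.

```python
import json
import socket
import urllib.request

# Assumed metric names; check the /metrics output of the vLLM version in use.
WANTED = ("vllm:num_requests_running", "vllm:num_requests_waiting")


def scrape(url, timeout=1.0):
    """Fetch the Prometheus text exposition from the worker's /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace")


def parse_metrics(text, wanted=WANTED):
    """Pull the wanted samples out of Prometheus text, summing across label sets."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name = line[:line.index("{")]
            rest = line[line.rindex("}") + 1:].split()
        else:
            name, *rest = line.split()
        if name in wanted and rest:
            values[name] = values.get(name, 0.0) + float(rest[0])
    return values


def build_payload(node_id, seq, values):
    """Identity, sequence, raw numbers. No verdict about health is included."""
    return json.dumps({"id": node_id, "seq": seq, "v": values}).encode("utf-8")


def emit(sock, addr, payload):
    """Fire-and-forget UDP send. A lost datagram just looks like brief silence."""
    sock.sendto(payload, addr)
```

A loop would call `scrape`, `parse_metrics`, `build_payload`, and `emit` once per interval with an incrementing `seq`, using a socket created with `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`. If the process or the box dies, the loop stops and the frames stop with it.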
The frame carries identity, sequence, and raw numbers — never a prompt or a token. Fleet health can flow to a dashboard without ever touching data sovereignty.
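To make "content-free" concrete, here is one possible fixed-size layout. The field names, field widths, and the choice of two measurements are assumptions for illustration; the real frame format is whatever the repo defines. Note what is absent: there is no field that could hold a prompt or a token.

```python
import struct
from typing import NamedTuple

# Hypothetical layout, network byte order:
# 16-byte node id, uint64 sequence, two float64 raw measurements.
FRAME = struct.Struct("!16sQdd")


class Frame(NamedTuple):
    node_id: str
    seq: int
    ttft_ms: float
    vram_free_bytes: float


def encode(frame: Frame) -> bytes:
    """Pack a frame into 40 bytes. Node ids longer than 16 bytes are cut short."""
    return FRAME.pack(
        frame.node_id.encode("ascii"), frame.seq, frame.ttft_ms, frame.vram_free_bytes
    )


def decode(data: bytes) -> Frame:
    """Unpack 40 bytes back into a Frame, stripping the id's null padding."""
    node_id, seq, ttft_ms, vram_free = FRAME.unpack(data)
    return Frame(node_id.rstrip(b"\x00").decode("ascii"), seq, ttft_ms, vram_free)
```

A fixed-size frame of this kind fits comfortably in a single UDP datagram, and a receiver can reject anything with the wrong length before parsing it.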
No NVLink, no shared storage, no heartbeats. Just telemetry crossing NAT and real peering — GCP, Lambda, RunPod — evacuated on silence.
Not a slide deck — a working proof-of-concept, exercised against real GPUs being killed mid-traffic across three independent clouds and three transport legs (localhost, LAN, public internet).
kill -9 of vLLM mid-traffic, zero client errors
Every honest failure is written up too: the VRAM zombie, the framing-dispatch bug, the provider that moved a firewall default under the map. The recurring lesson: two things you assumed were one thing are actually two.
GPU HA was built by a human supervising AI agents — an architect model and a browser-driving engineer. PARACODING: Everything Is Still a DNS Problem is the honest field journal of that: what broke, what it proved, and what "verify, don't claim" actually costs. It's free, DRM-free, right in the repo.
Book 1 of the series — PARACODING: Decoding Human History as Physics, Not Religion — is on Amazon now. Both books, and more, live at paracoding.com.