GPU HA — High-availability failover for LLM inference

The thesis

Where DNS's reach ends

A streaming completion holds a connection open for 5–90 seconds, so DNS is out of the loop the moment the socket opens. And "which worker has the lowest time-to-first-token right now" changes second to second — a decision no TTL can carry. So the design splits cleanly: DNS owns what it's unbeatable at — coarse pool evacuation across regions and clouds — and an L7 hop owns the per-request decision: which worker serves this request, and what happens when it dies mid-stream.

The design

Two tiers, one telemetry frame

The load-bearing idea: both tiers are answering the same liveness question at different resolutions, so both are driven by a single content-free frame through the same ingest code. No worker ever asserts it's healthy — it emits raw numbers and stops when it dies. Silence = death.
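
A minimal sketch of "silence = death", assuming a tracker that records only when each source was last heard from. The class name `LivenessTracker`, the helper `ingest_frame`, and the idea of instantiating it once per tier are illustrative, not taken from the repo; the 10s and 3s windows are the ones quoted on this page.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LivenessTracker:
    """Remembers when each source last sent a frame. There is no 'healthy' flag."""

    window_s: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def ingest(self, source_id: str, now: float) -> None:
        # An arriving frame is the only evidence of life.
        self.last_seen[source_id] = now

    def is_live(self, source_id: str, now: float) -> bool:
        # A source that has never spoken, or has gone quiet, is treated as dead.
        seen = self.last_seen.get(source_id)
        return seen is not None and (now - seen) <= self.window_s


def ingest_frame(pools: LivenessTracker, nodes: LivenessTracker,
                 pool_id: str, node_id: str, now: float) -> None:
    # One frame feeds both tiers; only the window (resolution) differs.
    pools.ingest(pool_id, now)
    nodes.ingest(node_id, now)


# Hypothetical wiring: Tier 1 watches pools, Tier 2 watches nodes.
pools = LivenessTracker(window_s=10.0)
nodes = LivenessTracker(window_s=3.0)
```

Because liveness is derived from arrival time alone, a worker that crashes, hangs, or loses its network path looks the same to both tiers: it simply stops appearing.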

Tier 1 · authoritative DNS

Coarse pool evacuation

Picks a pool. Evacuates a region or cloud whose telemetry goes dark, so the answer never contains a dead pool. When every pool is silent, it fails over to the "fail-whale": a graceful, protocol-correct degraded response instead of a connection error.

fresh pools in the answer • 10s pool eviction window • fail-open to a known-safe set
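
The Tier-1 decision can be sketched as a pure function, under some loud assumptions: the function name `select_answer`, the pool names, and the addresses (taken from the RFC 5737 documentation ranges) are all hypothetical, and the sketch treats the known-safe set as the place the fail-whale lives, which the repo may handle differently.

```python
from __future__ import annotations


def select_answer(
    pool_addrs: dict[str, str],
    last_seen: dict[str, float],
    now: float,
    window_s: float = 10.0,                          # pool eviction window from the page
    known_safe: tuple[str, ...] = ("203.0.113.10",),  # assumed fail-whale address
) -> list[str]:
    """Return the addresses to place in the DNS answer."""
    fresh = [
        addr
        for pool, addr in sorted(pool_addrs.items())
        # A pool with no telemetry at all counts as silent.
        if pool in last_seen and (now - last_seen[pool]) <= window_s
    ]
    # If every pool is silent, fall back to the known-safe set
    # rather than returning an empty answer.
    return fresh or list(known_safe)
```

The point of the fallback is that a client always gets an address that speaks the protocol, even when no GPU pool is reachable.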

Tier 2 · L7 router

Per-request worker selection

One OpenAI-compatible endpoint. Scores live workers on time-to-first-token and VRAM headroom, and honors the first-token contract: failure before the first token is silently retried; after it, a clean truncation — never a counterfeit resume.
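
The first-token contract can be illustrated with a small generator, assuming each worker is modelled as a callable that returns an iterator of tokens. `WorkerDied`, `NoWorkerAvailable`, `TRUNCATED`, and `route_stream` are hypothetical names for this sketch; how the real router signals truncation to an OpenAI-compatible client is not shown here.

```python
from __future__ import annotations

from typing import Callable, Iterable, Iterator

TRUNCATED = "[TRUNCATED]"  # illustrative marker, not a real protocol value


class WorkerDied(Exception):
    """Raised by a worker stream when the backend fails."""


class NoWorkerAvailable(Exception):
    """Raised when every candidate failed before producing a token."""


def route_stream(workers: list[Callable[[], Iterable[str]]]) -> Iterator[str]:
    for start in workers:
        sent_any = False
        try:
            for token in start():
                sent_any = True
                yield token
            return  # the stream completed normally
        except WorkerDied:
            if sent_any:
                # The client has already seen output from this worker.
                # Stop cleanly; never splice in another worker's text.
                yield TRUNCATED
                return
            # Nothing reached the client yet: try the next worker silently.
    raise NoWorkerAvailable("all workers failed before the first token")
```

The asymmetry is deliberate: before the first token a retry is invisible to the client, while after it a second worker's continuation would be a different completion presented as the same one.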

silent pre-token failover • 3s node freshness gate • circuit breaker + anti-herding
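
Worker selection might look something like the following, assuming each worker's latest telemetry has been reduced to a time-to-first-token figure and a free-VRAM fraction. The scoring formula is invented for illustration (the page names the two inputs but not how they are combined), and `WorkerSample`, `score`, and `pick_worker` are hypothetical names. The circuit breaker and anti-herding logic are left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerSample:
    worker_id: str
    ttft_ms: float          # recent time-to-first-token
    vram_free_frac: float   # 0.0 (full) to 1.0 (empty)
    last_seen: float        # arrival time of the latest frame


def score(s: WorkerSample) -> float:
    # Assumed formula: penalise TTFT by how little VRAM headroom is left.
    # Lower is better.
    return s.ttft_ms * (1.0 + (1.0 - s.vram_free_frac))


def pick_worker(samples: list[WorkerSample], now: float,
                freshness_s: float = 3.0) -> str | None:
    # Freshness gate: a worker whose last frame is too old is not a candidate,
    # however good its last reported numbers were.
    fresh = [s for s in samples if (now - s.last_seen) <= freshness_s]
    if not fresh:
        return None
    return min(fresh, key=score).worker_id
```

Applying the gate before scoring matters: a dead worker's final frame may well show the best numbers in the fleet.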

Figure: GPU HA architecture. Client to Tier-1 DNS to Tier-2 L7 router to vLLM workers, with a shared telemetry frame. The actual architecture, straight from the repo: Tier-1 DNS and Tier-2 L7 driven by one shared telemetry ingest. (Diagrams `d1–d6` live in the repo.)

Stock software on the GPU

A tiny sidecar scrapes an unmodified vLLM's /metrics and emits UDP frames. The expensive box stays pristine; all the logic lives on cheap CPU boxes.
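
A rough sketch of such a sidecar, using only the Python standard library. The parser handles the simple `name{labels} value` lines of the Prometheus text format (no timestamps, last series wins per name). The two metric names are ones vLLM exposes, though names vary between versions; the JSON wire format, the `WANTED` list, and the function names are assumptions chosen for readability.

```python
from __future__ import annotations

import json
import socket
import urllib.request

# Illustrative selection; metric names differ across vLLM versions.
WANTED = ("vllm:num_requests_running", "vllm:num_requests_waiting")


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}."""
    out: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]
        if not name:
            continue
        try:
            out[name] = float(value)
        except ValueError:
            continue  # ignore anything that is not a plain sample line
    return out


def build_frame(node_id: str, seq: int, metrics: dict[str, float]) -> bytes:
    # Identity, sequence, raw numbers. No health verdict is computed here.
    raw = {name: metrics[name] for name in WANTED if name in metrics}
    return json.dumps({"id": node_id, "seq": seq, "raw": raw}, sort_keys=True).encode("utf-8")


def scrape_and_emit(metrics_url: str, collector: tuple[str, int],
                    node_id: str, seq: int) -> None:
    # If the scrape fails, the exception propagates and no frame is sent,
    # so a dead vLLM shows up downstream as silence (an assumed behaviour).
    with urllib.request.urlopen(metrics_url, timeout=1.0) as resp:
        text = resp.read().decode("utf-8")
    frame = build_frame(node_id, seq, parse_metrics(text))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(frame, collector)
```

Nothing in this loop needs to be linked into or patched onto the inference server, which is what keeps the GPU box stock.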

Content-free by design

The frame carries identity, sequence, and raw numbers — never a prompt or a token. Fleet health can flow to a dashboard without ever touching data sovereignty.
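
One compact way to lay out such a frame, shown purely as an illustration: a fixed header with a 16-byte node id and a 64-bit sequence number, followed by a run of doubles. The layout, the field sizes, and the names `encode`, `decode`, and `missing_frames` are assumptions; the repo's actual wire format is not described on this page.

```python
from __future__ import annotations

import struct

# Network byte order: 16-byte id (null-padded) + unsigned 64-bit sequence.
HEADER = struct.Struct("!16sQ")


def encode(node_id: str, seq: int, values: list[float]) -> bytes:
    ident = node_id.encode("ascii")
    if len(ident) > 16:
        raise ValueError("node_id must fit in 16 bytes")
    # Only numbers follow the header: there is no field that could hold
    # a prompt or a token.
    return HEADER.pack(ident, seq) + struct.pack(f"!{len(values)}d", *values)


def decode(frame: bytes) -> tuple[str, int, list[float]]:
    ident, seq = HEADER.unpack_from(frame)
    count = (len(frame) - HEADER.size) // 8
    values = list(struct.unpack_from(f"!{count}d", frame, HEADER.size))
    return ident.rstrip(b"\0").decode("ascii"), seq, values


def missing_frames(prev_seq: int, seq: int) -> int:
    # Sequence numbers let the receiver tell loss apart from silence.
    # Counting per missing frame is this sketch's choice.
    return max(0, seq - prev_seq - 1)
```

A schema with no free-text field makes the content-free property structural rather than a matter of policy.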

Multi-cloud, over the internet

No NVLink, no shared storage, no heartbeats. Just telemetry crossing NAT and real peering — GCP, Lambda, RunPod — evacuated on silence.

The proof

Drilled on real GPUs, then torn down to $0

Not a slide deck — a working proof-of-concept, exercised against real GPUs being killed mid-traffic across three independent clouds and three transport legs (localhost, LAN, public internet).

5/5 requests survived a kill -9 of vLLM mid-traffic, zero client errors

260/260 zero-error re-run after fixing the "accept-then-hang" grey-failure

88/88 host-level failover across a NAT'd worker on a second cloud

185 UDP seq-gaps that appeared only at silence: "silence = death" with a number attached

Every honest failure is written up too — the VRAM zombie, the framing-dispatch bug, the provider that moved a firewall default under the map. The recurring lesson: two things you assumed were one thing are actually two.

The story · part of the Paracoding series

There's a book about how this got built

GPU HA was built by a human supervising AI agents — an architect model and a browser-driving engineer. PARACODING: Everything Is Still a DNS Problem is the honest field journal of that: what broke, what it proved, and what "verify, don't claim" actually costs. It's free, DRM-free, right in the repo.

Read the PDF (free) • EPUB • paracoding.com

Book 1 of the series — PARACODING: Decoding Human History as Physics, Not Religion — is on Amazon now. Both books, and more, live at paracoding.com.

PARACODING · Book 2

Everything Is Still a DNS Problem

Scott McDonald