On April 17, 2026, Anthropic's automated safeguards suspended a legitimate enterprise organization with sixty-plus accounts. No warning, no explanation, and the only path to appeal was a Google Form. Access came back five hours later, after a Twitter thread. The story is not unusual; vendor incidents happen to every infrastructure provider on the internet. What was unusual was that, for a meaningful number of customers, an entire workflow stack sat on top of a single model API with no other path to a usable answer.
For a CISO running internally-built security agents, five hours of frontier-model downtime is not a degraded experience. It is a containment that did not happen. An identity that was not revoked. A ticket that was not opened. The board does not care which vendor was at fault.
The Single-Vendor Trap
Most agent code today imports the anthropic or openai SDK directly. The credential lives in an environment variable. The model id is hard-coded. When the vendor returns a 5xx, the SDK retries internally; when the vendor returns a 401 or a 403, the SDK gives up and the exception bubbles to the agent, which crashes or alerts.
This is the same pattern that gave us the “Stripe is down, the whole product is down” era of payments — and the same pattern that the rest of the industry solved by putting an HA layer in front of the vendor. Frontier models have not had that layer because, until very recently, no one needed it. They do now.
What “HA” Actually Means for an LLM
HA for an HTTP service is well understood: load-balance across replicas, fail over on transient errors, circuit-break on persistent ones. HA for an LLM is the same shape with two important differences.
First, the failures look different. 5xx and timeouts behave like any other web service, but 401 and 403 are no longer rare authentication mistakes — they are the most likely shape of a vendor-side suspension. 429 is not just rate-limiting; it is a noisy-neighbor problem you cannot fix from your side. Failover has to treat all of these as transient, not as caller errors.
Second, the failures that look transient sometimes are not. A 400 invalid_request means the payload itself is wrong. A content-policy refusal means the prompt itself was rejected. Failing those over to the secondary provider does not help — the secondary will reject them too — and silent re-routing of a content-policy decision is exactly the kind of thing your safety review will not approve.
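The two cases above reduce to a small classification step. What follows is a minimal sketch, not Arx's actual code: the names FailureKind and classify_failure are hypothetical, and treating any unrecognized failure as deterministic is an assumption made here so that nothing is re-routed silently.

```python
from enum import Enum


class FailureKind(Enum):
    FAILOVER = "failover"            # try the next provider
    DETERMINISTIC = "deterministic"  # re-raise to the caller


def classify_failure(status_code=None, timed_out=False, content_policy=False):
    """Decide whether a failed LLM call should fail over or be re-raised."""
    # A content-policy refusal is a decision about the prompt, not the vendor.
    if content_policy:
        return FailureKind.DETERMINISTIC
    if timed_out:
        return FailureKind.FAILOVER
    # 401/403 may be a vendor-side suspension; 429 may be a noisy neighbor.
    if status_code in (401, 403, 429):
        return FailureKind.FAILOVER
    if status_code is not None and 500 <= status_code <= 599:
        return FailureKind.FAILOVER
    # 400 invalid_request and anything unrecognized (assumption): no re-routing.
    return FailureKind.DETERMINISTIC
```

The content-policy check comes first on purpose: a refusal has to stay a refusal no matter which status code carries it.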
How Arx Implements It
Every LLM call inside Arx flows through a vendor-neutral router. Callers do not pick a vendor; they pick a tier (frontier, fast, cheap) and a normalized request shape. The router resolves the tier to a provider-specific model id, decrypts the customer's per-org API key from the platform vault, and tries the primary provider. On a transient failure (401, 403, 429, a 5xx that persists through retries, or a timeout), the router records the failure, advances to the next provider, and retries. On a deterministic failure (400 or a content-policy refusal), the router re-raises immediately and the agent sees the rejection.
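The failover loop can be sketched in a few lines. This is an illustration under stated assumptions, not the router's real interface: LLMError, RouteResult, route, and the two stub providers are hypothetical names, and tier resolution and key decryption are left out.

```python
from dataclasses import dataclass, field


class LLMError(Exception):
    """Hypothetical normalized error; failover=True marks 401/403/429/5xx/timeout."""

    def __init__(self, kind, status_code=None, failover=False):
        super().__init__(kind)
        self.kind = kind
        self.status_code = status_code
        self.failover = failover


@dataclass
class RouteResult:
    provider_used: str
    response: str
    attempts: list = field(default_factory=list)

    @property
    def failover_hops(self):
        # One hop for every provider that failed before the one that answered.
        return len(self.attempts) - 1


def route(providers, request):
    """Try each (name, call) pair in order; fail over only on eligible errors."""
    attempts = []
    for name, call in providers:
        try:
            response = call(request)
        except LLMError as err:
            attempts.append(
                {"provider": name, "kind": err.kind, "status_code": err.status_code}
            )
            if not err.failover:
                raise  # deterministic failure: the agent sees the rejection
            continue
        attempts.append({"provider": name, "ok": True})
        return RouteResult(name, response, attempts)
    raise LLMError("AllProvidersFailed", failover=True)


# Usage with stub providers standing in for real SDK calls.
def anthropic_stub(request):
    raise LLMError("LLMAuthError", status_code=403, failover=True)


def openai_stub(request):
    return "ok: " + request


result = route([("anthropic", anthropic_stub), ("openai", openai_stub)], "hello")
```

Here the 403 from the first stub is recorded and the second stub serves the call, so result.provider_used is "openai" and result.failover_hops is 1.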
A Redis-backed circuit breaker tracks failure counts per provider across all worker processes. After five failures in sixty seconds the breaker opens, and for the next two minutes the router stops sending requests to that provider entirely. When the cooldown elapses, one probe gets through; success resets the breaker, failure re-opens it. This is the part that matters during a real outage: when Anthropic is down for hours, you do not want every worker discovering that fact independently every five seconds.
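The breaker's state machine is easier to see in code. The sketch below keeps state in process memory purely for illustration; the production breaker described above shares its counters through Redis, which this example does not attempt. The class name, method names, and injectable clock are assumptions; the thresholds are the ones given above.

```python
import time


class CircuitBreaker:
    """In-memory sketch: 5 failures in 60 s opens the breaker for 120 s."""

    def __init__(self, threshold=5, window_s=60.0, cooldown_s=120.0,
                 clock=time.monotonic):
        self.threshold = threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = []     # timestamps of recent failures
        self.opened_at = None  # None means the breaker is closed
        self.probing = False   # True while the single probe is in flight

    def allow(self):
        """Return True if a request may be sent to this provider now."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at < self.cooldown_s:
            return False
        if self.probing:
            return False  # a probe is already out; hold everyone else back
        self.probing = True
        return True

    def record_success(self):
        # A success (including a successful probe) closes the breaker.
        self.failures.clear()
        self.opened_at = None
        self.probing = False

    def record_failure(self):
        now = self.clock()
        if self.probing:
            # The probe failed: re-open and restart the cooldown.
            self.opened_at = now
            self.probing = False
            return
        # Keep only failures inside the sliding window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.window_s]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now
```

A router would call allow() before each attempt and skip any provider that returns False. Injecting the clock keeps the cooldown logic testable without sleeping.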
Every call writes one row to the immutable audit log with three new fields: provider_used, failover_hops, and an attempts array containing each provider's outcome. None of this is opaque to the customer.
action_type: llm.chat
connector: llm
model_tier: frontier
provider_used: openai
failover_hops: 1
attempts: [
  { provider: "anthropic", kind: "LLMAuthError", status_code: 403 },
  { provider: "openai", ok: true, latency_ms: 712 }
]
usage: { total_tokens: 914 }
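A row like the one above can be derived entirely from the ordered list of attempts. The helper below is a hypothetical sketch, not the platform's writer: build_audit_row is an invented name, and two behaviors are assumptions made here, namely that provider_used is None when no provider succeeded and that failover_hops counts the advances from one provider to the next.

```python
def build_audit_row(model_tier, attempts, total_tokens):
    """Assemble one audit row from the ordered list of provider attempts."""
    served = [a for a in attempts if a.get("ok")]
    return {
        "action_type": "llm.chat",
        "connector": "llm",
        "model_tier": model_tier,
        # Assumption: None when every provider failed.
        "provider_used": served[-1]["provider"] if served else None,
        # Assumption: one hop per advance to the next provider.
        "failover_hops": max(len(attempts) - 1, 0),
        "attempts": list(attempts),
        "usage": {"total_tokens": total_tokens},
    }


row = build_audit_row(
    "frontier",
    [
        {"provider": "anthropic", "kind": "LLMAuthError", "status_code": 403},
        {"provider": "openai", "ok": True, "latency_ms": 712},
    ],
    914,
)
```

Deriving provider_used and failover_hops from attempts, rather than storing them separately, keeps the three fields from disagreeing with each other.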
What Customers Configure
The defaults work out of the box: Anthropic primary, OpenAI failover. Customers who prefer the reverse flip the order from a settings page. Keys are stored encrypted per-org and can be rotated without a redeploy. Per-tier model overrides are JSON; an enterprise that wants to lock to claude-opus-4-7 for change-control reasons can do so without touching code.
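Tier resolution with a JSON override might look like the following. This is a sketch under assumptions: the shape of the override document (provider, then tier, then model id) is invented for illustration, resolve_model is a hypothetical name, and the default ids are placeholders rather than real model names. Only claude-opus-4-7 comes from the text above.

```python
import json

# Placeholder defaults (hypothetical); these are not real model ids.
DEFAULT_MODELS = {
    "anthropic": {
        "frontier": "anthropic-frontier-default",
        "fast": "anthropic-fast-default",
        "cheap": "anthropic-cheap-default",
    },
    "openai": {
        "frontier": "openai-frontier-default",
        "fast": "openai-fast-default",
        "cheap": "openai-cheap-default",
    },
}


def resolve_model(provider, tier, overrides_json="{}"):
    """Map (provider, tier) to a model id, preferring the org's JSON override."""
    if provider not in DEFAULT_MODELS or tier not in DEFAULT_MODELS[provider]:
        raise ValueError(f"unknown provider/tier: {provider}/{tier}")
    overrides = json.loads(overrides_json)
    return overrides.get(provider, {}).get(tier, DEFAULT_MODELS[provider][tier])


# An org pinning the frontier tier for change-control reasons.
pinned = '{"anthropic": {"frontier": "claude-opus-4-7"}}'
model_id = resolve_model("anthropic", "frontier", pinned)
```

With the override in place, model_id is "claude-opus-4-7", while every tier the org did not pin still resolves to its default.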
None of this is exposed in the agent's API. The agent calls llm.chat() and receives a normalized response. The handoff — if any — is invisible to the agent and visible to compliance. That is the right division.
What This Is Not
This is not a router that promises a model is “better” than another model. It is not a cost-optimizer. It does not learn what kind of prompt to send to which vendor. Those are different products with different failure modes — and most of them, in our view, should not exist inside a security platform at all. Routing on cost or quality silently changes the outputs your audit row is bound to. Routing on availability does not.
An HA pair is plumbing. It has one job: when one vendor stops answering the phone, the other one picks up, and the audit log says exactly who served the call. That is enough.