Stop Choosing Between MCP and CLIs. Build a Harness That Makes Transport Boring.


Most agent infrastructure debates follow a familiar pattern. Teams argue about the protocol, commit to a direction, build toward it for six months, and then discover the real problem was never the protocol at all.

The MCP vs. CLI argument is exactly that debate. I’ve watched it play out at enough teams now to be confident that the ones losing the most time are not losing it because they picked wrong. They’re losing it because both choices, pushed past local iteration and into production platform infrastructure, hit structural walls that a different protocol choice wouldn’t have prevented. The walls are just in different places.

The question worth asking is not which transport to standardize on. It’s why you’re building systems where the transport choice matters to the agent at all.

How We Got Here

When MCP launched, there were genuine reasons to pay attention. Structured schemas, explicit tool definitions, auth flows that could survive a security review. For teams building in regulated environments it gave the governance layer something real to hold onto.

The early experience of actually running it was rough. Client restarts on configuration changes. Custom JSON-RPC server scaffolding for every capability a team wanted to expose. Debugging tooling that barely existed. For developers trying to move fast on local machines, the overhead was hard to justify when the alternative was handing an agent a bash tool and getting immediate results.

So the pendulum swung. Claude Code and Cursor made raw terminal access feel natural, and a new default emerged: give the agent a shell and stay out of the way. For local prototyping it works well. Then teams try to take it to production.

The CLI path has two failure modes that don’t show up until scale. The first is infrastructure cost. Maintaining stateful container sessions for operations that are fundamentally stateless API calls (reading a document, querying a SaaS platform, triggering a pipeline) is heavy engineering for what you’re getting. The second is the attack surface. Indirect prompt injection is not a theoretical concern. It’s a documented attack class where malicious content in a webpage, a file, or an email body manipulates an agent into executing destructive shell commands. Raw terminal access gives that attack vector a very large target.

The MCP path has its own failure mode at scale. Maintaining separate server processes for every capability, including the ones that are thin wrappers around a single REST API, creates operational overhead that compounds as the integration catalog grows. Teams that went all-in on MCP describe the same symptom: more engineering time on integration scaffolding than on the agent logic that was supposed to be the point.

The underlying reality of both paths is the same. Most CLIs are wrapping APIs. Most MCP servers are proxying APIs. The data is identical. The actions are identical. Only the transport layer differs. Once you see that clearly, the debate collapses. You don’t need to pick the right transport. You need an orchestration layer that treats transport as an implementation detail.

What the Harness Needs to Do

The concept is simple to state. Your orchestration layer accepts any capability format, normalizes it, and hands the model a unified tool catalog. MCP server, OpenAPI spec, GraphQL endpoint, REST API, shell script: the harness ingests all of it. The agent never knows or cares what’s executing underneath.

Building it well is a serious engineering investment, and I want to be honest about that because most writing on this topic glosses over the cost. If your team is early-stage and running a handful of integrations, the wrong move is to build this now. Use whatever is fastest. But if you are building foundational enterprise infrastructure, this is not a premature optimization. Even if frontier model context windows become effectively infinite or native tool execution improves, a regulated enterprise can never allow an LLM to directly execute unaudited, non-rate-limited system calls. This abstraction layer isn’t for the model. It’s for the engineers who have to govern the platform. You pay for the abstraction either way. The question is whether you design it or inherit it.

There are five concrete engineering problems the harness has to solve. Each one has a failure mode that bites in production and doesn’t announce itself clearly in a local development environment.

The Tool RAG Problem

The naive approach to agent tooling loads every available tool definition into the context window at startup. At small scale this is fine. Once you’re running dozens of integrations with hundreds of tool definitions, it breaks reasoning and burns tokens before the agent has done anything useful.

The standard fix is dynamic tool retrieval: maintain a semantic index of all registered tools and inject only the definitions relevant to the current task. The agent working a code review task gets GitHub and Jira tool schemas. The same agent answering a question about customer health gets Salesforce and analytics schemas. Same harness, radically different effective tool surface.

The problem with naive semantic retrieval is embedding collapse. Tool schemas are short, highly technical, and domain-specific. Vector embeddings of a custom internal shell script named verify-cluster-state.sh and a standard cloud health check tool may land close enough in embedding space that retrieval pulls the wrong one. The agent calls a tool that isn’t there, or worse, calls the wrong one silently.

The harness needs a two-tier routing system. Deterministic, namespace-based matching handles explicit tool references first and bypasses retrieval entirely. Fuzzy semantic retrieval handles everything else, but that vector index needs to be fed by more than just schema text. Historical tool-calling telemetry, runtime system state, and synthetic intent variations all need to flow into the index to make retrieval reliable on the long tail of real user queries. Building the index is one engineering task. Keeping it calibrated is an ongoing operational one.
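
To make the two tiers concrete, here is a minimal sketch. It assumes tools carry namespaced names like github.create_pull_request, and it uses plain token overlap as a stand-in for a real embedding model. The Tool class, the route function, and the sample catalog are all hypothetical names for illustration, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str          # namespaced, e.g. "github.create_pull_request" (assumed convention)
    description: str


def _tokens(text: str) -> set[str]:
    # Crude tokenizer; a stand-in for a real embedding model.
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return set(cleaned.split())


def route(intent: str, catalog: list[Tool], k: int = 3) -> list[Tool]:
    # Tier 1: deterministic match on an explicit namespaced reference.
    # If the intent names a tool outright, retrieval is bypassed entirely.
    exact = [t for t in catalog if t.name.lower() in intent.lower()]
    if exact:
        return exact

    # Tier 2: fuzzy retrieval. Jaccard overlap stands in for vector similarity.
    query = _tokens(intent)

    def score(tool: Tool) -> float:
        doc = _tokens(tool.name + " " + tool.description)
        union = query | doc
        return len(query & doc) / len(union) if union else 0.0

    ranked = sorted(catalog, key=score, reverse=True)
    return [t for t in ranked[:k] if score(t) > 0]


# Hypothetical catalog for illustration only.
CATALOG = [
    Tool("github.create_pull_request", "Open a pull request on GitHub"),
    Tool("infra.verify_cluster_state", "Check internal cluster state"),
    Tool("cloud.health_check", "Run the standard cloud health check"),
]
```

The point of the ordering is that an explicit reference never gets a chance to be outvoted by a near neighbor in embedding space. In a real system the second tier would query your vector index, enriched with the telemetry described above.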

For engineers: The fastest way to detect embedding collapse in your tool index is to build an offline eval harness before you go to production. Take a sample of real or synthetic agent intents, run them through retrieval, and manually verify the results. The right k to measure depends on your injection strategy: if you’re doing single-tool selection, precision@1 is what matters; if you’re injecting a ranked set for the model to choose from, measure precision@3. Either way, pay specific attention to internal tools with opaque names. If the correct tool isn’t ranking at the top for your most common intent patterns, you need namespace-based fallback routing before you ship. Retrofitting it after the first production mismatch is significantly more painful.
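
A rough sketch of that offline eval follows. One assumption to flag: with a single correct tool per intent, what I compute here is strictly a hit rate at k (did the expected tool appear in the top k), which is how I read precision@1 and precision@3 in this setting. The retrieve callable is a placeholder for whatever your retrieval layer exposes; its signature is hypothetical.

```python
from typing import Callable, Sequence

# (intent text, expected tool name), labeled by hand or generated synthetically.
Case = tuple[str, str]
# Hypothetical retriever signature: (intent, k) -> ranked tool names.
Retriever = Callable[[str, int], list[str]]


def hit_rate_at_k(cases: Sequence[Case], retrieve: Retriever, k: int) -> float:
    """Fraction of intents whose expected tool appears in the top-k results."""
    if not cases:
        raise ValueError("need at least one labeled case")
    hits = sum(1 for intent, expected in cases if expected in retrieve(intent, k)[:k])
    return hits / len(cases)


def misses(cases: Sequence[Case], retrieve: Retriever, k: int) -> list[Case]:
    """Labeled cases the retriever got wrong; these are the ones to review by hand."""
    return [(i, e) for i, e in cases if e not in retrieve(i, k)[:k]]
```

The misses list is usually more useful than the headline number. If it is dominated by internal tools with opaque names, that is the signal to add namespace-based routing before shipping.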

The Distributed Shell Memory Problem

Treating CLI execution as a stateless API call breaks immediately when an agent does sequential work. The agent runs cd /var/www/html in one turn and npm install in the next. A naive stateless harness runs that second command in a fresh process, in whatever the default working directory happens to be, not in /var/www/html. The failure is silent if you’re not watching closely, and you’re usually not watching closely in production.

The obvious fix, maintaining long-lived stateful container sessions per agent task, creates a different problem. In an auto-scaling environment, sticky sessions pinned to specific gateway instances become architectural bottlenecks. Under load they become single points of failure.

The approach that holds up is to push statefulness into the execution payload, not the infrastructure. The orchestration engine compiles sequential local actions into atomic, multi-command execution blocks (cd /path && npm install). When an agent emits a terminal intent, the engine injects the full environment state into an ephemeral container token for that specific execution. To prevent Turn 2 from running against a blank slate where previous file modifications are missing, the runtime containers attach to a shared persistent volume anchored to the agent’s session ID, whether that’s a ReadWriteMany PVC in a Kubernetes environment, an NVMe-over-Fabrics pool if your infrastructure supports it, or equivalent shared block storage. The compute layer stays stateless and disposable, while the file state follows the session deterministically. To keep this stable under concurrent operations, the harness enforces a serialized FIFO queue per session ID. If an agent requires parallel sub-agents, they spin up isolated sub-sessions with cloned ephemeral volumes rather than competing for a single file lock.
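
Here is a minimal sketch of the two mechanical pieces: compiling session state plus commands into one atomic block, and serializing blocks per session. SessionState, compile_block, and SessionQueues are hypothetical names, and the queue is in-memory only for illustration; a real harness would back it with a durable task queue. Container launch and volume attachment are out of scope here.

```python
import shlex
from collections import defaultdict, deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SessionState:
    cwd: str = "/"
    env: dict[str, str] = field(default_factory=dict)


def compile_block(state: SessionState, commands: list[str]) -> str:
    """Build one self-contained shell string that carries cwd and env with it."""
    parts = [f"cd {shlex.quote(state.cwd)}"]
    # Assumes env keys are valid shell identifiers; values are quoted.
    parts += [f"export {k}={shlex.quote(v)}" for k, v in sorted(state.env.items())]
    parts += commands
    return " && ".join(parts)


class SessionQueues:
    """One FIFO per session ID so blocks for a session never interleave."""

    def __init__(self) -> None:
        self._queues: dict[str, deque[str]] = defaultdict(deque)

    def submit(self, session_id: str, block: str) -> None:
        self._queues[session_id].append(block)

    def next_block(self, session_id: str) -> Optional[str]:
        queue = self._queues[session_id]
        return queue.popleft() if queue else None
```

With cwd set to /var/www/html and a single npm install command, compile_block produces the same kind of cd /path && npm install string described above, and any disposable container can run it without knowing what came before.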

For engineers: The hard part is the handoff when a long-running agent task spans a gateway restart or a deployment. Store environment variables and the current working directory path in your task queue’s state layer or a dedicated Redis key tied to the session ID, not in gateway instance memory. Any state that resets on process restart will surface as a broken terminal session at the worst possible time.

The Safety Isolation Problem

Text-based safety filtering for shell access does not work in production. Blocking rm -rf at the string-matching layer is the start of a cat-and-mouse game that Linux utilities will win. Destructive commands can be expressed through environment variable expansion, base64-encoded payloads, shell aliases, and combinations of individually benign-looking operations. An agent can reach the same destructive outcome through a dozen paths that pass naive text filters.

The harness has to enforce isolation at the kernel level, not the text layer. Sandboxed execution environments using gVisor or WebAssembly runtimes, combined with kernel-level system call monitoring via eBPF, enforce least-privilege Unix permissions at the OS layer. The destructive operation is blocked before it executes, regardless of how it was expressed.

This matters beyond the obvious security case. It’s the only architecture that lets you apply uniform safety policies across all tool types. A CLI command and an MCP server action can be subject to the same policy enforcement because the enforcement happens below the transport layer. A tool tagged destructive: true requires human-in-the-loop confirmation whether it came in as a shell command or a structured MCP call. You write the policy once and it applies everywhere.
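
As a sketch of what write the policy once looks like in code, assuming the destructive tag is attached at registration time: the policy function below receives the transport but never reads it. ToolCall, Decision, and evaluate are hypothetical names, and this covers only the human-in-the-loop gate, not the kernel-level enforcement.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_CONFIRMATION = "needs_confirmation"


@dataclass(frozen=True)
class ToolCall:
    tool_name: str
    transport: str      # "cli", "mcp", "openapi", ... (informational only)
    destructive: bool   # tag assumed to be set when the tool is registered


def evaluate(call: ToolCall, human_approved: bool = False) -> Decision:
    # The transport field is deliberately never consulted: the same rule
    # applies to a shell command and a structured MCP call.
    if call.destructive and not human_approved:
        return Decision.NEEDS_CONFIRMATION
    return Decision.ALLOW
```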

For engineers: gVisor latency overhead varies significantly by workload type. For syscall-light tasks, the penalty is often 20 to 30 percent. For I/O-heavy tasks like running package managers or grepping large log volumes, the overhead can reach 2x to 5x in practice, and pre-warmed pools only partially offset that. Profile your specific workloads before committing to gVisor across all tool types. A hybrid model (gVisor for high-risk tool categories, standard containers with eBPF monitoring for read-heavy operations) often produces better overall performance than a uniform policy.

The Auth Mapping Problem

CLI tooling handles authentication through local credential files, environment variables, or session state baked into the container. This works on a developer’s laptop. At platform scale it produces a credential management situation that is genuinely hard to audit. When something goes wrong at 2am, “which credential was in scope for that container during that agent turn” is not a question you want to be reconstructing from scattered config files.

The harness handles identity at the platform layer. Enterprise authentication (OIDC, OAuth 2.0, service account tokens) maps to each registered endpoint at registration time. When an agent calls a tool, the harness executes a least-privilege token exchange, taking the enterprise identity token and generating an ephemeral token scoped strictly to the resource, path, and duration required for that single action. The agent never touches credentials directly.

The latency problem with this approach is the synchronous round-trip to the identity provider on every tool call. Against AWS STS or HashiCorp Vault at high call volume, that overhead compounds. The harness needs an encrypted token cache local to the execution gateway, holding ephemeral tokens for the duration of the agent’s task session. When the central identity layer detects a policy violation mid-turn, it broadcasts an invalidation event across the fleet. Because a distributed pub/sub broadcast can fail during a network partition, gateways use pessimistic validation: cached tokens carry ultra-short TTLs, typically under sixty seconds, with a deterministic cryptographic check as the fallback if the gateway loses contact with the auth cluster.
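
A minimal sketch of the gateway-local cache, under a few loud assumptions: encryption at rest is omitted, the mint callable stands in for the real token exchange against your identity provider, and the 30-second default is just a value under the sixty-second ceiling mentioned above. TokenCache and its methods are hypothetical names.

```python
import time
from typing import Callable

# Hypothetical signature for the real token exchange: (session_id, resource) -> token.
Mint = Callable[[str, str], str]


class TokenCache:
    def __init__(self, ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[tuple[str, str], tuple[str, float]] = {}

    def get(self, session_id: str, resource: str, mint: Mint) -> str:
        key = (session_id, resource)
        now = self._clock()
        cached = self._entries.get(key)
        # Pessimistic validation: an expired entry is never served,
        # even if no invalidation event ever arrived.
        if cached is not None and cached[1] > now:
            return cached[0]
        token = mint(session_id, resource)
        self._entries[key] = (token, now + self._ttl)
        return token

    def invalidate_session(self, session_id: str) -> None:
        # Called when the identity layer broadcasts a policy violation.
        for key in [k for k in self._entries if k[0] == session_id]:
            del self._entries[key]
```

The short TTL is what makes a missed broadcast survivable: the worst case is one TTL window of exposure, not the remainder of the session.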

The TTL creates a real problem for long-running agent tasks. A ten-minute task will cycle through several token refreshes mid-turn. A process gets its own copy of the environment when it is launched, and nothing outside the process can update that copy afterward, so a credential injected as an environment variable at startup goes stale at the first refresh and every later call fails against the old token. The harness has to handle those refreshes transparently using file-mounted tokens or a local loopback credential proxy inside the sandbox, updating the secret on disk dynamically so active processes can read the fresh token without blocking the agent or passing credentials through the execution context where the model could see them. This needs to be designed explicitly, not bolted on as an afterthought.
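
The file-mounted variant can be sketched in a few lines. The assumptions: the refresher and the tool process share a mounted directory, and the token path is whatever your sandbox mounts (the functions below take it as a parameter rather than guessing one). write_token and read_token are hypothetical helper names.

```python
import os
import tempfile
from pathlib import Path


def write_token(path: Path, token: str) -> None:
    """Atomically replace the token file so readers never see a partial write."""
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".token-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(token)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.chmod(tmp_name, 0o600)
        os.replace(tmp_name, path)  # atomic rename over the old token
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_token(path: Path) -> str:
    """Re-read on every use; never copy the value into an environment variable."""
    return path.read_text().strip()
```

The rename is the important part. A process that opens the file mid-refresh sees either the whole old token or the whole new one, never a truncated mix.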

For engineers: The audit trail this architecture produces is one of the things that makes the difference in enterprise security reviews. Every tool call is authenticated through a single identity layer, which means every tool call appears in a single audit log regardless of whether the underlying tool is an MCP server, a shell command, or a direct REST call. If you are building for enterprise deployment, design the audit schema first and work backward to the token exchange architecture. The schema you choose determines what questions you can answer after an incident, and you will have incidents.

The Complexity Centralization Problem

This is the one that’s worth being direct about, because it doesn’t appear in most architectural writeups on this topic.

Building a transport-agnostic harness does not eliminate integration complexity. It centralizes it. You are trading distributed technical debt spread across twelve ad-hoc integrations for a single, well-designed system that you now own and must maintain. If the harness has a bug, everything breaks at once. If the schema normalization layer has an edge case, it affects every tool in the catalog simultaneously.

The way to manage this is strict interface contracts between the translation layer and the core orchestration logic. Each transport adapter (OpenAPI, MCP, GraphQL, raw CLI) owns the translation from its native format into the harness’s internal semantic contract. To minimize blast radius, these adapters must be decoupled from the core routing logic as independently deployable, isolated micro-modules. When a specific upstream tool updates its specification, only that micro-adapter changes and rolls out. If an edge case brings down a single adapter, it degrades that capability in isolation and leaves the orchestration core unaffected.

Adapter versioning is necessary but not sufficient. The harder problem is runtime schema drift: the OpenAPI adapter gets updated to v2 of a tool’s spec while the retrieval index was built against v1. The agent gets stale tool definitions injected and either calls with wrong parameters or gets a tool call error that’s difficult to trace back to an index freshness problem. The retrieval index needs to be versioned alongside the adapter registry, with a reconciliation job that detects mismatches and either reindexes or flags the discrepancy before it surfaces as a production failure.
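
The core of that reconciliation job is a plain comparison between two version maps. This sketch assumes both the adapter registry and the retrieval index can report a schema version per tool name; how you obtain those maps is specific to your stack, and find_drift is a hypothetical name.

```python
def find_drift(adapter_versions: dict[str, str],
               index_versions: dict[str, str]) -> dict[str, list[str]]:
    """Compare schema versions in the adapter registry against the retrieval index."""
    return {
        # Indexed against an older (or simply different) spec than the adapter serves.
        "stale": sorted(
            tool for tool, version in adapter_versions.items()
            if tool in index_versions and index_versions[tool] != version
        ),
        # Registered but never indexed: the agent cannot retrieve it at all.
        "missing": sorted(t for t in adapter_versions if t not in index_versions),
        # Indexed but no longer registered: retrieval can inject a dead tool.
        "orphaned": sorted(t for t in index_versions if t not in adapter_versions),
    }
```

Anything in stale or missing is a reindex candidate. Anything in orphaned should be dropped from the index before the agent gets a definition for a tool that no longer exists.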

This also means the harness is the right place to invest in observability. Log every tool call with the transport type, the adapter version, the schema version used for retrieval, and the execution result. Loose application logs are not enough here. Tracing an error across an abstraction layer, a local loopback auth proxy, and a gVisor sandbox requires an explicit distributed tracing framework using OpenTelemetry context propagation. The agent session’s trace context needs to propagate unchanged across the entire execution boundary, with the session as the root span, from the LLM orchestration layer down through the sandbox kernel. Without that telemetry thread, debugging transient latency spikes or timeouts under load is close to impossible.

The Structural Trade-offs

Worth being explicit about what you’re accepting when you build this.

Raw terminal access lets an agent chain complex operations through standard Unix pipes in a single turn. The harness breaks that open-ended composability by design. You’re choosing predictability and policy enforcement over multi-stage shell piping. That’s a conscious trade, not a limitation to engineer around.

You also pay a token cost for version parity. Models carry pre-existing training on common CLIs like kubectl and aws-cli, which means raw terminal execution costs nothing in context overhead for well-known stable tools. But that same training data is the source of hallucinated flag syntax on less common tools or tools that have changed their interface since the training cutoff. Dynamic tool injection via RAG costs tokens on every turn, but it ensures the agent is working from your actual current spec rather than a model’s best guess at it. For internal tooling with no training signal at all, the RAG cost is not a tradeoff. It’s the only way to give the agent accurate tool knowledge.

The distributed vs. centralized failure mode tradeoff is the one teams underestimate most. Fragmented per-team integrations accumulate quietly. The harness breaks loudly, all at once, when something goes wrong. That’s actually the better failure mode for a platform you’re operating at scale, but it requires a different kind of operational discipline than most teams have built before they need it.

The Multi-Tenancy Problem

Most architectural writing on harnesses assumes a single team owns the full tool catalog. At enterprise scale that assumption breaks quickly. When multiple product teams share the same harness, you have three problems that don’t exist in the single-tenant case.

Tool catalog isolation: Team A’s internal Salesforce integration should not be retrievable by Team B’s agent. The retrieval index needs tenant-scoped partitioning with strict, tenant-isolated vector namespaces. Naming conventions drift and get violated; partitioning has to be enforced at the infrastructure layer, not in a shared index where a poorly named tool from one tenant can pollute retrieval results for another. Before tools hit the model’s context window, the translation layer should prepend deterministic tenant prefixes to the tool names themselves (for example, tenantA_get_user). Without explicit namespacing, cross-tenant semantic drift will cause embedding collapse and your dynamic tool RAG will pull the wrong endpoint silently.
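
A minimal sketch of tenant-scoped registration and lookup, using the tenantA_get_user prefix convention from the example above. TenantCatalog is a hypothetical name, and plain dictionaries stand in for the per-tenant vector namespaces a real index would use.

```python
class TenantCatalog:
    """Each tenant gets its own partition; there is no shared lookup path."""

    def __init__(self) -> None:
        self._by_tenant: dict[str, dict[str, str]] = {}

    def register(self, tenant: str, tool_name: str, description: str) -> str:
        # The deterministic prefix is what the model sees in its context window.
        qualified = f"{tenant}_{tool_name}"
        self._by_tenant.setdefault(tenant, {})[qualified] = description
        return qualified

    def visible_tools(self, tenant: str) -> dict[str, str]:
        # Only the caller's own partition is ever read. Returning a copy
        # keeps callers from mutating the catalog in place.
        return dict(self._by_tenant.get(tenant, {}))
```

Because visible_tools never touches another tenant's partition, a badly named tool in one tenant cannot show up in another tenant's results, whatever its embedding looks like.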

Auth boundary enforcement: The least-privilege token exchange described earlier works cleanly in a single-tenant model where one enterprise identity maps to one set of permitted endpoints. In a multi-tenant harness, the identity layer has to enforce that Team A’s agent can only exchange tokens for endpoints registered under Team A’s tenant scope. Cross-tenant tool calls need to be impossible by construction, not by policy documentation that someone has to remember to follow.

The same principle extends to execution isolation. Pre-warmed sandbox pools should remain generic at the infrastructure layer, with tenant identity, volumes, and cryptographic keys hot-swapped in at the moment of execution. This keeps operational readiness decoupled from idle compute costs, and ensures that no tenant’s execution context bleeds into another’s even under concurrent load.

I’ll be direct: multi-tenancy is where the complexity centralization problem compounds most sharply. The patterns above are tractable for a single team. Adding tenancy boundaries across retrieval, auth, execution isolation, and observability is a meaningful additional engineering surface. Teams building toward enterprise platform use should treat multi-tenancy as a first-class design constraint from day one, not a feature to add later.

What Comes Next

The broad adoption of MCP as a packaging standard was a genuinely useful development. Getting the industry to agree on a common way to describe and expose agent tools is valuable infrastructure. But the protocol was never the destination. It’s a packaging format. The destination is an execution environment sophisticated enough to treat packaging as an implementation detail.

The teams I’ve seen pull ahead in agent infrastructure aren’t the ones who made the right protocol call in 2024. Most of them made the same messy choices everyone else did. What separated them was how quickly they made the transport question boring. Once the harness exists, the protocol debate becomes a one-afternoon integration task instead of a six-month architectural commitment. That’s the actual leverage: not the architecture itself, but the organizational freedom it creates. The engineers who were arguing about MCP vs. CLI are now building agent logic. That’s where the compounding starts.

The Builder, the Critic, and the Circuit Breaker: How I’d Design AI Agents That Don’t Bankrupt You


Most multi-agent architectures look elegant on the whiteboard. Agent A generates the output. Agent B judges it against a strict checklist. No AI grading its own homework, clean separation of concerns, autonomous iteration until the job is done.

Then you open the billing dashboard.

Your agents are locked in a loop, Agent A revising, Agent B rejecting, neither making meaningful progress, while your token costs compound by the minute. This failure mode has been documented across production agent deployments at companies ranging from early-stage startups to large enterprises, and it almost never shows up in the architecture review.

The system passes every test in a supervised environment. It breaks in a specific way when nobody is watching.

That distinction matters more than it might seem. We are in the middle of a fundamental shift in how AI is actually deployed, from synchronous AI, where a human waits for a response and can intervene at any point, to unattended autonomy, where agents run complex multi-turn workflows in the background, entirely out of sight. When a human isn’t watching the screen, agents can quietly go to war with each other. You end up funding both sides.

The design constraint that most teams miss isn’t the AI itself. It’s the discipline of using thin, deterministic code to cage non-deterministic intelligence, knowing when to pull the plug, and building the infrastructure to do that gracefully before the token burn compounds.

What follows is the architecture required to enforce boundaries on unattended autonomy, and the reasoning behind every decision.

The Two Failure Modes of Cognitive Deadlock

The two primary failure modes of multi-agent systems are near-opposites, and a naive fix for one reliably triggers the other, trapping your system in a Cognitive Deadlock.

Sycophancy is the first. AI models are naturally drawn to fluent, confident-sounding text. Agent B, your critic, will frequently approve Agent A’s output simply because it reads well on the surface, missing underlying logic errors, hallucinated facts, or reasoning gaps. You think you have a quality gate. You actually have a mutual appreciation society.

Token Churn is what happens when you try to fix that. You instruct Agent B to be harsher: find flaws, if you don’t find one you’ve failed. Now it rejects everything. Agent A revises but hits the ceiling of its own capability, returning a marginally different version. Agent B rejects again.

Software agents have no concept of time, urgency, or money. A human stuck in this loop would eventually say, “I don’t think we’re getting anywhere.” An autonomous agent will simply keep burning tokens indefinitely at your expense.

One nuance worth carrying into the architecture: these failure modes don’t always arrive sequentially. On open-ended analytical or creative tasks, both can coexist within the same loop, with Agent B approving weak outputs on some rubric criteria while rigidly rejecting on others. The failure landscape in practice is messier than a clean either/or, and the system needs to account for that from the start.

Why More AI Makes This Worse

The instinct when something breaks in an AI system is to add more AI. Another monitoring model. Another validation layer. Another agent to watch the agents.

This is almost always the wrong move. You’re solving an orchestration problem, which is fundamentally about enforcing hard boundaries, with a tool designed for open-ended reasoning. The result is compounding unpredictability at compounding cost.

The fix is simpler and older: a thin layer of traditional, deterministic code sitting above the models. It doesn’t reason or deliberate. It counts, measures, and cuts. Think of it less like a manager and more like a circuit breaker on an electrical panel: no understanding of what’s flowing through the system, but precise knowledge of when to shut it off.

This is Architectural Minimalism applied to agent systems: the sharpest possible line between what AI calculates and what traditional code enforces. The models handle open-ended reasoning. The orchestration layer holds the edges rigid. Three patterns form the core of that layer, each one a direct response to a failure mode that emerges without it.

Pattern One: The Hard Cap

The simplest and most important constraint: set a maximum iteration count and make it unconditional.

Four to five rounds is the absolute ceiling for enterprise workflows. If the loop hasn’t resolved by then, the agents aren’t converging on an answer; they’re producing token churn. The orchestrator ends the loop regardless of where things stand.

This feels blunt. That’s the point. The value of a hard cap isn’t intelligence, it’s unconditional enforcement. No amount of confident output from either agent can override a counter hitting five.

For engineers: The iteration counter must live in persistent state completely outside the model context. Agents have no visibility into it and cannot influence it. On each cycle, before invoking either model, the orchestrator checks the counter and raises a LoopLimitExceeded exception if the threshold is met. This state must survive process restarts; an in-memory counter that resets on an unhandled exception defeats the purpose entirely. Use your task queue’s native state store or a lightweight Redis key with a TTL set conservatively above your maximum expected loop duration. Tag every loop invocation with a correlation ID from initialization. You’ll need it for the observability layer, and retrofitting it later is painful.
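
Here is a minimal sketch of that counter. To keep it self-contained I use SQLite from the standard library as the persistent store; that is my substitution for the task queue state store or Redis key named above, not a recommendation. The IterationCounter class and the default of five rounds are assumptions for illustration.

```python
import sqlite3


class LoopLimitExceeded(Exception):
    pass


class IterationCounter:
    """Round counter keyed by correlation ID, stored outside process memory."""

    def __init__(self, db_path: str, max_rounds: int = 5) -> None:
        self._conn = sqlite3.connect(db_path)
        self._max = max_rounds
        with self._conn:
            self._conn.execute(
                "CREATE TABLE IF NOT EXISTS loops "
                "(correlation_id TEXT PRIMARY KEY, rounds INTEGER NOT NULL)"
            )

    def start_round(self, correlation_id: str) -> int:
        """Call before invoking either model. Returns the round about to run."""
        with self._conn:  # commits on success, rolls back on exception
            row = self._conn.execute(
                "SELECT rounds FROM loops WHERE correlation_id = ?",
                (correlation_id,),
            ).fetchone()
            rounds = row[0] if row else 0
            if rounds >= self._max:
                raise LoopLimitExceeded(
                    f"{correlation_id}: {rounds} rounds already used (cap {self._max})"
                )
            self._conn.execute(
                "INSERT OR REPLACE INTO loops (correlation_id, rounds) VALUES (?, ?)",
                (correlation_id, rounds + 1),
            )
            return rounds + 1
```

Because the count lives in the database file, a worker that crashes and restarts mid-loop picks up at the same round instead of silently starting over at zero.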

Pattern Two: Stagnation Detection

Agents frequently fall into token churn well before hitting the hard cap. Agent A stops making structural changes and starts rewriting surface prose: “furthermore” becomes “in addition,” and paragraphs get reorganized without changing substance. From the outside it looks like active progress. It isn’t. And it burns tokens at the same rate as genuine work.

The orchestrator catches this by measuring how much content actually changes between rounds. When the delta drops below a meaningful threshold, it recognizes the agent has exhausted its ideas and breaks the loop early, before the hard cap is reached.

The starting threshold sits at roughly 5%, derived from a straightforward observation: meaningful structural revision typically moves at least 15 to 20% of content. Token churn clusters near zero. The 5% line sits near the floor of the gap between them, which protects against slow structural leaks. That said, the right threshold depends on your output type, and the only honest way to calibrate it is to instrument first and tune from data.

For engineers: Choose your similarity metric carefully. Token overlap and normalized edit distance behave differently depending on output structure, and picking wrong silently breaks your stagnation detection. For open-ended prose, token overlap is more stable. For structured enterprise outputs, such as JSON schemas, legal clauses, or templated compliance documents, normalized edit distance is more sensitive to meaningful changes.
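
Both metrics are short enough to sketch. Two assumptions to flag: token overlap here is Jaccard similarity over whitespace tokens, and the edit-based metric uses difflib's similarity ratio from the standard library as a stand-in for a true normalized edit distance. The function names and the 0.05 default (the 5% starting threshold) are illustrative.

```python
import difflib
from typing import Callable


def token_overlap_delta(previous: str, current: str) -> float:
    """1 - Jaccard similarity over whitespace tokens. 0.0 means same vocabulary."""
    a, b = set(previous.split()), set(current.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def edit_delta(previous: str, current: str) -> float:
    """1 - difflib similarity ratio; a stdlib stand-in for normalized edit distance."""
    matcher = difflib.SequenceMatcher(None, previous, current, autojunk=False)
    return 1.0 - matcher.ratio()


def is_stagnant(previous: str, current: str, threshold: float = 0.05,
                metric: Callable[[str, str], float] = edit_delta) -> bool:
    """True when the change between rounds falls below the threshold."""
    return metric(previous, current) < threshold
```

Note that the two metrics are not on the same scale, so a threshold tuned for one should not be reused for the other without recalibrating.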

Calibration requires labeled examples: a set of round-pairs that a domain expert has manually classified as either genuine revision or surface churn. Fifty to one hundred labeled pairs is typically enough to validate your metric choice and threshold. Run both metrics against this set, compare precision and recall, and commit to whichever performs better on your specific output type. This step takes a day and saves weeks of silent miscalibration in production. A threshold that fires too early causes unnecessary rollbacks. One that never fires is invisible until you audit the bill, by which point the cost is already spent.
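
The calibration run itself is a small amount of code. This sketch assumes each labeled pair is a tuple of (previous draft, current draft, is_churn), where is_churn is the domain expert's judgment, and that a round is flagged as churn when the metric's delta falls below the threshold. precision_recall is a hypothetical helper name.

```python
from typing import Callable, Sequence

LabeledPair = tuple[str, str, bool]  # (previous, current, expert says it is churn)


def precision_recall(pairs: Sequence[LabeledPair],
                     metric: Callable[[str, str], float],
                     threshold: float) -> tuple[float, float]:
    """Precision and recall of 'delta below threshold' as a churn detector."""
    tp = fp = fn = 0
    for previous, current, is_churn in pairs:
        flagged = metric(previous, current) < threshold
        if flagged and is_churn:
            tp += 1          # correctly caught churn
        elif flagged:
            fp += 1          # genuine revision cut short
        elif is_churn:
            fn += 1          # churn that kept burning tokens
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Low precision is the threshold firing too early and causing unnecessary rollbacks. Low recall is the one that never fires until you audit the bill.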

Enterprise constraint: Running text-distance calculations on large outputs inside the orchestrator can introduce latency. Handle this efficiently inline for outputs under roughly 10,000 tokens, or offload to an async worker for longer documents.

Pattern Three: Temperature Stepping

Temperature controls how deterministic or exploratory a model’s output is. Low temperature produces focused, precise responses. High temperature introduces variance, and occasionally breaks a model out of a logical rut it cannot escape through refinement alone.

Rather than holding temperature constant, the orchestrator steps it based on loop position:

  • Rounds 1–2: Low temperature. Focused, precise output.
  • Round 3: The orchestrator injects a direct intervention prepended to the system message: “You have been rejected twice. Do not refine your previous approach, abandon it and try something structurally different.”
  • Round 4: Temperature is increased, forcing genuine exploration rather than iteration on a failing strategy.
  • Round 5: Hard cut.

The architecture mirrors a core tenet of classical optimization: a system stuck in a local minimum needs a controlled injection of variance to escape. That is the logic behind simulated annealing, and it applies meaningfully to language model behavior. A model stuck at low temperature on a problem it cannot solve will keep producing the same wrong answer with increasing confidence. The temperature bump is a last resort before human escalation, not guaranteed to work, but cheap enough to always be worth attempting.

For engineers: Set temperature at the API call layer, not within the prompt. The Round 3 system message injection should be prepended, not appended, since where an instruction sits in the prompt can affect how much weight the model gives it. Log temperature values alongside each draft in your run record; when debugging a failed loop, temperature trajectory is often the fastest signal for distinguishing genuine exploration from a stuck model.
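
The round schedule above reduces to a small pure function. The specific temperature values (0.2 and 0.9) are placeholders I picked for illustration, not recommendations, and I assume the intervention text applies to Round 3 only, since the schedule does not say whether it persists. RoundPlan, plan_round, and build_system_message are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

INTERVENTION = (
    "You have been rejected twice. Do not refine your previous approach, "
    "abandon it and try something structurally different."
)


@dataclass(frozen=True)
class RoundPlan:
    temperature: float        # passed as an API parameter, never written into the prompt
    system_prefix: str = ""   # text to prepend to the system message, if any


def plan_round(round_number: int, low: float = 0.2,
               high: float = 0.9) -> Optional[RoundPlan]:
    """Return the plan for a round, or None when the loop must be cut."""
    if round_number < 1:
        raise ValueError("rounds are numbered from 1")
    if round_number <= 2:
        return RoundPlan(low)
    if round_number == 3:
        return RoundPlan(low, INTERVENTION)
    if round_number == 4:
        return RoundPlan(high)
    return None  # round 5 and beyond: hard cut


def build_system_message(base: str, plan: RoundPlan) -> str:
    # Prepend, not append, per the note above.
    return f"{plan.system_prefix}\n\n{base}" if plan.system_prefix else base
```

Keeping this as deterministic code outside the models means the schedule can be logged, tested, and changed without touching a prompt.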

Enterprise constraint: Several corporate API gateways restrict per-call temperature adjustments. If that’s your environment, substitute Prompt Hardening: at Round 3, replace the system prompt entirely, switching from an open-ended coaching prompt to an aggressive few-shot prompt that mandates strict structural compliance. The mechanism differs; the intent is identical.

Instrumentation: The Operational Audit Trail

Treat observability as a first-class design concern, not something to retrofit after the first production incident.

Without it, the circuit breaker layer is a black box. You know it fired. You don’t know which pattern triggered it, how often, on what input types, or whether your thresholds are correctly calibrated. You cannot improve what you cannot see, and in an unattended system, you may not notice what’s wrong until the bill arrives.

Log the following for every loop run: correlation ID, total iterations, exit reason (hard cap, stagnation, success, or human escalation), delta score per round, Agent B rubric result per round, temperature per round, and whether a best-effort draft was delivered or a handoff was triggered.

These logs should aggregate into a simple operational dashboard displaying iteration count distribution, circuit break frequency by type, rollback rate, and handoff rate. Within a few hundred runs you’ll have enough signal to tune thresholds from data rather than intuition. This log also becomes your audit trail when a stakeholder asks why a specific high-value output was delivered as a partial draft rather than a completed one.
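One possible shape for that run record and the dashboard roll-up is sketched below. The class and field names (LoopRunRecord, RoundLog, summarize) are assumptions made for illustration, and "rollback rate" is interpreted here as the share of runs that delivered a best-effort draft; adjust that definition to match your own.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RoundLog:
    delta_score: float    # change measured against the previous draft
    rubric_score: float   # Agent B result for this round, 0.0 to 1.0
    temperature: float    # value actually sent on the API call


@dataclass
class LoopRunRecord:
    correlation_id: str
    exit_reason: str      # "hard_cap", "stagnation", "success", "human_escalation"
    rounds: List[RoundLog] = field(default_factory=list)
    best_effort_delivered: bool = False
    handoff_triggered: bool = False

    @property
    def total_iterations(self) -> int:
        return len(self.rounds)


def summarize(records: List[LoopRunRecord]) -> Dict[str, object]:
    """Aggregate run records into the four dashboard metrics."""
    total = len(records)

    def rate(flags) -> float:
        return sum(flags) / total if total else 0.0

    return {
        "iteration_distribution": dict(Counter(r.total_iterations for r in records)),
        "exit_reasons": dict(Counter(r.exit_reason for r in records)),
        "rollback_rate": rate(r.best_effort_delivered for r in records),
        "handoff_rate": rate(r.handoff_triggered for r in records),
    }
```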

When the Circuit Breaks: Delivering Something Useful

Stopping the loop prevents runaway cost. But there’s still a user waiting for a result, and “the AI gave up” is not a product experience. Graceful failure is an engineering requirement, not an afterthought.

Save the best draft, not the last one. The orchestrator caches every draft alongside the structured rubric score Agent B assigned it. If Round 2 produced a draft passing 85% of quality criteria and later rounds failed to improve on it, the system rolls back and delivers Round 2. The user gets an imperfect but usable result instead of an error screen. In most business contexts, 85% complete is a workable starting point, not a failure.

That percentage is only meaningful if the rubric behind it is well defined. Agent B’s checklist needs explicit, binary criteria: not “is this good?” but “does this section contain a pricing breakdown?” Yes or no. The score is passed criteria divided by total criteria.

One example of a criterion that looks binary but isn’t: “is the tone professional?” That’s a judgment call dressed as a checkbox. Two evaluations of the same document will produce different scores, which corrupts your rollback logic and makes your quality threshold meaningless. The test for a valid rubric criterion is whether two different reviewers, given the same document, would independently reach the same answer. If they wouldn’t, decompose the criterion until they would, before it touches production.
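The scoring and rollback rules are simple enough to show directly. This is a sketch under assumptions: ScoredDraft and best_draft are hypothetical names, the criteria are stored as a plain mapping from criterion text to pass or fail, and ties are broken in favor of the earlier round, which is a choice the article does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredDraft:
    round_number: int
    text: str
    criteria: Dict[str, bool]  # binary rubric results from Agent B

    @property
    def score(self) -> float:
        """Passed criteria divided by total criteria."""
        if not self.criteria:
            return 0.0
        return sum(self.criteria.values()) / len(self.criteria)


def best_draft(
    drafts: List[ScoredDraft], minimum_score: float = 0.0
) -> Optional[ScoredDraft]:
    """Return the highest-scoring cached draft, or None if none clears the bar."""
    eligible = [d for d in drafts if d.score >= minimum_score]
    if not eligible:
        return None
    # Highest score wins; on a tie, prefer the earlier round (assumption).
    return max(eligible, key=lambda d: (d.score, -d.round_number))
```

Because every criterion is a boolean, the score is reproducible: the same rubric results always yield the same rollback decision.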

Communicate failures in human terms. If no draft clears the minimum threshold, the response should be specific and actionable: “We generated 80% of your proposal but encountered a conflict in the pricing section. The draft has been saved and flagged for review.” Structure this as a typed failure object that downstream systems can parse and route, not a string that gets logged and forgotten.

Hand off to a human with full context. When the loop breaks, the orchestrator packages the original prompt, the highest-scoring draft, the specific failed rubric criteria from Agent B, and the execution trace. A human reviewer sees precisely where the AI got stuck, fixes that specific gap, and approves. Targeted human judgment applied at the exact point of failure, not a restart from scratch.
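A typed failure object that doubles as the handoff package might look like the following. The names LoopFailure and FailureKind and the exact field set are assumptions for illustration; the fields mirror the items listed above (original prompt, highest-scoring draft, failed rubric criteria, execution trace) plus the human-readable message.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List


class FailureKind(Enum):
    BELOW_THRESHOLD = "below_threshold"
    HARD_CAP = "hard_cap"
    STAGNATION = "stagnation"


@dataclass
class LoopFailure:
    correlation_id: str
    kind: FailureKind
    user_message: str            # specific, actionable, human-readable
    original_prompt: str
    best_draft: str              # highest-scoring draft, not the last one
    best_score: float
    failed_criteria: List[str]   # the exact rubric items Agent B rejected
    execution_trace: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize for downstream systems that parse and route failures."""
        payload = asdict(self)
        payload["kind"] = self.kind.value  # enums are not JSON-serializable as-is
        return json.dumps(payload, sort_keys=True)
```

A reviewer opening this object sees the failed criteria first and can go straight to the gap instead of rereading the whole draft.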

The Enterprise Agent Paradox

This handoff introduces the sharpest organizational challenge in agentic system design, and it’s one the architecture alone cannot solve.

If an autonomous agent circuit-breaks on a high-value client document, who gets the notification? What is their SLA to respond? If a human reviewer takes four hours to log in, review, and patch the 15% gap the AI couldn’t close, has your unattended autonomy actually saved the enterprise any time, or did it just shift the operational friction downstream?

The bottleneck of an agentic system is rarely the AI. It is your human routing architecture.

The practical resolution is Omnichannel Exception Routing: do not build custom review interfaces from scratch. Export standard schemas, such as OpenTelemetry events or typed webhooks, that inject circuit-break exceptions directly into your existing enterprise ticketing infrastructure, like ServiceNow, Jira, or whatever queue your team already monitors. The human side of the handoff needs to live where human attention already lives, governed by SLAs that already exist.
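As an illustration of a typed webhook, the sketch below builds a circuit-break event with a response deadline attached. Everything about the schema is hypothetical: the event type string, the field names, and the SLA hours are placeholders, and the mapping into ServiceNow, Jira, or another queue is deliberately left out because it depends on how your ticketing system accepts inbound events.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

# Hypothetical response-time targets; use the SLAs your queue already enforces.
SLA_HOURS = {"high": 1, "normal": 8}


@dataclass(frozen=True)
class CircuitBreakEvent:
    event_type: str
    correlation_id: str
    exit_reason: str
    failed_criteria: Tuple[str, ...]
    priority: str
    respond_by: str  # ISO 8601 timestamp


def build_event(
    correlation_id: str,
    exit_reason: str,
    failed_criteria: List[str],
    high_value: bool,
    now: Optional[datetime] = None,
) -> CircuitBreakEvent:
    """Create the event a webhook would post to the existing ticket queue."""
    now = now or datetime.now(timezone.utc)
    priority = "high" if high_value else "normal"
    respond_by = now + timedelta(hours=SLA_HOURS[priority])
    return CircuitBreakEvent(
        event_type="agent.circuit_break",
        correlation_id=correlation_id,
        exit_reason=exit_reason,
        failed_criteria=tuple(failed_criteria),
        priority=priority,
        respond_by=respond_by.isoformat(),
    )


def to_webhook_body(event: CircuitBreakEvent) -> str:
    """Serialize the event as a JSON request body."""
    return json.dumps(asdict(event), sort_keys=True)
```

Carrying the deadline in the payload answers the “who responds, and by when” question at the moment the exception is created rather than after the ticket has aged.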

Building a parallel review system is an organizational change management problem disguised as an engineering task. Teams that treat it as purely technical consistently underestimate the timeline by a factor of three. Plan for that gap explicitly; the technical handoff can be built in a day, but getting the human side right typically takes weeks and involves stakeholders well outside the engineering team.

The Cost Case for Building This Upfront

A loop running to the hard cap consumes roughly five to ten times the tokens of a successful two-round completion on the same task. At scale, with hundreds or thousands of daily agent invocations, uncontrolled loops don’t just create poor user experiences. They create billing events that compound unnoticed until the monthly statement arrives.

The orchestration layer described here adds minimal compute overhead. The logic is deterministic code running entirely outside the model context. The investment is engineering time, specifically days to weeks to build and calibrate. The alternative is discovering your failure thresholds the hard way, at production volume, on your cloud bill.

Build the circuit breaker before you need it. By the time you need it, you’re already paying for not having it.

The Principle Underneath the Architecture

Every instinct in agent system design pulls toward more intelligence: a smarter critic, a more capable builder, another layer of AI judgment applied wherever the last layer fell short. That instinct is understandable and almost always counterproductive.

The pattern that holds up under pressure is a strict division of labor: AI handles the open-ended reasoning, traditional code enforces the boundaries, existing human workflows handle the exceptions. Not as a fallback for a broken system, but as a deliberate architectural choice in a working one.

There’s a counterintuitive implication worth carrying forward: as AI agents become more capable, the orchestration layer becomes more important, not less. A more capable agent running in an uncontrolled loop causes more damage, faster, at greater cost than a weaker one. The sophistication of the AI and the rigidity of its constraints need to scale together. Every increase in autonomy is an argument for a more robust circuit breaker, not for removing it.

Autonomy without boundaries isn’t architectural minimalism. It’s just risk with a better interface.

Build the builder. Build the critic. Make sure something that doesn’t think is always in charge of knowing when to stop.

Your AI Stack Has a Geopolitical Risk. Your Board Doesn’t Know It Yet.



In March 2022, the U.S. Office of Foreign Assets Control imposed sweeping sanctions on Russian entities following the invasion of Ukraine. Within 72 hours, Microsoft Azure, AWS, and Google Cloud began cutting off services to affected Russian customers. Businesses that had built mission-critical workflows on those platforms discovered, at speed, that a U.S. government directive could reach inside their operations regardless of where they were headquartered or where their data sat.


That moment was a warning. Most enterprise boards filed it under “geopolitical tail risk” and moved on.
The same logic has now arrived at the AI layer, and the dependencies are deeper, the warning period shorter.


We already have a preview. When Italy’s data protection authority banned ChatGPT over privacy concerns, local businesses relying on it for operational workflows lost access overnight. The Rome Court ultimately annulled the subsequent 15 million euro fine in March 2026, but it did so on a single jurisdictional point: once OpenAI established its Irish subsidiary, the Irish Data Protection Commission became the lead supervisory authority, stripping the Italian regulator of its right to issue a final sanction. The court never examined whether the underlying data practices complied with GDPR. 

Boards should not mistake a jurisdictional escape hatch for an operational green light. The real lesson was the speed of the initial disruption. One regulatory decision, and a core enterprise tool was gone with zero advance notice and zero transition period.


Italy was an early flashpoint. The regulatory landscape has since shifted structurally. The EU AI Act’s high-risk obligations are in the final stages of legislative revision, with a new enforcement deadline of December 2027 agreed in May 2026, a postponement that reflects political complexity rather than reduced intent. In the U.S., the export control framework administered by the Bureau of Industry and Security (BIS) is tightening. The question is no longer whether a regulatory action will disrupt your AI operations. It is when.

The Dependency You Probably Haven’t Stress-Tested

Ask your CTO a simple question: if your primary frontier AI model provider became unavailable for 30 days due to a regulatory action, an export control directive, or a government-mandated review, what would break, and how quickly?


For most enterprises, the honest answer is deeply uncomfortable. Over the last 18 months, AI has moved far beyond experimental chatbots. It has been woven into autonomous, multi-agent workflows that run core operational pipelines: customer service execution, automated contract analysis, code generation, financial modelling, and compliance screening. When these integrated systems run on a single model, a vendor blackout does not just stall a user query; it halts the automated engine of the business.


Unlike a SaaS CRM or a cloud storage provider, frontier AI models are not commodities. They are concentrated in a handful of U.S.-headquartered companies (Anthropic, OpenAI, Google DeepMind) whose foundational IP and cloud infrastructure are now explicitly subject to tightening U.S. export controls, national security reviews, and data retention mandates that may directly conflict with local privacy regulation. GDPR is only the most obvious example. India’s DPDP Act, Brazil’s LGPD, and the EU AI Act’s transparency requirements all create potential collision points with U.S. vendor terms of service.


Your board does not need to understand transformer architecture. It needs to understand that treating frontier AI as a politically neutral utility, the way you might treat electricity or broadband, is now a critical governance error.

The Strategy: Sovereignty and Hedging

Navigating this requires moving the conversation out of the engineering backlog and into the boardroom, focusing on three strategic pivots.

1. Mandate a Hybrid Model Architecture

The open-weight vs. closed-source debate is no longer an engineering preference; it is a sovereignty conversation. Models like Meta’s Llama series or Mistral can be self-hosted within your own infrastructure perimeter, giving you operational custody and insulation from a foreign vendor’s sudden API kill-switches, executive orders, or unilateral changes to data retention policies.

The right architecture is a tiered model: closed frontier systems reserved strictly for high-stakes, hyper-complex reasoning tasks where capability genuinely justifies the concentration risk; open-weight models running in your own environment for core operational workflows where availability and data sovereignty matter more than the last percentage point of benchmark performance.
The board must demand clear accountability: who owns the decision about which corporate workflows are allowed to tolerate external model dependency, and what is the review cycle?

2. Implement an Independent Orchestration Layer

CFOs do not leave a company’s currency exposure unhedged on the grounds that exchange rates are probably fine. The same discipline should apply to model provider exposure. An intelligent orchestration layer, or model router, must sit between your applications and your model providers. If a primary provider goes offline or changes its terms in ways that conflict with local regulation, the router redirects traffic to a secondary provider or a locally hosted model automatically.

The parallel to treasury is precise: you are not predicting that a provider will fail; you are ensuring that if it does, your operations survive. Do not expect the frontier labs to build this for you. Their business model relies on maximising your consumption of their flagship compute, and they lack your specific business context to route effectively.


This architecture requires planning for graceful degradation. In practice, this means having a fallback ready before you need it. If your primary frontier model goes dark, your orchestration layer must route workflows to a localized, self-hosted model that can securely handle the baseline transaction, keeping core operations running even if advanced reasoning is temporarily unavailable. The cost of building this independent routing layer is a fraction of the operational cost of a 48-hour AI outage across a large enterprise.
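The core of such a router is an ordered fallback chain. The sketch below is deliberately minimal and makes several assumptions: route and ProviderUnavailable are hypothetical names, each provider is represented as a plain callable rather than a real vendor SDK, and only explicit unavailability triggers failover. A production router would also need health checks, timeouts, and per-workflow policy on which fallbacks are acceptable.

```python
from typing import Callable, List, Tuple


class ProviderUnavailable(Exception):
    """Raised by a provider adapter when the provider cannot serve requests."""


# (name, callable) pairs in priority order, e.g. frontier API first,
# self-hosted open-weight model last.
Provider = Tuple[str, Callable[[str], str]]


def route(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers unavailable: " + "; ".join(errors))
```

Returning the provider name with the response matters for graceful degradation: downstream code can tell when a result came from the baseline model and flag work that needs the frontier tier once it returns.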

3. Map the Fragmented Global Risk Profile

The exposure is not uniform, and a global business must audit its risk based on where its delivery stacks actually sit.


For enterprises with China operations or Chinese ownership structures: This is the category most likely to surface a legal exposure your board does not know it has. U.S. frontier AI models are simply unavailable in mainland China. OpenAI cut off API access in July 2024 following U.S. Treasury restrictions on technology investment flows into China. The risk for global enterprises runs deeper than geography: Anthropic updated its terms of service in September 2025 to prohibit access for any entity more than 50% owned by a company headquartered in a restricted region, regardless of where that entity actually operates. A joint venture with a Chinese majority shareholder, incorporated and operating in Singapore or the UAE, may already be outside the terms of your AI vendor contracts. This is a legal and compliance exposure that needs to be audited now, at the entity level, across your full ownership structure.


For enterprises with significant India operations: India has become the execution layer for global business process automation and autonomous agent deployment. Building those stacks entirely on U.S.-centric closed models imports downstream regulatory risk into every client delivery. Navigating this requires a dual-track strategy. While enterprises must continue to leverage established global models for current production baselines, they must simultaneously fund parallel validation tracks for sovereign alternatives. India’s BharatGen Param2, a 17-billion-parameter mixture-of-experts model trained on 22 trillion tokens of multilingual data using government-backed indigenous compute infrastructure, is evidence that open-weight alternatives are ready for enterprise testing. The mandate for boards is not an immediate shutdown of current APIs, but the funding of shadow testing environments to ensure long-term architectural flexibility.


For U.S.-headquartered enterprises: The regulatory line has been drawn at the computational threshold of 10^26 floating-point operations, the statutory boundary establishing a system as a frontier model under California’s Transparency in Frontier Artificial Intelligence Act (SB 53). Developers generating more than 500 million dollars in annual gross revenue face the most intensive obligations under the Act: they must publish annual catastrophic risk frameworks and report critical safety incidents to state emergency agencies within 15 days of discovery, shortened to 24 hours if the incident poses an imminent risk of death or serious physical injury. Violations are enforced by the California Attorney General and carry civil penalties of up to one million dollars per violation. An enforcement action by the California Attorney General against a primary lab could trigger an immediate operational blackout for any single-sourced enterprise. And if your organization also holds federal contracts or operates in regulated markets, that upstream compliance failure will bleed directly into your own legal and audit risk profiles overnight.


For European enterprises: The political agreement reached in May 2026 to defer the EU AI Act’s high-risk obligations to December 2027 gives enterprises more runway, but it does not change the architecture decision. The data retention and monitoring policies that U.S. AI vendors operate under remain on a collision course with what European regulation will ultimately require. Using the delay to build compliance-ready infrastructure is the opportunity; treating it as a signal to stand down is the mistake. Domestic alternatives, Mistral and Aleph Alpha, are not inferior substitutes. They are the only providers whose architecture is designed from the ground up to operate within European regulatory constraints.

What Should Be on the Next Board Agenda

Three governance actions, each with a clear executive owner:

  • Commission an AI dependency audit. Map every workflow that touches an external model provider, classify each by operational criticality, and calculate what a 30-day outage would cost. This risk quantification must produce a concrete number the board can act on.
  • Assign explicit ownership. Move AI vendor risk onto the enterprise risk register with a named executive owner, likely the CTO or CISO, and a defined quarterly review cadence. If it currently lives nowhere, that gap is itself a governance finding.
  • Establish a sovereignty threshold. Define what proportion of core operational workflows must run on infrastructure your organisation controls directly, and set a hard timeline for reaching it. This is a strategic policy decision that belongs in the boardroom, not buried in an engineering backlog.
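For the team preparing those numbers, the arithmetic behind the audit and the threshold is straightforward. The sketch below assumes a hypothetical workflow inventory in which each entry carries an estimated daily outage cost; the names Workflow, outage_cost, and sovereignty_share are illustrative, and the cost estimates themselves have to come from your own finance and operations data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workflow:
    name: str
    critical: bool            # core operational workflow?
    self_hosted: bool         # runs on infrastructure the organisation controls
    daily_outage_cost: float  # estimated cost per day if the model is unavailable


def outage_cost(workflows: List[Workflow], days: int = 30) -> float:
    """Total cost of an external-provider outage lasting `days` days."""
    # Self-hosted workflows are assumed unaffected by a vendor blackout.
    return sum(w.daily_outage_cost * days for w in workflows if not w.self_hosted)


def sovereignty_share(workflows: List[Workflow]) -> float:
    """Proportion of critical workflows running on controlled infrastructure."""
    critical = [w for w in workflows if w.critical]
    if not critical:
        return 0.0
    return sum(w.self_hosted for w in critical) / len(critical)
```

Comparing sovereignty_share against the board’s chosen threshold each quarter turns the policy into a tracked metric rather than a one-time statement of intent.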

AI is core corporate infrastructure, as consequential to your operational continuity as your ERP or your payments stack. Boards that set a sovereignty threshold now, before an enforcement action forces it, will find that it costs far less to build the architecture than to explain why they didn’t.