The Second-Order AI Trade


I’ve been thinking about one of the more interesting disconnects in public markets today: how investors are valuing IT services and BPO companies in the age of AI. The more I look at it, the more I think the market is asking the right question and arriving at the wrong conclusion.

The question is straightforward. If AI automates knowledge work, what happens to companies whose business model has historically depended on selling that work?

The conclusion the market seems to have reached is equally straightforward. AI replaces labor. IT services sell labor. Therefore IT services become less valuable over time.

There is truth in that argument. AI is already changing how software gets written, how applications are tested, how documentation is produced, and how support teams operate. Anyone who has watched a capable engineer work with modern AI tools knows this isn’t marketing hype anymore. Productivity gains are real and they are compounding.

But I think that line of reasoning stops one step too early.

It captures the first-order effect of AI, which is labor automation. What it misses is what that automation enables at a system level. That is where I think the more interesting question lives.

The work being automated is not the same as the capability being sold

There is certainly a future where enterprises need fewer developers, fewer testers, and fewer support engineers. Companies whose only competitive advantage is supplying lower-cost labor should probably be worried.

Not every services company falls into that category.

The best enterprise services firms have never been valuable because they employ thousands of engineers. They are valuable because they understand environments that outsiders struggle to navigate.

They know which undocumented application still produces a nightly file that another critical system quietly depends on. They know that touching one integration point can break five downstream processes that nobody has looked at in years. They know that the CIO wants transformation, the CFO wants predictability, and the compliance team wants stability unless every audit requirement has been addressed first.

That is not just technical knowledge. It is organizational knowledge accumulated over years of operating inside a client’s environment. It is difficult to capture in a proposal document and even harder to encode into a model.

I am aware this argument has a shelf life. Tools are already emerging that attempt to map legacy dependencies and capture institutional context automatically. Over a long enough horizon they will likely succeed in parts of it. But enterprise transformation has never moved on the timeline that technology capability would suggest, and the window in which this knowledge remains economically valuable is probably longer than the market is currently pricing.

Enterprise complexity has a way of surviving every technology cycle

For years, CIOs, particularly in financial services, have been describing the same objective: retire the mainframe and move everything to the cloud.

Many have made substantial progress. Many are still running critical workloads on systems that were expected to disappear years ago.

This is not because enterprises resist innovation. It is because replacing core operational systems is rarely a technology problem alone.

Every major system sits inside a web of operational dependencies, regulatory obligations, contractual commitments, security controls, and organizational habits. Changing one component often requires changing dozens of others that were never part of the original plan.

AI does not remove that complexity. In many cases it adds another layer to it.

Who is accountable when an AI system makes a recommendation that turns out to be wrong? How should that output be audited? Which data can legally be processed in which jurisdictions? How do organizations monitor hundreds of AI-enabled workflows without introducing entirely new operational risks?

Those questions are becoming more important, not less.

The microservices lesson is worth remembering

When microservices became the dominant architecture for enterprise software, the promise was compelling. Smaller services. Faster releases. Independent teams. Greater flexibility.

Those benefits were real. So were the unintended consequences.

Entire categories of software companies emerged to solve problems that microservices themselves created. Observability platforms, distributed tracing, platform engineering, service meshes, and reliability engineering all became important because managing hundreds of services turned out to be substantially harder than managing one large application.

Automation did not eliminate operational work. It changed where the work lived. AI may follow a similar path.

As the cost of automating business processes falls, organizations are unlikely to automate fewer processes. They are more likely to automate many more. This is the Jevons paradox applied to enterprise automation, and I don’t think we should dismiss it casually. The difference between enterprise IT and markets where automation genuinely collapsed demand is that the underlying problem space keeps expanding. Travel booking was a fixed market. Enterprise technology complexity is not.

An enterprise managing 500 AI-enabled workflows has a very different operational challenge from one managing 50, even if each individual workflow becomes cheaper to build.

Somebody still has to integrate those systems, monitor them, govern them, secure them, and continuously improve them.

Where I think investors are making the real mistake

The label “IT services” has become too broad to be useful. It groups together companies with fundamentally different trajectories.

Some firms are still competing primarily on labor cost and billing by the hour. Those businesses face genuine structural pressure. If clients need fewer hours and AI keeps driving that number down, there is no natural floor. CIOs will demand those productivity gains come back to them as rate reductions, and they will largely be right to do so.

I should be clear that the shift to outcome-based pricing is not inevitable and not frictionless. Enterprise procurement teams have been buying hours for decades, and outcome-based contracts transfer risk in ways that make buyers cautious. This transition has been discussed for years and has historically moved slowly. What is different now is that pressure on unit economics may become strong enough that both sides are pushed toward new structures. That does not imply speed. It implies necessity over time.

Others are quietly becoming something different.

They are automating their own delivery so repetitive work requires fewer people. They are building repeatable approaches for AI governance, implementation, and compliance. They are accumulating experience from dozens of enterprise AI deployments that clients cannot easily replicate internally.

Most importantly, they are shifting from selling effort to selling outcomes.

That distinction changes the economics in a very specific way. If you charge for hours worked, AI simply reduces the hours available to bill. But if you charge for a business outcome and AI allows that outcome to be delivered with fewer people, revenue stays stable while delivery costs fall. The firm that used to need fifty engineers to fulfill a contract might now need fifteen. The contract value does not change. The margin does.

Not every management team will navigate this well. Some will automate delivery and pass the savings straight through to clients in the form of lower prices, which improves competitiveness but leaves the underlying economics unchanged. The firms worth watching are those that automate delivery and reprice the work at the same time.

From a distance, these two groups still look similar. Over the next several years, I suspect their financial performance will not.

The obvious AI trade has been building intelligence. The less obvious trade may be building the organizations that make intelligence usable inside large enterprises. I suspect the market is still treating those as the same thing.

As usual, these are strictly my personal views.

Stop Choosing Between MCP and CLIs. Build a Harness That Makes Transport Boring.


Most agent infrastructure debates follow a familiar pattern. Teams argue about the protocol, commit to a direction, build toward it for six months, and then discover the real problem was never the protocol at all.

The MCP vs. CLI argument is exactly that debate. I’ve watched it play out at enough teams now to be confident that the ones losing the most time are not losing it because they picked wrong. They’re losing it because both choices, pushed past local iteration and into production platform infrastructure, hit structural walls that a different protocol choice wouldn’t have prevented. The walls are just in different places.

The question worth asking is not which transport to standardize on. It’s why you’re building systems where the transport choice matters to the agent at all.

How We Got Here

When MCP launched, there were genuine reasons to pay attention. Structured schemas, explicit tool definitions, auth flows that could survive a security review. For teams building in regulated environments it gave the governance layer something real to hold onto.

The early experience of actually running it was rough. Client restarts on configuration changes. Custom JSON-RPC server scaffolding for every capability a team wanted to expose. Debugging tooling that barely existed. For developers trying to move fast on local machines, the overhead was hard to justify when the alternative was handing an agent a bash tool and getting immediate results.

So the pendulum swung. Claude Code and Cursor made raw terminal access feel natural, and a new default emerged: give the agent a shell and stay out of the way. For local prototyping it works well. Then teams try to take it to production.

The CLI path has two failure modes that don’t show up until scale. The first is infrastructure cost. Maintaining stateful container sessions for operations that are fundamentally stateless API calls (reading a document, querying a SaaS platform, triggering a pipeline) is heavy engineering for what you’re getting. The second is the attack surface. Indirect prompt injection is not a theoretical concern. It’s a documented attack class where malicious content in a webpage, a file, or an email body manipulates an agent into executing destructive shell commands. Raw terminal access gives that attack vector a very large target.

The MCP path has its own failure mode at scale. Maintaining separate server processes for every capability, including the ones that are thin wrappers around a single REST API, creates operational overhead that compounds as the integration catalog grows. Teams that went all-in on MCP describe the same symptom: more engineering time on integration scaffolding than on the agent logic that was supposed to be the point.

The underlying reality of both paths is the same. Most CLIs are wrapping APIs. Most MCP servers are proxying APIs. The data is identical. The actions are identical. Only the transport layer differs. Once you see that clearly, the debate collapses. You don’t need to pick the right transport. You need an orchestration layer that treats transport as an implementation detail.

What the Harness Needs to Do

The concept is simple to state. Your orchestration layer accepts any capability format, normalizes it, and hands the model a unified tool catalog. MCP server, OpenAPI spec, GraphQL endpoint, REST API, shell script: the harness ingests all of it. The agent never knows or cares what’s executing underneath.

Building it well is a serious engineering investment, and I want to be honest about that because most writing on this topic glosses over the cost. If your team is early-stage and running a handful of integrations, the wrong move is to build this now. Use whatever is fastest. But if you are building foundational enterprise infrastructure, this is not a premature optimization. Even if frontier model context windows become effectively infinite or native tool execution improves, a regulated enterprise can never allow an LLM to directly execute unaudited, non-rate-limited system calls. This abstraction layer isn’t for the model. It’s for the engineers who have to govern the platform. You pay for the abstraction either way. The question is whether you design it or inherit it.

There are five concrete engineering problems the harness has to solve. Each one has a failure mode that bites in production and doesn’t announce itself clearly in a local development environment.

The Tool RAG Problem

The naive approach to agent tooling loads every available tool definition into the context window at startup. At small scale this is fine. Once you’re running dozens of integrations with hundreds of tool definitions, it breaks reasoning and burns tokens before the agent has done anything useful.

The standard fix is dynamic tool retrieval: maintain a semantic index of all registered tools and inject only the definitions relevant to the current task. The agent working a code review task gets GitHub and Jira tool schemas. The same agent answering a question about customer health gets Salesforce and analytics schemas. Same harness, radically different effective tool surface.

The problem with naive semantic retrieval is embedding collapse. Tool schemas are short, highly technical, and domain-specific. Vector embeddings of a custom internal shell script named verify-cluster-state.sh and a standard cloud health check tool may land close enough in embedding space that retrieval pulls the wrong one. The agent calls a tool that isn’t there, or worse, calls the wrong one silently.

The harness needs a two-tier routing system. Deterministic, namespace-based matching handles explicit tool references first and bypasses retrieval entirely. Fuzzy semantic retrieval handles everything else, but that vector index needs to be fed by more than just schema text. Historical tool-calling telemetry, runtime system state, and synthetic intent variations all need to flow into the index to make retrieval reliable on the long tail of real user queries. Building the index is one engineering task. Keeping it calibrated is an ongoing operational one.
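A minimal sketch of the two-tier routing described above may make the control flow concrete. Everything here is an assumption for illustration: the catalog entries, the `namespace.tool` naming convention, and the bag-of-words cosine similarity, which only stands in for a real embedding model and vector index.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical catalog: namespaced tool name -> short description.
CATALOG = {
    "github.create_pull_request": "open a pull request for code review",
    "jira.create_issue": "create a ticket to track a bug or task",
    "infra.verify_cluster_state": "internal script that checks cluster node health",
}

# Assumed convention: explicit references look like `namespace.tool_name`.
NAMESPACED = re.compile(r"\b[a-z0-9_]+\.[a-z0-9_]+\b")


def _vector(text: str) -> Counter:
    # Bag-of-words stand-in for a real embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def route(intent: str, k: int = 2) -> list[str]:
    # Tier 1: explicit namespaced references bypass retrieval entirely.
    explicit = [name for name in NAMESPACED.findall(intent.lower()) if name in CATALOG]
    if explicit:
        return explicit
    # Tier 2: fuzzy retrieval over tool descriptions.
    query = _vector(intent)
    ranked = sorted(CATALOG, key=lambda name: _cosine(query, _vector(CATALOG[name])), reverse=True)
    return ranked[:k]
```

The point of the ordering is that an opaque internal tool is reachable by name even when its description embeds poorly. The fuzzy tier only runs when no explicit reference is present.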

For engineers: The fastest way to detect embedding collapse in your tool index is to build an offline eval harness before you go to production. Take a sample of real or synthetic agent intents, run them through retrieval, and manually verify the results. The right k to measure depends on your injection strategy: if you’re doing single-tool selection, precision@1 is what matters; if you’re injecting a ranked set for the model to choose from, measure precision@3. Either way, pay specific attention to internal tools with opaque names. If the correct tool isn’t ranking at the top for your most common intent patterns, you need namespace-based fallback routing before you ship. Retrofitting it after the first production mismatch is significantly more painful.
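As a rough sketch of that offline eval, assuming each intent has exactly one correct tool (so the metric reduces to whether that tool appears in the top k results), something like the following is enough to start. The `retrieve` callable and the sample cases are hypothetical placeholders for your own retrieval function and labeled intents.

```python
from typing import Callable, Sequence


def hit_rate_at_k(
    cases: Sequence[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> tuple[float, list[tuple[str, str, list[str]]]]:
    """Return the fraction of intents whose expected tool is in the top k, plus the misses."""
    if not cases:
        raise ValueError("need at least one labeled intent")
    hits = 0
    misses = []
    for intent, expected in cases:
        ranked = retrieve(intent)[:k]
        if expected in ranked:
            hits += 1
        else:
            # Keep the full miss so it can be reviewed by hand.
            misses.append((intent, expected, ranked))
    return hits / len(cases), misses


# Hypothetical labeled intents and a canned retriever standing in for the real index.
CASES = [
    ("check cluster", "infra.verify_cluster_state"),
    ("open pr", "github.create_pull_request"),
]
CANNED = {
    "check cluster": ["cloud.health_check", "infra.verify_cluster_state"],
    "open pr": ["github.create_pull_request"],
}

rate_at_1, misses_at_1 = hit_rate_at_k(CASES, lambda intent: CANNED[intent], k=1)
rate_at_3, misses_at_3 = hit_rate_at_k(CASES, lambda intent: CANNED[intent], k=3)
```

In this toy data the internal tool loses the top slot to a generic health check, which is exactly the pattern worth catching before production.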

The Distributed Shell Memory Problem

Treating CLI execution as a stateless API call breaks immediately when an agent does sequential work. The agent runs cd /var/www/html in one turn and npm install in the next. A naive stateless harness executes that second command in a fresh shell, in the container’s default working directory rather than /var/www/html. The failure is silent if you’re not watching closely, and you’re usually not watching closely in production.

The obvious fix, maintaining long-lived stateful container sessions per agent task, creates a different problem. In an auto-scaling environment, sticky sessions pinned to specific gateway instances become architectural bottlenecks. Under load they become single points of failure.

The approach that holds up is to push statefulness into the execution payload, not the infrastructure. The orchestration engine compiles sequential local actions into atomic, multi-command execution blocks (cd /path && npm install). When an agent emits a terminal intent, the engine injects the full environment state into an ephemeral container token for that specific execution. To prevent Turn 2 from running against a blank slate where previous file modifications are missing, the runtime containers attach to a shared persistent volume anchored to the agent’s session ID, whether that’s a ReadWriteMany PVC in a Kubernetes environment, an NVMe-over-Fabrics pool if your infrastructure supports it, or equivalent shared block storage. The compute layer stays stateless and disposable, while the file state follows the session deterministically. To keep this stable under concurrent operations, the harness enforces a serialized FIFO queue per session ID. If an agent requires parallel sub-agents, they spin up isolated sub-sessions with cloned ephemeral volumes rather than competing for a single file lock.
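To illustrate the "state in the payload" idea, here is a minimal sketch of compiling a session's tracked state into one atomic command string. The `SessionState` shape and both function names are hypothetical, and the `cd` tracking is deliberately simplistic. The shared volume and the per-session FIFO queue are not shown.

```python
import posixpath
import shlex
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Per-session shell state kept outside the gateway process (for example in Redis)."""
    cwd: str = "/"
    env: dict[str, str] = field(default_factory=dict)


def record_command(state: SessionState, command: str) -> None:
    """Update tracked state. Only a bare `cd <path>` is understood in this sketch."""
    tokens = shlex.split(command)
    if len(tokens) == 2 and tokens[0] == "cd":
        # Resolve relative paths against the tracked working directory.
        state.cwd = posixpath.normpath(posixpath.join(state.cwd, tokens[1]))


def compile_block(state: SessionState, command: str) -> str:
    """Prefix the command with the session's environment and working directory."""
    exports = [f"export {key}={shlex.quote(value)}" for key, value in sorted(state.env.items())]
    return " && ".join(exports + [f"cd {shlex.quote(state.cwd)}", command])
```

With this shape, the agent's second turn (npm install) is sent to a disposable container as cd /var/www/html && npm install, so any container can run it as long as the session volume is mounted.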

For engineers: The hard part is the handoff when a long-running agent task spans a gateway restart or a deployment. Store environment variables and the current working directory path in your task queue’s state layer or a dedicated Redis key tied to the session ID, not in gateway instance memory. Any state that resets on process restart will surface as a broken terminal session at the worst possible time.

The Safety Isolation Problem

Text-based safety filtering for shell access does not work in production. Blocking rm -rf at the string-matching layer is the start of a cat-and-mouse game that Linux utilities will win. Destructive commands can be expressed through environment variable expansion, base64-encoded payloads, shell aliases, and combinations of individually benign-looking operations. An agent can reach the same destructive outcome through a dozen paths that pass naive text filters.

The harness has to enforce isolation at the kernel level, not the text layer. Sandboxed execution environments using gVisor or WebAssembly runtimes, combined with kernel-level system call monitoring via eBPF, enforce least-privilege Unix permissions at the OS layer. The destructive operation is blocked before it executes, regardless of how it was expressed.

This matters beyond the obvious security case. It’s the only architecture that lets you apply uniform safety policies across all tool types. A CLI command and an MCP server action can be subject to the same policy enforcement because the enforcement happens below the transport layer. A tool tagged destructive: true requires human-in-the-loop confirmation whether it came in as a shell command or a structured MCP call. You write the policy once and it applies everywhere.
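The "write the policy once" property can be sketched as a decision function that never looks at the transport. This covers only the policy decision, not the kernel-level enforcement underneath it, and the `ToolCall` fields and `Decision` values are illustrative assumptions rather than any existing API.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_CONFIRMATION = "needs_confirmation"


@dataclass(frozen=True)
class ToolCall:
    name: str
    transport: str  # "cli", "mcp", "rest", ... recorded for audit only
    destructive: bool = False


def evaluate(call: ToolCall, human_approved: bool = False) -> Decision:
    # The transport is deliberately not consulted: one rule covers every tool type.
    if call.destructive and not human_approved:
        return Decision.NEEDS_CONFIRMATION
    return Decision.ALLOW
```

A shell command and an MCP action with the same destructive tag get the same answer, which is the behavior the section argues for.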

For engineers: gVisor latency overhead varies significantly by workload type. For syscall-light tasks, the penalty is often 20 to 30 percent. For I/O-heavy tasks like running package managers or grepping large log volumes, the overhead can reach 2x to 5x in practice, and pre-warmed pools only partially offset that. Profile your specific workloads before committing to gVisor across all tool types. A hybrid model (gVisor for high-risk tool categories, standard containers with eBPF monitoring for read-heavy operations) often produces better overall performance than a uniform policy.

The Auth Mapping Problem

CLI tooling handles authentication through local credential files, environment variables, or session state baked into the container. This works on a developer’s laptop. At platform scale it produces a credential management situation that is genuinely hard to audit. When something goes wrong at 2am, “which credential was in scope for that container during that agent turn” is not a question you want to be reconstructing from scattered config files.

The harness handles identity at the platform layer. Enterprise authentication (OIDC, OAuth 2.0, service account tokens) is mapped to each registered endpoint at registration time. When an agent calls a tool, the harness executes a least-privilege token exchange, taking the enterprise identity token and generating an ephemeral token scoped strictly to the resource, path, and duration required for that single action. The agent never touches credentials directly.

The latency problem with this approach is the synchronous round-trip to the identity provider on every tool call. Against AWS STS or HashiCorp Vault at high call volume, that overhead compounds. The harness needs an encrypted token cache local to the execution gateway, holding ephemeral tokens for the duration of the agent’s task session. When the central identity layer detects a policy violation mid-turn, it broadcasts an invalidation event across the fleet. Because a distributed pub/sub broadcast can fail during a network partition, gateways use pessimistic validation: cached tokens carry ultra-short TTLs, typically under sixty seconds, with a deterministic cryptographic check as the fallback if the gateway loses contact with the auth cluster.
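A minimal sketch of the gateway-local cache with pessimistic TTLs follows. The `mint` callable is a hypothetical stand-in for the real token exchange against the identity provider, the 45-second default is just one value under the sixty-second bound mentioned above, and encryption at rest and the cryptographic fallback check are omitted.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """In-memory cache of ephemeral tokens with short, pessimistic TTLs."""

    def __init__(
        self,
        mint: Callable[[str], str],
        ttl_seconds: float = 45.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._mint = mint
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def get(self, scope: str) -> str:
        now = self._clock()
        entry = self._entries.get(scope)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no round-trip to the identity provider
        token = self._mint(scope)
        self._entries[scope] = (token, now + self._ttl)
        return token

    def invalidate(self, scope: Optional[str] = None) -> None:
        """Handle an invalidation broadcast: drop one scope, or everything."""
        if scope is None:
            self._entries.clear()
        else:
            self._entries.pop(scope, None)
```

Because entries expire on their own, a gateway that misses the invalidation broadcast during a partition is only exposed for one TTL window.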

The TTL creates a real problem for long-running agent tasks. A ten-minute task will cycle through several token refreshes mid-turn. Because a running process’s environment cannot be changed from the outside once it has started, a credential injected as an environment variable at startup cannot be refreshed mid-turn, and the process keeps presenting the old token after it has expired. The harness has to handle those refreshes transparently using file-mounted tokens or a local loopback credential proxy inside the sandbox, updating the secret on disk dynamically so active processes can read the fresh token without blocking the agent or passing credentials through the execution context where the model could see them. This needs explicit design, not an afterthought.
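For the file-mounted variant, one common way to rotate a secret without readers ever seeing a half-written file is write-then-rename. The sketch below assumes a POSIX filesystem where the temporary file and the target live in the same directory. The function names and the token path are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_token_atomically(path: Path, token: str) -> None:
    """Replace the token file in one step so readers see the old or the new token, never a mix."""
    # mkstemp creates the file readable and writable only by the owner.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".token-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)  # atomic rename within the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_token(path: Path) -> str:
    """What a process inside the sandbox does before each authenticated call."""
    return path.read_text().strip()
```

The process inside the sandbox re-reads the file on each call instead of caching the value, so a refresh takes effect without a restart.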

For engineers: The audit trail this architecture produces is one of the things that makes the difference in enterprise security reviews. Every tool call is authenticated through a single identity layer, which means every tool call appears in a single audit log regardless of whether the underlying tool is an MCP server, a shell command, or a direct REST call. If you are building for enterprise deployment, design the audit schema first and work backward to the token exchange architecture. The schema you choose determines what questions you can answer after an incident, and you will have incidents.

The Complexity Centralization Problem

This is the one that’s worth being direct about, because it doesn’t appear in most architectural writeups on this topic.

Building a transport-agnostic harness does not eliminate integration complexity. It centralizes it. You are trading distributed technical debt spread across twelve ad-hoc integrations for a single, well-designed system that you now own and must maintain. If the harness has a bug, everything breaks at once. If the schema normalization layer has an edge case, it affects every tool in the catalog simultaneously.

The way to manage this is strict interface contracts between the translation layer and the core orchestration logic. Each transport adapter (OpenAPI, MCP, GraphQL, raw CLI) owns the translation from its native format into the harness’s internal semantic contract. To minimize blast radius, these adapters must be decoupled from the core routing logic as independently deployable, isolated micro-modules. When a specific upstream tool updates its specification, only that micro-adapter changes and rolls out. If an edge case brings down a single adapter, it degrades that capability in isolation and leaves the orchestration core unaffected.

Adapter versioning is necessary but not sufficient. The harder problem is runtime schema drift: the OpenAPI adapter gets updated to v2 of a tool’s spec while the retrieval index was built against v1. The agent gets stale tool definitions injected and either calls with wrong parameters or gets a tool call error that’s difficult to trace back to an index freshness problem. The retrieval index needs to be versioned alongside the adapter registry, with a reconciliation job that detects mismatches and either reindexes or flags the discrepancy before it surfaces as a production failure.
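The reconciliation job can start as a plain comparison between two version maps. This sketch assumes both the adapter registry and the retrieval index can report a schema version per tool name. The dictionary shapes and the bucket names are illustrative.

```python
def find_drift(
    adapter_versions: dict[str, str],
    indexed_versions: dict[str, str],
) -> dict[str, list[str]]:
    """Compare the adapter registry against the retrieval index, by tool name."""
    stale = sorted(
        tool
        for tool, version in adapter_versions.items()
        if tool in indexed_versions and indexed_versions[tool] != version
    )
    missing = sorted(tool for tool in adapter_versions if tool not in indexed_versions)
    orphaned = sorted(tool for tool in indexed_versions if tool not in adapter_versions)
    # stale -> reindex; missing -> index for the first time; orphaned -> remove from the index
    return {"stale": stale, "missing": missing, "orphaned": orphaned}


# Hypothetical snapshot: the adapter moved to v2 while the index still holds v1.
registry = {"billing.create_invoice": "v2", "crm.get_account": "v1", "ops.restart_job": "v1"}
index = {"billing.create_invoice": "v1", "crm.get_account": "v1", "legacy.export": "v3"}
drift = find_drift(registry, index)
```

Running this on a schedule, and before each adapter rollout completes, turns a hard-to-trace tool call error into an explicit list of tools to reindex.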

This also means the harness is the right place to invest in observability. Log every tool call with the transport type, the adapter version, the schema version used for retrieval, and the execution result. Loose application logs are not enough here. Tracing an error across an abstraction layer, a local loopback auth proxy, and a gVisor sandbox requires an explicit distributed tracing framework using OpenTelemetry context propagation. The agent session trace ID needs to be injected as an immutable parent span across the entire execution boundary, from the LLM orchestration layer down through the sandbox kernel. Without that telemetry thread, debugging transient latency spikes or timeouts under load is close to impossible.

The Structural Trade-offs

Worth being explicit about what you’re accepting when you build this.

Raw terminal access lets an agent chain complex operations through standard Unix pipes in a single turn. The harness breaks that open-ended composability by design. You’re choosing predictability and policy enforcement over multi-stage shell piping. That’s a conscious trade, not a limitation to engineer around.

You also pay a token cost for version parity. Models carry pre-existing training on common CLIs like kubectl and aws-cli, which means raw terminal execution costs nothing in context overhead for well-known stable tools. But that same training data is the source of hallucinated flag syntax on less common tools or tools that have changed their interface since the training cutoff. Dynamic tool injection via RAG costs tokens on every turn, but it ensures the agent is working from your actual current spec rather than a model’s best guess at it. For internal tooling with no training signal at all, the RAG cost is not a tradeoff. It’s the only way to give the agent accurate tool knowledge.

The distributed vs. centralized failure mode tradeoff is the one teams underestimate most. Fragmented per-team integrations accumulate quietly. The harness breaks loudly, all at once, when something goes wrong. That’s actually the better failure mode for a platform you’re operating at scale, but it requires a different kind of operational discipline than most teams have built before they need it.

The Multi-Tenancy Problem

Most architectural writing on harnesses assumes a single team owns the full tool catalog. At enterprise scale that assumption breaks quickly. When multiple product teams share the same harness, you have three problems that don’t exist in the single-tenant case.

Tool catalog isolation: Team A’s internal Salesforce integration should not be retrievable by Team B’s agent. The retrieval index needs tenant-scoped partitioning with strict, tenant-isolated vector namespaces. Naming conventions drift and get violated; partitioning has to be enforced at the infrastructure layer, not in a shared index where a poorly named tool from one tenant can pollute retrieval results for another. Before tools hit the model’s context window, the translation layer should prepend deterministic tenant prefixes to the tool names themselves (for example, tenantA_get_user). Without explicit namespacing, cross-tenant semantic drift will cause embedding collapse and your dynamic tool RAG will pull the wrong endpoint silently.

Auth boundary enforcement: The least-privilege token exchange described earlier works cleanly in a single-tenant model where one enterprise identity maps to one set of permitted endpoints. In a multi-tenant harness, the identity layer has to enforce that Team A’s agent can only exchange tokens for endpoints registered under Team A’s tenant scope. Cross-tenant tool calls need to be impossible by construction, not by policy documentation that someone has to remember to follow.

The same principle extends to execution isolation. Pre-warmed sandbox pools should remain generic at the infrastructure layer, with tenant identity, volumes, and cryptographic keys hot-swapped in at the moment of execution. This keeps operational readiness decoupled from idle compute costs, and ensures that no tenant’s execution context bleeds into another’s even under concurrent load.
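A small sketch of "impossible by construction" for the catalog and auth boundaries: tools are stored per tenant, names are prefixed deterministically, and resolution only ever searches the caller's own partition. The class and method names are hypothetical, and a real system would back this with tenant-isolated vector namespaces and token scopes rather than an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredTool:
    tenant: str
    name: str


class TenantCatalog:
    def __init__(self) -> None:
        # One partition per tenant. There is no shared, cross-tenant lookup path.
        self._by_tenant: dict[str, dict[str, RegisteredTool]] = {}

    def register(self, tenant: str, name: str) -> str:
        qualified = f"{tenant}_{name}"  # deterministic prefix, e.g. tenantA_get_user
        self._by_tenant.setdefault(tenant, {})[qualified] = RegisteredTool(tenant, name)
        return qualified

    def visible_tools(self, tenant: str) -> list[str]:
        """Tool names eligible for injection into this tenant's agent context."""
        return sorted(self._by_tenant.get(tenant, {}))

    def resolve(self, tenant: str, qualified_name: str) -> RegisteredTool:
        tool = self._by_tenant.get(tenant, {}).get(qualified_name)
        if tool is None:
            raise PermissionError(f"{qualified_name} is not registered for tenant {tenant}")
        return tool
```

Because the caller's tenant selects the partition before any name lookup happens, Team B's agent cannot resolve tenantA_get_user even if the model emits that exact name.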

I’ll be direct: multi-tenancy is where the complexity centralization problem compounds most sharply. The patterns above are tractable for a single team. Adding tenancy boundaries across retrieval, auth, execution isolation, and observability is a meaningful additional engineering surface. Teams building toward enterprise platform use should treat multi-tenancy as a first-class design constraint from day one, not a feature to add later.

What Comes Next

The broad adoption of MCP as a packaging standard was a genuinely useful development. Getting the industry to agree on a common way to describe and expose agent tools is valuable infrastructure. But the protocol was never the destination. It’s a packaging format. The destination is an execution environment sophisticated enough to treat packaging as an implementation detail.

The teams I’ve seen pull ahead in agent infrastructure aren’t the ones who made the right protocol call in 2024. Most of them made the same messy choices everyone else did. What separated them was how quickly they made the transport question boring. Once the harness exists, the protocol debate becomes a one-afternoon integration task instead of a six-month architectural commitment. That’s the actual leverage: not the architecture itself, but the organizational freedom it creates. The engineers who were arguing about MCP vs. CLI are now building agent logic. That’s where the compounding starts.

The Builder, the Critic, and the Circuit Breaker: How I’d Design AI Agents That Don’t Bankrupt You


Most multi-agent architectures look elegant on the whiteboard. Agent A generates the output. Agent B judges it against a strict checklist. No AI grading its own homework, clean separation of concerns, autonomous iteration until the job is done.

Then you open the billing dashboard.

Your agents are locked in a loop, Agent A revising, Agent B rejecting, neither making meaningful progress, while your token costs compound by the minute. This failure mode has been documented across production agent deployments at companies ranging from early-stage startups to large enterprises, and it almost never shows up in the architecture review.

The system passes every test in a supervised environment. It breaks in a specific way when nobody is watching.

That distinction matters more than it might seem. We are in the middle of a fundamental shift in how AI is actually deployed, from synchronous AI, where a human waits for a response and can intervene at any point, to unattended autonomy, where agents run complex multi-turn workflows in the background, entirely out of sight. When a human isn’t watching the screen, agents can quietly go to war with each other. You end up funding both sides.

The design constraint that most teams miss isn’t the AI itself. It’s the discipline of using thin, deterministic code to cage non-deterministic intelligence, knowing when to pull the plug, and building the infrastructure to do that gracefully before the token burn compounds.

What follows is the architecture required to enforce boundaries on unattended autonomy, and the reasoning behind every decision.

The Two Failure Modes of Cognitive Deadlock

The two primary failure modes of multi-agent systems are near-opposites, and a naive fix for one reliably triggers the other, trapping your system in a Cognitive Deadlock.

Sycophancy is the first. AI models are naturally drawn to fluent, confident-sounding text. Agent B, your critic, will frequently approve Agent A’s output simply because it reads well on the surface, missing underlying logic errors, hallucinated facts, or reasoning gaps. You think you have a quality gate. You actually have a mutual appreciation society.

Token Churn is what happens when you try to fix that. You instruct Agent B to be harsher: find flaws, if you don’t find one you’ve failed. Now it rejects everything. Agent A revises but hits the ceiling of its own capability, returning a marginally different version. Agent B rejects again.

Software agents have no concept of time, urgency, or money. A human stuck in this loop would eventually say, “I don’t think we’re getting anywhere.” An autonomous agent will simply keep burning tokens indefinitely at your expense.

One nuance worth carrying into the architecture: these failure modes don’t always arrive sequentially. On open-ended analytical or creative tasks, both can coexist within the same loop, with Agent B approving weak outputs on some rubric criteria while rigidly rejecting on others. The failure landscape in practice is messier than a clean either/or, and the system needs to account for that from the start.

Why More AI Makes This Worse

The instinct when something breaks in an AI system is to add more AI. Another monitoring model. Another validation layer. Another agent to watch the agents.

This is almost always the wrong move. You’re solving an orchestration problem, which is fundamentally about enforcing hard boundaries, with a tool designed for open-ended reasoning. The result is compounding unpredictability at compounding cost.

The fix is simpler and older: a thin layer of traditional, deterministic code sitting above the models. It doesn’t reason or deliberate. It counts, measures, and cuts. Think of it less like a manager and more like a circuit breaker on an electrical panel: it has no understanding of what’s flowing through the system, but precise knowledge of when to shut it off.

This is Architectural Minimalism applied to agent systems: the sharpest possible line between what AI calculates and what traditional code enforces. The models handle open-ended reasoning. The orchestration layer holds the edges rigid. Three patterns form the core of that layer, each one a direct response to a failure mode that emerges without it.

Pattern One: The Hard Cap

The simplest and most important constraint: set a maximum iteration count and make it unconditional.

Four to five rounds is the absolute ceiling for enterprise workflows. If the loop hasn’t resolved by then, the agents aren’t converging on an answer; they’re producing token churn. The orchestrator ends the loop regardless of where things stand.

This feels blunt. That’s the point. The value of a hard cap isn’t intelligence; it’s unconditional enforcement. No amount of confident output from either agent can override a counter hitting five.

For engineers: The iteration counter must live in persistent state completely outside the model context, where agents have no visibility into it and cannot influence it. On each cycle, before invoking either model, the orchestrator checks the counter and raises a LoopLimitExceeded exception if the threshold is met. This state must survive process restarts; an in-memory counter that resets on an unhandled exception defeats the purpose entirely. Use your task queue’s native state store or a lightweight Redis key with a TTL set conservatively above your maximum expected loop duration. Tag every loop invocation with a correlation ID from initialization; you’ll need it for the observability layer, and retrofitting it later is painful.
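A minimal sketch of that check might look like the following. The names InMemoryStore, begin_round, and MAX_ROUNDS are hypothetical, and the in-memory store is only a stand-in so the example runs on its own; as noted above, a real deployment would need a store that survives restarts. Whether the fifth round runs or is itself the cut is a policy choice; this sketch allows five model rounds and blocks the sixth.

```python
class LoopLimitExceeded(Exception):
    """Raised when a loop has used up its iteration budget."""


MAX_ROUNDS = 5  # assumed ceiling; the text suggests four to five


class InMemoryStore:
    """Demo stand-in for a persistent counter store.

    Hypothetical interface. Production code would back this with the
    task queue's state store or a Redis key with a TTL.
    """

    def __init__(self):
        self._data = {}

    def incr(self, key: str) -> int:
        self._data[key] = self._data.get(key, 0) + 1
        return self._data[key]


def begin_round(store, correlation_id: str, max_rounds: int = MAX_ROUNDS) -> int:
    """Count the round and fail before any model is invoked."""
    round_no = store.incr(f"loop:{correlation_id}:rounds")
    if round_no > max_rounds:
        raise LoopLimitExceeded(
            f"loop {correlation_id} exceeded {max_rounds} rounds"
        )
    return round_no
```

The orchestrator would call begin_round at the top of every cycle, so the cap is enforced by code the agents never see.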

Pattern Two: Stagnation Detection

Agents frequently fall into token churn well before hitting the hard cap. Agent A stops making structural changes and starts rewriting surface prose: “furthermore” becomes “in addition,” and paragraphs get reorganized without changing substance. From the outside it looks like active progress. It isn’t. And it burns tokens at the same rate as genuine work.

The orchestrator catches this by measuring how much content actually changes between rounds. When the delta drops below a meaningful threshold, it recognizes the agent has exhausted its ideas and breaks the loop early, before the hard cap is reached.

The starting threshold sits at roughly 5%, derived from a straightforward observation: meaningful structural revision typically moves at least 15 to 20% of content, while token churn clusters near zero. The 5% line sits near the bottom of the gap between them, low enough that slow but genuine structural revision is not mistaken for churn. That said, the right threshold depends on your output type, and the only honest way to calibrate it is to instrument first and tune from data.

For engineers: Choose your similarity metric carefully. Token overlap and normalized edit distance behave differently depending on output structure, and picking wrong silently breaks your stagnation detection. For open-ended prose, token overlap is more stable. For structured enterprise outputs, such as JSON schemas, legal clauses, or templated compliance documents, normalized edit distance is more sensitive to meaningful changes.
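As a rough illustration, both kinds of metric can be sketched with the Python standard library. The function names are hypothetical, token_delta uses Jaccard overlap on whitespace-split tokens as one possible reading of "token overlap", and sequence_delta uses difflib's similarity ratio as a stand-in for normalized edit distance rather than a true Levenshtein implementation.

```python
from difflib import SequenceMatcher

STAGNATION_THRESHOLD = 0.05  # starting point from the text; tune from data


def token_delta(prev: str, curr: str) -> float:
    """Change score: 1 minus the Jaccard overlap of the two token sets."""
    a, b = set(prev.split()), set(curr.split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def sequence_delta(prev: str, curr: str) -> float:
    """Change score: 1 minus difflib's character-level similarity ratio."""
    return 1.0 - SequenceMatcher(None, prev, curr).ratio()


def is_stagnant(prev: str, curr: str, metric=token_delta,
                threshold: float = STAGNATION_THRESHOLD) -> bool:
    """True when the change between rounds falls below the threshold."""
    return metric(prev, curr) < threshold
```

Passing the metric as a parameter keeps the choice between the two swappable once calibration data is available.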

Calibration requires labeled examples: a set of round-pairs that a domain expert has manually classified as either genuine revision or surface churn. Fifty to one hundred labeled pairs is typically enough to validate your metric choice and threshold. Run both metrics against this set, compare precision and recall, and commit to whichever performs better on your specific output type. This step takes a day and saves weeks of silent miscalibration in production. A threshold that fires too early causes unnecessary rollbacks. One that never fires is invisible until you audit the bill, by which point the cost is already spent.
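The comparison described above can be sketched as a small scoring helper. The function name and the shape of the inputs are assumptions for illustration: deltas holds the change score for each labeled round-pair, and is_churn holds the expert's label for the same pair.

```python
def precision_recall(deltas, is_churn, threshold):
    """Score a stagnation threshold against expert labels.

    A pair is flagged as churn when its delta is below the threshold.
    Returns (precision, recall) for the churn class.
    """
    flagged = [d < threshold for d in deltas]
    tp = sum(1 for f, c in zip(flagged, is_churn) if f and c)
    fp = sum(1 for f, c in zip(flagged, is_churn) if f and not c)
    fn = sum(1 for f, c in zip(flagged, is_churn) if not f and c)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Running this once per metric and per candidate threshold over the labeled set gives a direct basis for picking one. Low precision corresponds to the threshold that fires too early; low recall corresponds to the one that never fires.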

Enterprise constraint: Running text-distance calculations on large outputs inside the orchestrator can introduce latency. Compute the delta inline for outputs under roughly 10,000 tokens, and offload it to an async worker for longer documents.

Pattern Three: Temperature Stepping

Temperature controls how deterministic or exploratory a model’s output is. Low temperature produces focused, precise responses. High temperature introduces variance, and occasionally breaks a model out of a logical rut it cannot escape through refinement alone.

Rather than holding temperature constant, the orchestrator steps it based on loop position:

  • Rounds 1–2: Low temperature. Focused, precise output.
  • Round 3: The orchestrator injects a direct intervention prepended to the system message: “You have been rejected twice. Do not refine your previous approach, abandon it and try something structurally different.”
  • Round 4: Temperature is increased, forcing genuine exploration rather than iteration on a failing strategy.
  • Round 5: Hard cut.

The architecture mirrors a core tenet of classical optimization: a system stuck in a local minimum needs a controlled injection of variance to escape. That is the intuition behind simulated annealing, and it applies meaningfully to language model behavior, with one difference: annealing starts hot and cools, whereas this schedule starts cold and heats up only after refinement has failed. A model stuck at low temperature on a problem it cannot solve will keep producing the same wrong answer. The temperature bump is a last resort before human escalation: not guaranteed to work, but cheap enough to always be worth attempting.

For engineers: Set temperature at the API call layer, not within the prompt. The Round 3 system message injection should be prepended, not appended, since position in the prompt can affect how strongly a model weights an instruction. Log temperature values alongside each draft in your run record; when debugging a failed loop, temperature trajectory is often the fastest signal for distinguishing genuine exploration from a stuck model.
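The schedule can be expressed as a pure function that the orchestrator consults before each call. This is a sketch under assumptions: plan_round and RoundPlan are hypothetical names, the 0.2 and 0.9 temperature values are placeholders rather than recommendations, and keeping the intervention text in place for Round 4 is a choice the schedule above does not specify.

```python
from dataclasses import dataclass

INTERVENTION = (
    "You have been rejected twice. Do not refine your previous approach, "
    "abandon it and try something structurally different."
)


@dataclass(frozen=True)
class RoundPlan:
    temperature: float   # passed to the model API call, not the prompt
    system_prompt: str


def plan_round(round_no: int, base_system_prompt: str,
               low: float = 0.2, high: float = 0.9) -> RoundPlan:
    """Return the temperature and system prompt for a given round."""
    if round_no < 1:
        raise ValueError("rounds are numbered from 1")
    if round_no >= 5:
        raise ValueError("round 5 is the hard cut; no model call is planned")
    if round_no <= 2:
        return RoundPlan(low, base_system_prompt)
    # Rounds 3 and 4: intervention is prepended, ahead of the base prompt.
    prompt = f"{INTERVENTION}\n\n{base_system_prompt}"
    if round_no == 3:
        return RoundPlan(low, prompt)
    return RoundPlan(high, prompt)  # round 4: raise temperature
```

Because the function has no side effects, the value it returns can be written straight into the run record for that round.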

Enterprise constraint: Several corporate API gateways restrict per-call temperature adjustments. If that’s your environment, substitute Prompt Hardening: at Round 3, replace the system prompt entirely, switching from an open-ended coaching prompt to an aggressive few-shot prompt that mandates strict structural compliance. The mechanism differs; the intent is identical.

Instrumentation: The Operational Audit Trail

Treat observability as a first-class design concern, not something to retrofit after the first production incident.

Without it, the circuit breaker layer is a black box. You know it fired. You don’t know which pattern triggered it, how often, on what input types, or whether your thresholds are correctly calibrated. You cannot improve what you cannot see, and in an unattended system, you may not notice what’s wrong until the bill arrives.

Log the following for every loop run: correlation ID, total iterations, exit reason (hard cap, stagnation, success, or human escalation), delta score per round, Agent B rubric result per round, temperature per round, and whether a best-effort draft was delivered or a handoff was triggered.
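One way to keep those fields consistent is to define the run record as a typed structure and serialize it once per loop. The class and field names below are hypothetical; only the list of fields comes from the text above.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List, Optional


class ExitReason(str, Enum):
    HARD_CAP = "hard_cap"
    STAGNATION = "stagnation"
    SUCCESS = "success"
    HUMAN_ESCALATION = "human_escalation"


@dataclass
class RoundRecord:
    delta_score: Optional[float]  # None for round 1 (nothing to compare)
    rubric_passed: int
    rubric_total: int
    temperature: float


@dataclass
class LoopRunRecord:
    correlation_id: str
    exit_reason: ExitReason
    rounds: List[RoundRecord] = field(default_factory=list)
    best_effort_delivered: bool = False
    handoff_triggered: bool = False

    def to_json(self) -> str:
        """Serialize to one JSON log line, deriving total_iterations."""
        payload = asdict(self)
        payload["exit_reason"] = self.exit_reason.value
        payload["total_iterations"] = len(self.rounds)
        return json.dumps(payload)
```

Restricting exit_reason to an enum means the dashboard's "circuit break frequency by type" can be a simple group-by rather than a string-matching exercise.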

These logs should aggregate into a simple operational dashboard displaying iteration count distribution, circuit break frequency by type, rollback rate, and handoff rate. Within a few hundred runs you’ll have enough signal to tune thresholds from data rather than intuition. This log also becomes your audit trail when a stakeholder asks why a specific high-value output was delivered as a partial draft rather than a completed one.

When the Circuit Breaks: Delivering Something Useful

Stopping the loop prevents runaway cost. But there’s still a user waiting for a result, and “the AI gave up” is not a product experience. Graceful failure is an engineering requirement, not an afterthought.

Save the best draft, not the last one. The orchestrator caches every draft alongside the structured rubric score Agent B assigned it. If Round 2 produced a draft passing 85% of quality criteria and later rounds failed to improve on it, the system rolls back and delivers Round 2. The user gets an imperfect but usable result instead of an error screen. In most business contexts, 85% complete is a workable starting point, not a failure.

That percentage is only meaningful if the rubric behind it is well-defined. Agent B’s checklist needs explicit, binary criteria: not “is this good?” but “does this section contain a pricing breakdown?” Yes or no. The score is passed criteria divided by total criteria.

One example of a criterion that looks binary but isn’t: “is the tone professional?” That’s a judgment call dressed as a checkbox. Two evaluations of the same document will produce different scores, which corrupts your rollback logic and makes your quality threshold meaningless. The test for a valid rubric criterion is whether two different reviewers, given the same document, would independently reach the same answer. If they wouldn’t, decompose the criterion until they would, before it touches production.
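The scoring and rollback rules described in this section can be sketched in a few lines. Draft and best_draft are hypothetical names, and breaking ties in favor of the earlier round is an assumption, not something the text prescribes.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Draft:
    round_no: int
    text: str
    criteria: Dict[str, bool]  # binary rubric results from Agent B

    @property
    def score(self) -> float:
        """Passed criteria divided by total criteria."""
        if not self.criteria:
            return 0.0
        return sum(self.criteria.values()) / len(self.criteria)


def best_draft(drafts: List[Draft], minimum: float = 0.0) -> Optional[Draft]:
    """Highest-scoring draft at or above the minimum, else None.

    Ties go to the earlier round (an assumed policy).
    """
    eligible = [d for d in drafts if d.score >= minimum]
    if not eligible:
        return None
    return max(eligible, key=lambda d: (d.score, -d.round_no))
```

A None result is the signal to skip delivery and move to the failure message and human handoff.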

Communicate failures in human terms. If no draft clears the minimum threshold, the response should be specific and actionable: “We generated 80% of your proposal but encountered a conflict in the pricing section. The draft has been saved and flagged for review.” Structure this as a typed failure object that downstream systems can parse and route, not a string that gets logged and forgotten.

Hand off to a human with full context. When the loop breaks, the orchestrator packages the original prompt, the highest-scoring draft, the specific failed rubric criteria from Agent B, and the execution trace. A human reviewer sees precisely where the AI got stuck, fixes that specific gap, and approves. Targeted human judgment applied at the exact point of failure, not a restart from scratch.
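A typed failure object that also carries the handoff context might look like the sketch below. HandoffPackage and its fields are hypothetical, and reporting the rubric pass rate as the "percent generated" in the user message is an assumption about how the two numbers relate.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class HandoffPackage:
    correlation_id: str
    exit_reason: str
    original_prompt: str
    best_draft: str
    best_score: float            # passed criteria / total criteria
    failed_criteria: List[str]   # names of the rubric items that failed
    execution_trace: List[str] = field(default_factory=list)

    def user_message(self) -> str:
        """Human-readable summary derived from the typed fields."""
        pct = round(self.best_score * 100)
        failed = ", ".join(self.failed_criteria) or "none recorded"
        return (
            f"We generated {pct}% of your request but could not satisfy: "
            f"{failed}. The draft has been saved and flagged for review."
        )

    def to_json(self) -> str:
        """Machine-readable form for downstream routing."""
        return json.dumps(asdict(self))
```

The message shown to the user is generated from the same object that downstream systems parse, so the two cannot drift apart.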

The Enterprise Agent Paradox

This handoff introduces the sharpest organizational challenge in agentic system design, and it’s one the architecture alone cannot solve.

If an autonomous agent circuit-breaks on a high-value client document, who gets the notification? What is their SLA to respond? If a human reviewer takes four hours to log in, review, and patch the 15% gap the AI couldn’t close, has your unattended autonomy actually saved the enterprise any time, or did it just shift the operational friction downstream?

The bottleneck of an agentic system is rarely the AI. It is your human routing architecture.

The practical resolution is Omnichannel Exception Routing: do not build custom review interfaces from scratch. Export standard schemas, such as OpenTelemetry events or typed webhooks, that inject circuit-break exceptions directly into your existing enterprise ticketing infrastructure, like ServiceNow, Jira, or whatever queue your team already monitors. The human side of the handoff needs to live where human attention already lives, governed by SLAs that already exist.
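As an illustration of a typed webhook body, the sketch below builds a plain JSON-serializable event with a response deadline attached. The event type string, field names, and build_exception_event are all hypothetical; this is not the schema of ServiceNow, Jira, or OpenTelemetry, and the mapping into a real ticketing system would follow that system's own documentation.

```python
import json
from datetime import datetime, timedelta, timezone


def build_exception_event(correlation_id, exit_reason, failed_criteria,
                          sla_hours, now=None):
    """Build a circuit-break event with a respond-by time from the SLA."""
    now = now or datetime.now(timezone.utc)
    return {
        "type": "agent.circuit_break",  # hypothetical event name
        "correlation_id": correlation_id,
        "exit_reason": exit_reason,
        "failed_criteria": list(failed_criteria),
        "raised_at": now.isoformat(),
        "respond_by": (now + timedelta(hours=sla_hours)).isoformat(),
    }


if __name__ == "__main__":
    event = build_exception_event(
        "run-1", "hard_cap", ["pricing breakdown"], sla_hours=4
    )
    print(json.dumps(event, indent=2))
```

Carrying the deadline in the event itself lets the existing queue apply its own escalation rules without knowing anything about the agent loop.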

Building a parallel review system is an organizational change management problem disguised as an engineering task. Teams that treat it as purely technical consistently underestimate the timeline by a factor of three. Plan for that gap explicitly; the technical handoff can be built in a day, but getting the human side right typically takes weeks and involves stakeholders well outside the engineering team.

The Cost Case for Building This Upfront

A loop running to the hard cap consumes roughly five to ten times the tokens of a successful two-round completion on the same task. At scale, with hundreds or thousands of daily agent invocations, uncontrolled loops don’t just create poor user experiences. They create billing events that compound faster than most teams notice before the monthly statement arrives.

The orchestration layer described here adds minimal compute overhead. The logic is deterministic code running entirely outside the model context. The investment is engineering time, typically days to weeks to build and calibrate. The alternative is discovering your failure thresholds the hard way, at production volume, on your cloud bill.

Build the circuit breaker before you need it. By the time you need it, you’re already paying for not having it.

The Principle Underneath the Architecture

Every instinct in agent system design pulls toward more intelligence: a smarter critic, a more capable builder, another layer of AI judgment applied wherever the last layer fell short. That instinct is understandable and almost always counterproductive.

The pattern that holds up under pressure is a strict division of labor: AI handles the open-ended reasoning, traditional code enforces the boundaries, existing human workflows handle the exceptions. Not as a fallback for a broken system, but as a deliberate architectural choice in a working one.

There’s a counterintuitive implication worth carrying forward: as AI agents become more capable, the orchestration layer becomes more important, not less. A more capable agent running in an uncontrolled loop causes more damage, faster, at greater cost than a weaker one. The sophistication of the AI and the rigidity of its constraints need to scale together. Every increase in autonomy is an argument for a more robust circuit breaker, not for removing it.

Autonomy without boundaries isn’t architectural minimalism. It’s just risk with a better interface.

Build the builder. Build the critic. Make sure something that doesn’t think is always in charge of knowing when to stop.