The Builder, the Critic, and the Circuit Breaker: How I’d Design AI Agents That Don’t Bankrupt You

Most multi-agent architectures look elegant on the whiteboard. Agent A generates the output. Agent B judges it against a strict checklist. No AI grading its own homework, clean separation of concerns, autonomous iteration until the job is done.

Then you open the billing dashboard.

Your agents are locked in a loop: Agent A revising, Agent B rejecting, neither making meaningful progress, while your token costs compound by the minute. This failure mode has been documented across production agent deployments at companies ranging from early-stage startups to large enterprises, and it almost never shows up in the architecture review.

The system passes every test in a supervised environment. It breaks in a specific way when nobody is watching.

That distinction matters more than it might seem. We are in the middle of a fundamental shift in how AI is actually deployed, from synchronous AI, where a human waits for a response and can intervene at any point, to unattended autonomy, where agents run complex multi-turn workflows in the background, entirely out of sight. When a human isn’t watching the screen, agents can quietly go to war with each other. You end up funding both sides.

The design constraint that most teams miss isn’t the AI itself. It’s the discipline of using thin, deterministic code to cage non-deterministic intelligence, knowing when to pull the plug, and building the infrastructure to do that gracefully before the token burn compounds.

What follows is the architecture required to enforce boundaries on unattended autonomy, and the reasoning behind every decision.

The Two Failure Modes of Cognitive Deadlock

The two primary failure modes of multi-agent systems are near-opposites, and a naive fix for one reliably triggers the other, trapping your system in a Cognitive Deadlock.

Sycophancy is the first. AI models are naturally drawn to fluent, confident-sounding text. Agent B, your critic, will frequently approve Agent A’s output simply because it reads well on the surface, missing underlying logic errors, hallucinated facts, or reasoning gaps. You think you have a quality gate. You actually have a mutual appreciation society.

Token Churn is what happens when you try to fix that. You instruct Agent B to be harsher: “Find flaws. If you don’t find one, you’ve failed.” Now it rejects everything. Agent A revises but hits the ceiling of its own capability, returning a marginally different version. Agent B rejects again.

Software agents have no concept of time, urgency, or money. A human stuck in this loop would eventually say, “I don’t think we’re getting anywhere.” An autonomous agent will simply keep burning tokens indefinitely at your expense.

One nuance worth carrying into the architecture: these failure modes don’t always arrive sequentially. On open-ended analytical or creative tasks, both can coexist within the same loop, with Agent B approving weak outputs on some rubric criteria while rigidly rejecting on others. The failure landscape in practice is messier than a clean either/or, and the system needs to account for that from the start.

Why More AI Makes This Worse

The instinct when something breaks in an AI system is to add more AI. Another monitoring model. Another validation layer. Another agent to watch the agents.

This is almost always the wrong move. You’re solving an orchestration problem, which is fundamentally about enforcing hard boundaries, with a tool designed for open-ended reasoning. The result is compounding unpredictability at compounding cost.

The fix is simpler and older: a thin layer of traditional, deterministic code sitting above the models. It doesn’t reason or deliberate. It counts, measures, and cuts. Think of it less like a manager and more like a circuit breaker on an electrical panel: it has no understanding of what’s flowing through the system, but precise knowledge of when to shut it off.

This is Architectural Minimalism applied to agent systems: the sharpest possible line between what AI calculates and what traditional code enforces. The models handle open-ended reasoning. The orchestration layer holds the edges rigid. Three patterns form the core of that layer, each one a direct response to a failure mode that emerges without it.
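Before getting into the individual patterns, here is a minimal sketch of what that thin layer can look like. It assumes the builder and critic are plain callables (hypothetical stand-ins for real model calls) and that five rounds is the ceiling; the names run_loop, Builder, and Critic are illustrative, not from any library.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signatures: a builder takes (task, feedback) and returns a draft;
# a critic takes a draft and returns (approved, feedback).
Builder = Callable[[str, Optional[str]], str]
Critic = Callable[[str], Tuple[bool, str]]

MAX_ROUNDS = 5  # assumed ceiling; the orchestrator, not the model, owns this number


def run_loop(task: str, build: Builder, critique: Critic,
             max_rounds: int = MAX_ROUNDS) -> Tuple[str, str, int]:
    """Return (exit_reason, last_draft, rounds_used)."""
    feedback: Optional[str] = None
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = build(task, feedback)
        approved, feedback = critique(draft)
        if approved:
            return "success", draft, round_no
    # The counter decides; neither agent is asked whether it wants to continue.
    return "hard_cap", draft, max_rounds


# Stub agents that never converge, to show the loop still terminates.
def stubborn_builder(task: str, feedback: Optional[str]) -> str:
    return f"draft for {task}"


def harsh_critic(draft: str) -> Tuple[bool, str]:
    return False, "not good enough"


result = run_loop("proposal", stubborn_builder, harsh_critic)
```

Nothing in the loop depends on what the agents say about their own progress, which is the property the rest of the patterns build on.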

Pattern One: The Hard Cap

The simplest and most important constraint: set a maximum iteration count and make it unconditional.

Four to five rounds is the absolute ceiling for enterprise workflows. If the loop hasn’t resolved by then, the agents aren’t converging on an answer; they’re producing token churn. The orchestrator ends the loop regardless of where things stand.

This feels blunt. That’s the point. The value of a hard cap isn’t intelligence; it’s unconditional enforcement. No amount of confident output from either agent can override a counter hitting five.

For engineers: The iteration counter must live in persistent state completely outside the model context; agents have no visibility into it and cannot influence it. On each cycle, before invoking either model, the orchestrator checks the counter and raises a LoopLimitExceeded exception if the threshold is met. This state must survive process restarts; an in-memory counter that resets on an unhandled exception defeats the purpose entirely. Use your task queue’s native state store or a lightweight Redis key with a TTL set conservatively above your maximum expected loop duration. Tag every loop invocation with a correlation ID from initialization. You’ll need it for the observability layer, and retrofitting it later is painful.
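A minimal sketch of that counter, using SQLite from the standard library purely as a stand-in for whatever persistent store you actually run (a Redis key with a TTL or a task queue’s state store would fill the same role). The IterationCounter class and its table layout are assumptions for illustration; only the LoopLimitExceeded name comes from the text above.

```python
import sqlite3
import uuid


class LoopLimitExceeded(Exception):
    """Raised by the orchestrator when a loop reaches its iteration cap."""


class IterationCounter:
    """Iteration counter kept outside the model context, keyed by correlation ID."""

    def __init__(self, db_path: str, max_rounds: int = 5):
        self.max_rounds = max_rounds
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS loops "
            "(correlation_id TEXT PRIMARY KEY, rounds INTEGER NOT NULL)"
        )
        self.conn.commit()

    def rounds_used(self, correlation_id: str) -> int:
        row = self.conn.execute(
            "SELECT rounds FROM loops WHERE correlation_id = ?", (correlation_id,)
        ).fetchone()
        return row[0] if row else 0

    def check_and_increment(self, correlation_id: str) -> int:
        """Call before invoking either model; raises once the cap is reached."""
        used = self.rounds_used(correlation_id)
        if used >= self.max_rounds:
            raise LoopLimitExceeded(f"{correlation_id}: {used} rounds already used")
        self.conn.execute(
            "INSERT OR REPLACE INTO loops (correlation_id, rounds) VALUES (?, ?)",
            (correlation_id, used + 1),
        )
        self.conn.commit()
        return used + 1


# ":memory:" keeps the example self-contained; a file path is what would
# actually survive a process restart.
counter = IterationCounter(":memory:")
cid = str(uuid.uuid4())  # correlation ID assigned at loop initialization
completed = [counter.check_and_increment(cid) for _ in range(5)]
```

A sixth call to check_and_increment with the same correlation ID raises LoopLimitExceeded, no matter what either agent produced in the meantime.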

Pattern Two: Stagnation Detection

Agents frequently fall into token churn well before hitting the hard cap. Agent A stops making structural changes and starts rewriting surface prose: “furthermore” becomes “in addition,” paragraphs get reorganized without changing substance. From the outside it looks like active progress. It isn’t. And it burns tokens at the same rate as genuine work.

The orchestrator catches this by measuring how much content actually changes between rounds. When the delta drops below a meaningful threshold, it recognizes the agent has exhausted its ideas and breaks the loop early, before the hard cap is reached.

The starting threshold sits at roughly 5%, derived from a straightforward observation: meaningful structural revision typically moves at least 15 to 20% of content, while token churn clusters near zero. The 5% line sits near the bottom of the gap between them, so slow but genuine structural revisions are not cut off as churn. That said, the right threshold depends on your output type, and the only honest way to calibrate it is to instrument first and tune from data.

For engineers: Choose your similarity metric carefully: token overlap and normalized edit distance behave differently depending on output structure, and picking wrong silently breaks your stagnation detection. For open-ended prose, token overlap is more stable. For structured enterprise outputs, such as JSON schemas, legal clauses, or templated compliance documents, normalized edit distance is more sensitive to meaningful changes.
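To make the two metrics concrete, here is a rough sketch using only the standard library. Token overlap is implemented as Jaccard distance over whitespace tokens, and difflib’s SequenceMatcher ratio stands in for normalized edit distance (it is a related similarity measure, not a true Levenshtein distance). Function names and the tokenization are assumptions; the 0.05 threshold is the starting point discussed above.

```python
import difflib
from typing import Callable


def token_overlap_delta(prev: str, curr: str) -> float:
    """1 - Jaccard similarity over lowercase whitespace tokens (0.0 = same token set)."""
    a, b = set(prev.lower().split()), set(curr.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def edit_delta(prev: str, curr: str) -> float:
    """1 - difflib similarity ratio; an approximation of normalized edit distance."""
    matcher = difflib.SequenceMatcher(None, prev, curr, autojunk=False)
    return 1.0 - matcher.ratio()


STAGNATION_THRESHOLD = 0.05  # starting point only; tune from labeled data


def is_stagnant(prev: str, curr: str,
                metric: Callable[[str, str], float] = token_overlap_delta,
                threshold: float = STAGNATION_THRESHOLD) -> bool:
    """True when the change between two consecutive drafts falls below the threshold."""
    return metric(prev, curr) < threshold


# A 100-word draft where a single word was swapped between rounds.
base = " ".join(f"w{i}" for i in range(100))
tweak = base.replace("w0 ", "x0 ", 1)
churn_by_tokens = is_stagnant(base, tweak)
churn_by_edits = is_stagnant(base, tweak, metric=edit_delta)
```

One difference worth noticing: the token-set metric ignores word order, so a draft that only shuffles paragraphs scores as unchanged, while the edit-based metric registers the shuffle as movement. Which behavior you want depends on the output type.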

Calibration requires labeled examples: a set of round-pairs that a domain expert has manually classified as either genuine revision or surface churn. Fifty to one hundred labeled pairs is typically enough to validate your metric choice and threshold. Run both metrics against this set, compare precision and recall, and commit to whichever performs better on your specific output type. This step takes a day and saves weeks of silent miscalibration in production. A threshold that fires too early causes unnecessary rollbacks. One that never fires is invisible until you audit the bill, by which point the cost is already spent.
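The calibration step can be sketched as a small precision and recall calculation over the labeled pairs. The data shape (previous draft, current draft, expert label) and the toy examples are hypothetical; the metric here is a simple Jaccard delta so the block runs on its own.

```python
from typing import Callable, List, Tuple

# Each labeled pair: (previous_draft, current_draft, is_churn), where is_churn
# is the domain expert's judgment. Hypothetical data shape.
LabeledPair = Tuple[str, str, bool]


def jaccard_delta(prev: str, curr: str) -> float:
    a, b = set(prev.split()), set(curr.split())
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0


def precision_recall(pairs: List[LabeledPair],
                     metric: Callable[[str, str], float],
                     threshold: float) -> Tuple[float, float]:
    """Score a metric/threshold combination as a churn detector against expert labels."""
    tp = fp = fn = 0
    for prev, curr, is_churn in pairs:
        flagged = metric(prev, curr) < threshold
        if flagged and is_churn:
            tp += 1
        elif flagged and not is_churn:
            fp += 1
        elif not flagged and is_churn:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


toy_pairs: List[LabeledPair] = [
    ("a b c d", "a b c d", True),    # identical drafts: churn, and flagged
    ("a b c d", "e f g h", False),   # full rewrite: genuine revision, not flagged
    ("a b c d", "a b c x", True),    # expert says churn, but the delta is 0.4: missed
]
precision, recall = precision_recall(toy_pairs, jaccard_delta, threshold=0.05)
```

On this toy set the detector never raises a false alarm but misses half the churn, which is the kind of result that would push you to try a different metric or a higher threshold before production.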

Enterprise constraint: Running text-distance calculations on large outputs inside the orchestrator can introduce latency. Handle this efficiently inline for outputs under roughly 10,000 tokens, or offload to an async worker for longer documents.

Pattern Three: Temperature Stepping

Temperature controls how deterministic or exploratory a model’s output is. Low temperature produces focused, precise responses. High temperature introduces variance, and occasionally breaks a model out of a logical rut it cannot escape through refinement alone.

Rather than holding temperature constant, the orchestrator steps it based on loop position:

Rounds 1–2: Low temperature. Focused, precise output.
Round 3: The orchestrator injects a direct intervention prepended to the system message: “You have been rejected twice. Do not refine your previous approach. Abandon it and try something structurally different.”
Round 4: Temperature is increased, forcing genuine exploration rather than iteration on a failing strategy.
Round 5: Hard cut.

The architecture borrows an idea from classical optimization: a system stuck in a local minimum needs a controlled injection of variance to escape. Simulated annealing rests on the same idea, although it starts hot and cools over time, whereas here the temperature is raised late, closer to a reheat. The analogy still applies meaningfully to language model behavior. A model stuck at low temperature on a problem it cannot solve will keep producing the same wrong answer with increasing confidence. The temperature bump is a last resort before human escalation: not guaranteed to work, but cheap enough to always be worth attempting.

For engineers: Set temperature at the API call layer, not within the prompt. The Round 3 system message injection should be prepended, not appended, since where an instruction sits in the prompt can affect how much weight the model gives it. Log temperature values alongside each draft in your run record; when debugging a failed loop, temperature trajectory is often the fastest signal for distinguishing genuine exploration from a stuck model.
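The round schedule above can be expressed as a small lookup that the orchestrator consults before each call. This is a sketch under assumptions: the specific temperature values (0.2 and 0.9) are placeholders, since no numbers are prescribed here, and whether Round 4 keeps the Round 3 intervention is a choice this sketch makes (it does not). The plan is meant to be passed to whatever API client you use; no particular client is assumed.

```python
from dataclasses import dataclass
from typing import Optional

LOW_TEMPERATURE = 0.2    # assumed value for "low"
HIGH_TEMPERATURE = 0.9   # assumed value for "increased"
PIVOT_MESSAGE = (
    "You have been rejected twice. Do not refine your previous approach. "
    "Abandon it and try something structurally different."
)


@dataclass(frozen=True)
class RoundPlan:
    temperature: Optional[float]  # None means no model call is made this round
    system_prefix: str            # text to prepend to the system message, if any
    hard_cut: bool


def plan_round(round_no: int) -> RoundPlan:
    """Deterministic schedule keyed only on loop position."""
    if round_no <= 2:
        return RoundPlan(LOW_TEMPERATURE, "", False)
    if round_no == 3:
        return RoundPlan(LOW_TEMPERATURE, PIVOT_MESSAGE, False)
    if round_no == 4:
        return RoundPlan(HIGH_TEMPERATURE, "", False)
    return RoundPlan(None, "", True)


def build_system_message(base_prompt: str, plan: RoundPlan) -> str:
    """Prepend (not append) the intervention when the plan carries one."""
    if plan.system_prefix:
        return f"{plan.system_prefix}\n\n{base_prompt}"
    return base_prompt


schedule = [plan_round(n) for n in range(1, 6)]
```

Because the plan is a plain value, it is easy to log next to each draft, which gives you the temperature trajectory for free.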

Enterprise constraint: Several corporate API gateways restrict per-call temperature adjustments. If that’s your environment, substitute Prompt Hardening: at Round 3, replace the system prompt entirely, switching from an open-ended coaching prompt to an aggressive few-shot prompt that mandates strict structural compliance. The mechanism differs; the intent is identical.

Instrumentation: The Operational Audit Trail

Treat observability as a first-class design concern, not something to retrofit after the first production incident.

Without it, the circuit breaker layer is a black box. You know it fired. You don’t know which pattern triggered it, how often, on what input types, or whether your thresholds are correctly calibrated. You cannot improve what you cannot see, and in an unattended system, you may not notice what’s wrong until the bill arrives.

Log the following for every loop run: correlation ID, total iterations, exit reason (hard cap, stagnation, success, or human escalation), delta score per round, Agent B rubric result per round, temperature per round, and whether a best-effort draft was delivered or a handoff was triggered.
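One way to pin that list down is a typed run record that serializes to a single JSON line per loop. The field and class names below are illustrative assumptions, not a standard schema; the exit reasons are the four named above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional

EXIT_REASONS = {"hard_cap", "stagnation", "success", "human_escalation"}


@dataclass
class RoundLog:
    round_no: int
    delta_score: Optional[float]  # None in round 1: nothing to compare against
    rubric_passed: int
    rubric_total: int
    temperature: float


@dataclass
class LoopRunRecord:
    correlation_id: str
    exit_reason: str
    delivered: str                # e.g. "final", "best_effort_draft", "handoff"
    rounds: List[RoundLog] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.exit_reason not in EXIT_REASONS:
            raise ValueError(f"unknown exit reason: {self.exit_reason}")

    @property
    def total_iterations(self) -> int:
        return len(self.rounds)

    def to_json(self) -> str:
        """One JSON line per loop run, ready for a log pipeline."""
        payload = asdict(self)
        payload["total_iterations"] = self.total_iterations
        return json.dumps(payload, sort_keys=True)


record = LoopRunRecord(
    correlation_id="run-001",
    exit_reason="stagnation",
    delivered="best_effort_draft",
    rounds=[
        RoundLog(1, None, 3, 4, 0.2),
        RoundLog(2, 0.02, 3, 4, 0.2),
    ],
)
line = record.to_json()
```

Rejecting unknown exit reasons at construction time keeps the downstream dashboard categories from drifting.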

These logs should aggregate into a simple operational dashboard displaying iteration count distribution, circuit break frequency by type, rollback rate, and handoff rate. Within a few hundred runs you’ll have enough signal to tune thresholds from data rather than intuition. This log also becomes your audit trail when a stakeholder asks why a specific high-value output was delivered as a partial draft rather than a completed one.
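The dashboard numbers are simple counts over those records. A minimal sketch, assuming each run is a dict with hypothetical keys exit_reason, total_iterations, rolled_back, and delivered:

```python
from collections import Counter
from typing import Dict, Iterable


def summarize(runs: Iterable[Dict]) -> Dict:
    """Aggregate loop-run records into the four dashboard views."""
    runs = list(runs)
    total = len(runs)
    if total == 0:
        return {"runs": 0}
    exit_counts = Counter(r["exit_reason"] for r in runs)
    iteration_hist = Counter(r["total_iterations"] for r in runs)
    rollbacks = sum(1 for r in runs if r.get("rolled_back"))
    handoffs = sum(1 for r in runs if r.get("delivered") == "handoff")
    return {
        "runs": total,
        "iteration_distribution": dict(sorted(iteration_hist.items())),
        "circuit_breaks_by_type": {
            reason: count for reason, count in exit_counts.items() if reason != "success"
        },
        "rollback_rate": rollbacks / total,
        "handoff_rate": handoffs / total,
    }


sample_runs = [
    {"exit_reason": "success", "total_iterations": 2, "rolled_back": False, "delivered": "final"},
    {"exit_reason": "stagnation", "total_iterations": 3, "rolled_back": True, "delivered": "best_effort_draft"},
    {"exit_reason": "hard_cap", "total_iterations": 5, "rolled_back": True, "delivered": "handoff"},
    {"exit_reason": "success", "total_iterations": 2, "rolled_back": False, "delivered": "final"},
]
summary = summarize(sample_runs)
```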

When the Circuit Breaks: Delivering Something Useful

Stopping the loop prevents runaway cost. But there’s still a user waiting for a result, and “the AI gave up” is not a product experience. Graceful failure is an engineering requirement, not an afterthought.

Save the best draft, not the last one. The orchestrator caches every draft alongside the structured rubric score Agent B assigned it. If Round 2 produced a draft passing 85% of quality criteria and later rounds failed to improve on it, the system rolls back and delivers Round 2. The user gets an imperfect but usable result instead of an error screen. In most business contexts, 85% complete is a workable starting point, not a failure.

That percentage is only meaningful if the rubric behind it is well-defined. Agent B’s checklist needs explicit, binary criteria: not “is this good?” but “does this section contain a pricing breakdown?” Yes or no. The score is passed criteria divided by total criteria.

One example of a criterion that looks binary but isn’t: “is the tone professional?” That’s a judgment call dressed as a checkbox. Two evaluations of the same document will produce different scores, which corrupts your rollback logic and makes your quality threshold meaningless. The test for a valid rubric criterion is whether two different reviewers, given the same document, would independently reach the same answer. If they wouldn’t, decompose the criterion until they would, before it touches production.
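The scoring and rollback logic fits in a few lines once the rubric is a set of yes/no answers. A sketch, with hypothetical criterion names and a tie-breaking rule (earliest round wins) that is an assumption rather than something prescribed above:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScoredDraft:
    round_no: int
    text: str
    rubric: Dict[str, bool]  # criterion -> did it pass?

    @property
    def score(self) -> float:
        """Passed criteria divided by total criteria."""
        return sum(self.rubric.values()) / len(self.rubric) if self.rubric else 0.0

    @property
    def failed_criteria(self) -> List[str]:
        return [name for name, passed in self.rubric.items() if not passed]


def best_draft(drafts: List[ScoredDraft], minimum: float = 0.0) -> Optional[ScoredDraft]:
    """Highest-scoring draft at or above `minimum`; the earliest round wins ties."""
    eligible = [d for d in drafts if d.score >= minimum]
    if not eligible:
        return None
    return max(eligible, key=lambda d: (d.score, -d.round_no))


drafts = [
    ScoredDraft(1, "v1", {"has pricing breakdown": False, "has timeline": True,
                          "names client": True, "under 5 pages": False}),
    ScoredDraft(2, "v2", {"has pricing breakdown": False, "has timeline": True,
                          "names client": True, "under 5 pages": True}),
    ScoredDraft(3, "v3", {"has pricing breakdown": True, "has timeline": True,
                          "names client": True, "under 5 pages": False}),
]
chosen = best_draft(drafts)
```

Here rounds 2 and 3 both score 0.75, so the orchestrator delivers round 2 rather than the last draft, and the failed criteria travel with it for the handoff.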

Communicate failures in human terms. If no draft clears the minimum threshold, the response should be specific and actionable: “We generated 80% of your proposal but encountered a conflict in the pricing section. The draft has been saved and flagged for review.” Structure this as a typed failure object that downstream systems can parse and route, not a string that gets logged and forgotten.
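A typed failure object might look something like the sketch below. The class, its fields, and the wording of the generated message are all assumptions for illustration; in particular, reporting the rubric score as the percentage generated is a simplification.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import List


class FailureKind(str, Enum):
    HARD_CAP = "hard_cap"
    STAGNATION = "stagnation"


@dataclass(frozen=True)
class LoopFailure:
    correlation_id: str
    kind: FailureKind
    best_score: float            # rubric score of the best draft, 0.0 to 1.0
    failed_criteria: List[str]
    draft_saved: bool

    def user_message(self) -> str:
        """Human-readable summary; the structured fields remain the source of truth."""
        pct = round(self.best_score * 100)
        gaps = ", ".join(self.failed_criteria) or "unspecified criteria"
        status = "saved and flagged for review" if self.draft_saved else "not saved"
        return (f"We generated {pct}% of your document but could not resolve: "
                f"{gaps}. The draft has been {status}.")

    def to_json(self) -> str:
        """Machine-readable form for downstream routing."""
        return json.dumps(asdict(self), sort_keys=True)


failure = LoopFailure(
    correlation_id="run-001",
    kind=FailureKind.HARD_CAP,
    best_score=0.8,
    failed_criteria=["pricing section is internally consistent"],
    draft_saved=True,
)
message = failure.user_message()
payload = json.loads(failure.to_json())
```

Downstream systems route on kind and failed_criteria; the sentence is generated from the same fields, so the two cannot disagree.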

Hand off to a human with full context. When the loop breaks, the orchestrator packages the original prompt, the highest-scoring draft, the specific failed rubric criteria from Agent B, and the execution trace. A human reviewer sees precisely where the AI got stuck, fixes that specific gap, and approves. Targeted human judgment applied at the exact point of failure, not a restart from scratch.
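The handoff bundle is the same four items in one structure. A minimal sketch with hypothetical names; the reviewer summary format is an assumption and would normally be shaped by whatever ticketing tool receives it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class HandoffPackage:
    correlation_id: str
    original_prompt: str
    best_draft: str
    best_draft_round: int
    failed_criteria: List[str]
    execution_trace: List[Dict]   # one entry per round, e.g. the per-round log rows

    def reviewer_summary(self) -> str:
        """Short plain-text summary pointing the reviewer at the exact gap."""
        lines = [
            f"Loop {self.correlation_id} stopped after {len(self.execution_trace)} rounds.",
            f"Best draft: round {self.best_draft_round}.",
            "Failed criteria:",
        ]
        lines += [f"  - {criterion}" for criterion in self.failed_criteria]
        return "\n".join(lines)


package = HandoffPackage(
    correlation_id="run-001",
    original_prompt="Draft a proposal for the client renewal.",
    best_draft="...draft text...",
    best_draft_round=2,
    failed_criteria=["has pricing breakdown"],
    execution_trace=[
        {"round": 1, "score": 0.5},
        {"round": 2, "score": 0.75},
        {"round": 3, "score": 0.75},
    ],
)
summary_text = package.reviewer_summary()
```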

The Enterprise Agent Paradox

This handoff introduces the sharpest organizational challenge in agentic system design, and it’s one the architecture alone cannot solve.

If an autonomous agent circuit-breaks on a high-value client document, who gets the notification? What is their SLA to respond? If a human reviewer takes four hours to log in, review, and patch the 15% gap the AI couldn’t close, has your unattended autonomy actually saved the enterprise any time, or did it just shift the operational friction downstream?

The bottleneck of an agentic system is rarely the AI. It is your human routing architecture.

The practical resolution is Omnichannel Exception Routing: do not build custom review interfaces from scratch. Export standard schemas, such as OpenTelemetry events or typed webhooks, that inject circuit-break exceptions directly into your existing enterprise ticketing infrastructure, like ServiceNow, Jira, or whatever queue your team already monitors. The human side of the handoff needs to live where human attention already lives, governed by SLAs that already exist.

Building a parallel review system is an organizational change management problem disguised as an engineering task. Teams that treat it as purely technical consistently underestimate the timeline by a factor of three. Plan for that gap explicitly; the technical handoff can be built in a day, but getting the human side right typically takes weeks and involves stakeholders well outside the engineering team.

The Cost Case for Building This Upfront

A loop running to the hard cap consumes roughly five to ten times the tokens of a successful two-round completion on the same task. At scale, with hundreds or thousands of daily agent invocations, uncontrolled loops don’t just create poor user experiences. They create billing events that compound faster than most teams notice before the monthly statement arrives.

The orchestration layer described here adds minimal compute overhead. The logic is deterministic code running entirely outside the model context. The investment is engineering time, specifically days to weeks to build and calibrate. The alternative is discovering your failure thresholds the hard way, at production volume, on your cloud bill.

Build the circuit breaker before you need it. By the time you need it, you’re already paying for not having it.

The Principle Underneath the Architecture

Every instinct in agent system design pulls toward more intelligence: a smarter critic, a more capable builder, another layer of AI judgment applied wherever the last layer fell short. That instinct is understandable and almost always counterproductive.

The pattern that holds up under pressure is a strict division of labor: AI handles the open-ended reasoning, traditional code enforces the boundaries, existing human workflows handle the exceptions. Not as a fallback for a broken system, but as a deliberate architectural choice in a working one.

There’s a counterintuitive implication worth carrying forward: as AI agents become more capable, the orchestration layer becomes more important, not less. A more capable agent running in an uncontrolled loop causes more damage, faster, at greater cost than a weaker one. The sophistication of the AI and the rigidity of its constraints need to scale together. Every increase in autonomy is an argument for a more robust circuit breaker, not a weaker one.

Autonomy without boundaries isn’t architectural minimalism. It’s just risk with a better interface.

Build the builder. Build the critic. Make sure something that doesn’t think is always in charge of knowing when to stop.

Published by Vijay Vijayasankar
