Stop Choosing Between MCP and CLIs. Build a Harness That Makes Transport Boring.

Most agent infrastructure debates follow a familiar pattern. Teams argue about the protocol, commit to a direction, build toward it for six months, and then discover the real problem was never the protocol at all.

The MCP vs. CLI argument is exactly that debate. I’ve watched it play out at enough teams now to be confident that the ones losing the most time are not losing it because they picked wrong. They’re losing it because both choices, pushed past local iteration and into production platform infrastructure, hit structural walls that a different protocol choice wouldn’t have prevented. The walls are just in different places.

The question worth asking is not which transport to standardize on. It’s why you’re building systems where the transport choice matters to the agent at all.

How We Got Here

When MCP launched, there were genuine reasons to pay attention. Structured schemas, explicit tool definitions, auth flows that could survive a security review. For teams building in regulated environments it gave the governance layer something real to hold onto.

The early experience of actually running it was rough. Client restarts on configuration changes. Custom JSON-RPC server scaffolding for every capability a team wanted to expose. Debugging tooling that barely existed. For developers trying to move fast on local machines, the overhead was hard to justify when the alternative was handing an agent a bash tool and getting immediate results.

So the pendulum swung. Claude Code and Cursor made raw terminal access feel natural, and a new default emerged: give the agent a shell and stay out of the way. For local prototyping it works well. Then teams try to take it to production.

The CLI path has two failure modes that don’t show up until scale. The first is infrastructure cost. Maintaining stateful container sessions for operations that are fundamentally stateless API calls (reading a document, querying a SaaS platform, triggering a pipeline) is heavy engineering for what you’re getting. The second is the attack surface. Indirect prompt injection is not a theoretical concern. It’s a documented attack class where malicious content in a webpage, a file, or an email body manipulates an agent into executing destructive shell commands. Raw terminal access gives that attack vector a very large target.

The MCP path has its own failure mode at scale. Maintaining separate server processes for every capability, including the ones that are thin wrappers around a single REST API, creates operational overhead that compounds as the integration catalog grows. Teams that went all-in on MCP describe the same symptom: more engineering time on integration scaffolding than on the agent logic that was supposed to be the point.

The underlying reality of both paths is the same. Most CLIs are wrapping APIs. Most MCP servers are proxying APIs. The data is identical. The actions are identical. Only the transport layer differs. Once you see that clearly, the debate collapses. You don’t need to pick the right transport. You need an orchestration layer that treats transport as an implementation detail.

What the Harness Needs to Do

The concept is simple to state. Your orchestration layer accepts any capability format, normalizes it, and hands the model a unified tool catalog. MCP server, OpenAPI spec, GraphQL endpoint, REST API, shell script: the harness ingests all of it. The agent never knows or cares what’s executing underneath.
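A minimal sketch of what "normalizes it" could look like, assuming a hypothetical internal contract called ToolSpec. The adapter function names and the simplified OpenAPI operation shape are illustrative assumptions, not part of any real library. The point it shows is that the transport is recorded for the harness but stripped before anything reaches the model.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical internal contract that every adapter must produce."""
    name: str
    description: str
    parameters: Dict[str, Any]               # JSON-Schema-style argument spec
    transport: str                           # harness-only metadata
    invoke: Callable[[Dict[str, Any]], Any]  # how the harness executes the tool


def from_openapi_operation(operation: Dict[str, Any],
                           call: Callable[[Dict[str, Any]], Any]) -> ToolSpec:
    """Normalize a (simplified) OpenAPI operation object."""
    props = {p["name"]: p.get("schema", {})
             for p in operation.get("parameters", [])}
    return ToolSpec(
        name=operation["operationId"],
        description=operation.get("summary", ""),
        parameters={"type": "object", "properties": props},
        transport="openapi",
        invoke=call,
    )


def from_shell_script(name: str, description: str, arg_names: List[str],
                      run: Callable[[Dict[str, Any]], Any]) -> ToolSpec:
    """Normalize a shell script whose arguments are all plain strings."""
    props = {arg: {"type": "string"} for arg in arg_names}
    return ToolSpec(name, description,
                    {"type": "object", "properties": props}, "cli", run)


def catalog_for_model(tools: List[ToolSpec]) -> List[Dict[str, Any]]:
    """Only name, description, and parameters reach the model."""
    return [{"name": t.name, "description": t.description,
             "parameters": t.parameters} for t in tools]


tools = [
    from_openapi_operation(
        {"operationId": "get_document", "summary": "Read a document",
         "parameters": [{"name": "doc_id", "schema": {"type": "string"}}]},
        call=lambda args: {"id": args["doc_id"]},
    ),
    from_shell_script("verify_cluster_state", "Check cluster health",
                      ["cluster"], run=lambda args: "ok"),
]
catalog = catalog_for_model(tools)
```

Both entries in catalog have the same shape, so the model has no way to tell which one is a REST call and which one is a script.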

Building it well is a serious engineering investment, and I want to be honest about that because most writing on this topic glosses over the cost. If your team is early-stage and running a handful of integrations, the wrong move is to build this now. Use whatever is fastest. But if you are building foundational enterprise infrastructure, this is not a premature optimization. Even if frontier model context windows become effectively infinite or native tool execution improves, a regulated enterprise can never allow an LLM to directly execute unaudited, non-rate-limited system calls. This abstraction layer isn’t for the model. It’s for the engineers who have to govern the platform. You pay for the abstraction either way. The question is whether you design it or inherit it.

There are five concrete engineering problems the harness has to solve. Each one has a failure mode that bites in production and doesn’t announce itself clearly in a local development environment.

The Tool RAG Problem

The naive approach to agent tooling loads every available tool definition into the context window at startup. At small scale this is fine. Once you’re running dozens of integrations with hundreds of tool definitions, it breaks reasoning and burns tokens before the agent has done anything useful.

The standard fix is dynamic tool retrieval: maintain a semantic index of all registered tools and inject only the definitions relevant to the current task. The agent working a code review task gets GitHub and Jira tool schemas. The same agent answering a question about customer health gets Salesforce and analytics schemas. Same harness, radically different effective tool surface.

The problem with naive semantic retrieval is embedding collapse. Tool schemas are short, highly technical, and domain-specific. Vector embeddings of a custom internal shell script named verify-cluster-state.sh and a standard cloud health check tool may land close enough in embedding space that retrieval pulls the wrong one. The agent calls a tool that isn’t there, or worse, calls the wrong one silently.

The harness needs a two-tier routing system. Deterministic, namespace-based matching handles explicit tool references first and bypasses retrieval entirely. Fuzzy semantic retrieval handles everything else, but that vector index needs to be fed by more than just schema text. Historical tool-calling telemetry, runtime system state, and synthetic intent variations all need to flow into the index to make retrieval reliable on the long tail of real user queries. Building the index is one engineering task. Keeping it calibrated is an ongoing operational one.
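The two tiers can be sketched roughly as follows. The tool names are hypothetical, and the token-overlap score is only a stand-in for a real embedding similarity so that the example runs without a vector store. What matters is the order: an explicit namespaced reference short-circuits retrieval entirely.

```python
import re
from typing import Dict, List, Set

# Hypothetical registry: namespaced tool name -> description used for retrieval.
TOOLS: Dict[str, str] = {
    "github.create_review": "post a code review comment on a pull request",
    "jira.get_issue": "fetch a jira issue by key",
    "infra.verify_cluster_state": "internal script that checks cluster node health",
}


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route(intent: str, k: int = 2) -> List[str]:
    # Tier 1: deterministic match on an explicit namespaced tool reference.
    explicit = [name for name in TOOLS if name in intent]
    if explicit:
        return explicit

    # Tier 2: fuzzy retrieval. Jaccard overlap stands in for vector similarity.
    query = _tokens(intent)
    scored = []
    for name, description in TOOLS.items():
        doc = _tokens(description)
        scored.append((len(query & doc) / len(query | doc), name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:k] if score > 0]
```

Calling route("run infra.verify_cluster_state on prod") never touches the fuzzy tier, which is exactly the protection an opaquely named internal tool needs.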

For engineers: The fastest way to detect embedding collapse in your tool index is to build an offline eval harness before you go to production. Take a sample of real or synthetic agent intents, run them through retrieval, and manually verify the results. The right k to measure depends on your injection strategy: if you’re doing single-tool selection, precision@1 is what matters; if you’re injecting a ranked set for the model to choose from, measure precision@3. Either way, pay specific attention to internal tools with opaque names. If the correct tool isn’t ranking at the top for your most common intent patterns, you need namespace-based fallback routing before you ship. Retrofitting it after the first production mismatch is significantly more painful.
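An offline eval along these lines does not need much code. The sketch below assumes each test intent has exactly one expected tool, so "precision at k" is interpreted here as a hit rate: did the expected tool appear in the top k results. The retrieve callable and the sample cases are hypothetical placeholders for your own retriever and intent set.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]                    # (intent, expected tool name)
Retriever = Callable[[str], List[str]]    # returns tool names, best first


def hit_rate_at_k(cases: List[Case], retrieve: Retriever, k: int) -> float:
    """Fraction of intents whose expected tool appears in the top k results."""
    hits = sum(1 for intent, expected in cases if expected in retrieve(intent)[:k])
    return hits / len(cases)


def misses_at_k(cases: List[Case], retrieve: Retriever,
                k: int) -> List[Tuple[str, str, List[str]]]:
    """The failures, with what was retrieved instead, for manual review."""
    return [(intent, expected, retrieve(intent)[:k])
            for intent, expected in cases
            if expected not in retrieve(intent)[:k]]


# Hypothetical canned rankings standing in for a real retriever.
_RANKINGS = {
    "check cluster": ["cloud.health_check", "infra.verify_cluster_state"],
    "open ticket": ["jira.create_issue", "jira.get_issue"],
    "review pr": ["github.create_review"],
}
sample_cases = [
    ("check cluster", "infra.verify_cluster_state"),
    ("open ticket", "jira.create_issue"),
    ("review pr", "github.create_review"),
]


def sample_retrieve(intent: str) -> List[str]:
    return _RANKINGS.get(intent, [])
```

In this toy data the internal script loses the top slot to the generic cloud health check, which is the collapse pattern described above: it passes at k=3 and fails at k=1.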

The Distributed Shell Memory Problem

Treating CLI execution as a stateless API call breaks immediately when an agent does sequential work. The agent runs cd /var/www/html in one turn and npm install in the next. A naive stateless harness runs that second command in a fresh shell, which starts in the container’s default working directory rather than in /var/www/html. The failure is silent if you’re not watching closely, and you’re usually not watching closely in production.

The obvious fix, maintaining long-lived stateful container sessions per agent task, creates a different problem. In an auto-scaling environment, sticky sessions pinned to specific gateway instances become architectural bottlenecks. Under load they become single points of failure.

The approach that holds up is to push statefulness into the execution payload, not the infrastructure. The orchestration engine compiles sequential local actions into atomic, multi-command execution blocks (cd /path && npm install). When an agent emits a terminal intent, the engine injects the full environment state into an ephemeral container token for that specific execution. To prevent Turn 2 from running against a blank slate where previous file modifications are missing, the runtime containers attach to a shared persistent volume anchored to the agent’s session ID, whether that’s a ReadWriteMany PVC in a Kubernetes environment, an NVMe-over-Fabrics pool if your infrastructure supports it, or equivalent shared storage that multiple containers can mount. The compute layer stays stateless and disposable, while the file state follows the session deterministically. To keep this stable under concurrent operations, the harness enforces a serialized FIFO queue per session ID. If an agent requires parallel sub-agents, they spin up isolated sub-sessions with cloned ephemeral volumes rather than competing for a single file lock.
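A rough sketch of the "state in the payload" idea, under simplifying assumptions: the harness tracks only the working directory and a set of environment variables, it recognizes only a bare cd command as a state change, and environment variable names are assumed to be valid shell identifiers. SessionState, compile_block, and record_turn are hypothetical names.

```python
import posixpath
import shlex
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SessionState:
    cwd: str = "/"
    env: Dict[str, str] = field(default_factory=dict)


def compile_block(state: SessionState, command: str) -> str:
    """Prefix a command with the session state so it can run in any fresh container."""
    exports = [f"export {name}={shlex.quote(value)}"
               for name, value in sorted(state.env.items())]
    return " && ".join(exports + [f"cd {shlex.quote(state.cwd)}", command])


def record_turn(state: SessionState, command: str) -> Optional[str]:
    """Absorb a bare `cd` into the state; compile anything else for execution."""
    argv = shlex.split(command)
    if len(argv) == 2 and argv[0] == "cd":
        state.cwd = posixpath.normpath(posixpath.join(state.cwd, argv[1]))
        return None  # nothing to execute, the state change is the whole effect
    return compile_block(state, command)


state = SessionState(env={"NODE_ENV": "production"})
first = record_turn(state, "cd /var/www/html")   # updates state only
second = record_turn(state, "npm install")       # self-contained block
```

Here second is the string "export NODE_ENV=production && cd /var/www/html && npm install", which any disposable container attached to the session volume can run without knowing what happened in earlier turns.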

For engineers: The hard part is the handoff when a long-running agent task spans a gateway restart or a deployment. Store environment variables and the current working directory path in your task queue’s state layer or a dedicated Redis key tied to the session ID, not in gateway instance memory. Any state that resets on process restart will surface as a broken terminal session at the worst possible time.
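One way to keep that state out of gateway memory is to hide it behind a tiny key-value interface. In the sketch below, InMemoryStore is only a stand-in so the example runs on its own; the assumption is that a Redis client exposing get and set with this shape would be substituted in a real deployment. The key prefix and function names are hypothetical.

```python
import json
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> object: ...


class InMemoryStore:
    """Stand-in for an external store; real state must outlive the gateway process."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> object:
        self._data[key] = value
        return True


def _key(session_id: str) -> str:
    return f"agent-session:{session_id}"


def save_session(store: KeyValueStore, session_id: str,
                 cwd: str, env: Dict[str, str]) -> None:
    store.set(_key(session_id), json.dumps({"cwd": cwd, "env": env}, sort_keys=True))


def load_session(store: KeyValueStore, session_id: str) -> Dict[str, object]:
    raw = store.get(_key(session_id))
    if raw is None:
        return {"cwd": "/", "env": {}}  # unknown session: start from defaults
    return json.loads(raw)
```

A gateway that restarts mid-task calls load_session with the same session ID and picks up the working directory and environment exactly where the previous instance left them.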

The Safety Isolation Problem

Text-based safety filtering for shell access does not work in production. Blocking rm -rf at the string-matching layer is the start of a cat-and-mouse game that Linux utilities will win. Destructive commands can be expressed through environment variable expansion, base64-encoded payloads, shell aliases, and combinations of individually benign-looking operations. An agent can reach the same destructive outcome through a dozen paths that pass naive text filters.
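To make the failure concrete, here is a deliberately naive blocklist checked against a few equivalent spellings of the same deletion. Nothing is executed; the code only inspects strings. The path and the blocklist contents are made up for illustration.

```python
import base64

BLOCKLIST = ("rm -rf",)


def naive_filter_allows(command: str) -> bool:
    """Return True when no blocklisted substring appears in the command text."""
    return not any(pattern in command for pattern in BLOCKLIST)


# The encoded form contains no spaces, so the substring check cannot see it.
payload = base64.b64encode(b"rm -rf /tmp/scratch").decode("ascii")

candidates = {
    "literal": "rm -rf /tmp/scratch",
    "split flags": "rm -r -f /tmp/scratch",
    "variable expansion": "X=rm; $X -r -f /tmp/scratch",
    "base64 payload": f"echo {payload} | base64 -d | sh",
}

# Only the literal spelling is caught; every other variant passes the filter.
results = {label: naive_filter_allows(cmd) for label, cmd in candidates.items()}
```

Adding more patterns to the blocklist only moves the boundary. The set of spellings is open-ended, which is why the enforcement has to sit where the operation actually happens.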

The harness has to enforce isolation at the kernel level, not the text layer. Sandboxed execution environments using gVisor or WebAssembly runtimes, combined with kernel-level system call monitoring via eBPF, enforce least-privilege Unix permissions at the OS layer. The destructive operation is blocked before it executes, regardless of how it was expressed.

This matters beyond the obvious security case. It’s the only architecture that lets you apply uniform safety policies across all tool types. A CLI command and an MCP server action can be subject to the same policy enforcement because the enforcement happens below the transport layer. A tool tagged destructive: true requires human-in-the-loop confirmation whether it came in as a shell command or a structured MCP call. You write the policy once and it applies everywhere.
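The "write the policy once" property can be illustrated at the decision layer. This sketch covers only the policy check, not the kernel-level enforcement underneath it, and ToolCall and authorize are hypothetical names. Note that the transport field is carried along but never consulted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCall:
    tool: str
    transport: str      # "cli", "mcp", "openapi", ... (kept for logging only)
    destructive: bool   # tag assigned when the tool was registered


def authorize(call: ToolCall, confirm: Callable[[ToolCall], bool]) -> str:
    """Same rule for every transport: destructive calls need a human yes."""
    if not call.destructive:
        return "allow"
    return "allow" if confirm(call) else "deny"


shell_call = ToolCall("drop_table", transport="cli", destructive=True)
mcp_call = ToolCall("drop_table", transport="mcp", destructive=True)
read_call = ToolCall("get_document", transport="openapi", destructive=False)


def nobody_is_at_the_keyboard(call: ToolCall) -> bool:
    return False
```

With no human approval available, the shell call and the MCP call are denied by the same code path, and the read-only call goes through without asking.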

For engineers: gVisor latency overhead varies significantly by workload type. For syscall-light tasks, the penalty is often 20 to 30 percent. For I/O-heavy tasks like running package managers or grepping large log volumes, the overhead can reach 2x to 5x in practice, and pre-warmed pools only partially offset that. Profile your specific workloads before committing to gVisor across all tool types. A hybrid model (gVisor for high-risk tool categories, standard containers with eBPF monitoring for read-heavy operations) often produces better overall performance than a uniform policy.

The Auth Mapping Problem

CLI tooling handles authentication through local credential files, environment variables, or session state baked into the container. This works on a developer’s laptop. At platform scale it produces a credential management situation that is genuinely hard to audit. When something goes wrong at 2am, “which credential was in scope for that container during that agent turn” is not a question you want to be reconstructing from scattered config files.

The harness handles identity at the platform layer. Enterprise authentication (OIDC, OAuth 2.0, service account tokens) is mapped to each registered endpoint at registration time. When an agent calls a tool, the harness executes a least-privilege token exchange, taking the enterprise identity token and generating an ephemeral token scoped strictly to the resource, path, and duration required for that single action. The agent never touches credentials directly.

The latency problem with this approach is the synchronous round-trip to the identity provider on every tool call. Against AWS STS or HashiCorp Vault at high call volume, that overhead compounds. The harness needs an encrypted token cache local to the execution gateway, holding ephemeral tokens for the duration of the agent’s task session. When the central identity layer detects a policy violation mid-turn, it broadcasts an invalidation event across the fleet. Because a distributed pub/sub broadcast can fail during a network partition, gateways use pessimistic validation: cached tokens carry ultra-short TTLs, typically under sixty seconds, with a deterministic cryptographic check as the fallback if the gateway loses contact with the auth cluster.
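A minimal sketch of the gateway-local cache, with several simplifications stated up front: encryption at rest is omitted, the exchange callable is a hypothetical stand-in for the round-trip to the identity provider, and the 45-second TTL is just one value under the sixty-second ceiling mentioned above. The clock is injectable so the expiry logic can be tested without sleeping.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class CachedToken:
    value: str
    expires_at: float


class TokenCache:
    def __init__(self, exchange: Callable[[str, str], str],
                 ttl_seconds: float = 45.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._exchange = exchange   # (session_id, resource) -> ephemeral token
        self._ttl = ttl_seconds
        self._clock = clock
        self._tokens: Dict[Tuple[str, str], CachedToken] = {}

    def get(self, session_id: str, resource: str) -> str:
        key = (session_id, resource)
        now = self._clock()
        hit = self._tokens.get(key)
        if hit is not None and now < hit.expires_at:
            return hit.value        # still fresh: no round-trip
        value = self._exchange(session_id, resource)
        self._tokens[key] = CachedToken(value, now + self._ttl)
        return value

    def invalidate_session(self, session_id: str) -> None:
        """Handler for a fleet-wide invalidation event."""
        for key in [k for k in self._tokens if k[0] == session_id]:
            del self._tokens[key]
```

The short TTL is what makes a missed broadcast survivable: even if invalidate_session never fires on a partitioned gateway, the cached token stops being served within one TTL window.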

The TTL creates a real problem for long-running agent tasks. A ten-minute task will cycle through several token refreshes mid-turn. A process’s environment is copied when the process is launched and cannot be updated from outside afterward, so a credential injected as an environment variable at startup goes stale: after the first refresh, the running process is still presenting the expired token. The harness has to handle those refreshes transparently using file-mounted tokens or a local loopback credential proxy inside the sandbox, updating the secret on disk dynamically so active processes can read the fresh token without blocking the agent or passing credentials through the execution context where the model could see them. This needs explicit design, not an afterthought.
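For the file-mounted variant, the detail that matters is that a reader must never observe a half-written token. A common way to get that on POSIX filesystems is to write to a temporary file in the same directory and then rename it over the target. The sketch below assumes that approach; the function names and file layout are hypothetical.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def write_token_atomically(path: Path, token: str) -> None:
    """Replace the token file in one step so readers see old or new, never partial."""
    # mkstemp creates the file readable and writable by the owner only.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".token-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(token)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, path)   # atomic rename within the same directory
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def read_token(path: Path) -> str:
    """What a tool inside the sandbox does on every request: re-read the file."""
    return path.read_text().strip()
```

The refresher calls write_token_atomically each time a new ephemeral token is issued. Processes inside the sandbox call read_token per request instead of caching the value at startup, and the token never appears in the command line or the model’s context.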

For engineers: The audit trail this architecture produces is one of the things that makes the difference in enterprise security reviews. Every tool call is authenticated through a single identity layer, which means every tool call appears in a single audit log regardless of whether the underlying tool is an MCP server, a shell command, or a direct REST call. If you are building for enterprise deployment, design the audit schema first and work backward to the token exchange architecture. The schema you choose determines what questions you can answer after an incident, and you will have incidents.
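As a starting point for "design the audit schema first", here is one possible record shape. Every field name is an assumption chosen to answer the incident questions raised in this section (who, which credential scope, which tool, over which transport); your own schema will likely differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    timestamp: str           # ISO 8601, UTC
    trace_id: str            # ties the record to the agent session trace
    session_id: str
    principal: str           # enterprise identity the token was exchanged for
    tool: str
    transport: str           # "mcp", "cli", "rest", ...
    adapter_version: str
    token_scope: str         # resource and path the ephemeral token covered
    token_ttl_seconds: int
    outcome: str             # "ok", "denied", "error"


def to_log_line(record: AuditRecord) -> str:
    """One JSON object per line, with stable key order for diffing and grep."""
    return json.dumps(asdict(record), sort_keys=True)


example = AuditRecord(
    timestamp="2025-01-01T02:00:00Z",
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    session_id="task-123",
    principal="svc-agent@example.com",
    tool="get_document",
    transport="mcp",
    adapter_version="1.4.0",
    token_scope="docs:/reports/q4:read",
    token_ttl_seconds=45,
    outcome="ok",
)
line = to_log_line(example)
```

Because the transport is just another field, a query for "everything this principal touched during that turn" returns shell commands, MCP calls, and REST calls in the same result set.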

The Complexity Centralization Problem

This is the one that’s worth being direct about, because it doesn’t appear in most architectural writeups on this topic.

Building a transport-agnostic harness does not eliminate integration complexity. It centralizes it. You are trading distributed technical debt spread across twelve ad-hoc integrations for a single, well-designed system that you now own and must maintain. If the harness has a bug, everything breaks at once. If the schema normalization layer has an edge case, it affects every tool in the catalog simultaneously.

The way to manage this is strict interface contracts between the translation layer and the core orchestration logic. Each transport adapter (OpenAPI, MCP, GraphQL, raw CLI) owns the translation from its native format into the harness’s internal semantic contract. To minimize blast radius, these adapters must be decoupled from the core routing logic as independently deployable, isolated micro-modules. When a specific upstream tool updates its specification, only that micro-adapter changes and rolls out. If an edge case brings down a single adapter, it degrades that capability in isolation and leaves the orchestration core unaffected.

Adapter versioning is necessary but not sufficient. The harder problem is runtime schema drift: the OpenAPI adapter gets updated to v2 of a tool’s spec while the retrieval index was built against v1. The agent gets stale tool definitions injected and either calls with wrong parameters or gets a tool call error that’s difficult to trace back to an index freshness problem. The retrieval index needs to be versioned alongside the adapter registry, with a reconciliation job that detects mismatches and either reindexes or flags the discrepancy before it surfaces as a production failure.
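The reconciliation job reduces to a comparison between two maps, assuming both the adapter registry and the retrieval index can report a schema version per tool. The function name and the three result buckets are illustrative choices.

```python
from typing import Dict, List


def reconcile(registry: Dict[str, str], index: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare tool -> schema version in the adapter registry and the retrieval index."""
    return {
        # Indexed against an older (or newer) spec than the adapter now serves.
        "reindex": sorted(t for t in registry
                          if t in index and index[t] != registry[t]),
        # Registered but not retrievable at all.
        "missing_from_index": sorted(t for t in registry if t not in index),
        # Still retrievable even though no adapter serves it anymore.
        "orphaned_in_index": sorted(t for t in index if t not in registry),
    }


registry_versions = {"billing.get_invoice": "v2", "jira.get_issue": "v1",
                     "crm.get_account": "v1"}
index_versions = {"billing.get_invoice": "v1", "jira.get_issue": "v1",
                  "legacy.export": "v1"}
report = reconcile(registry_versions, index_versions)
```

Here billing.get_invoice lands in the reindex bucket, which is the v1-versus-v2 drift described above. Running the comparison on every adapter rollout, and failing the rollout when any bucket is non-empty, turns a hard-to-trace runtime error into a deploy-time check.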

This also means the harness is the right place to invest in observability. Log every tool call with the transport type, the adapter version, the schema version used for retrieval, and the execution result. Loose application logs are not enough here. Tracing an error across an abstraction layer, a local loopback auth proxy, and a gVisor sandbox requires an explicit distributed tracing framework using OpenTelemetry context propagation. The agent session trace ID needs to be injected as an immutable parent span across the entire execution boundary, from the LLM orchestration layer down through the sandbox kernel. Without that telemetry thread, debugging transient latency spikes or timeouts under load is close to impossible.
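To show what "the same trace ID across the whole boundary" means mechanically, the sketch below builds identifiers in the W3C Trace Context traceparent format (version, 32-hex trace ID, 16-hex parent span ID, flags), which OpenTelemetry propagators commonly use. It is a standard-library illustration of the propagation idea only; in practice the OpenTelemetry SDK would generate and carry these values, and the helper names here are hypothetical.

```python
import secrets


def new_traceparent() -> str:
    """Root context for an agent session: fresh trace ID and span ID, sampled."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def child_traceparent(parent: str) -> str:
    """New span ID for the next hop; the trace ID is carried over unchanged."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(traceparent: str) -> str:
    return traceparent.split("-")[1]


session = new_traceparent()                 # LLM orchestration layer
auth_proxy = child_traceparent(session)     # loopback credential proxy
sandbox = child_traceparent(auth_proxy)     # command running inside the sandbox
```

All three values share one trace ID, so a latency spike observed in the sandbox can be joined back to the exact agent turn that caused it.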

The Structural Trade-offs

Worth being explicit about what you’re accepting when you build this.

Raw terminal access lets an agent chain complex operations through standard Unix pipes in a single turn. The harness breaks that open-ended composability by design. You’re choosing predictability and policy enforcement over multi-stage shell piping. That’s a conscious trade, not a limitation to engineer around.

You also pay a token cost for version parity. Models carry pre-existing training on common CLIs like kubectl and aws-cli, which means raw terminal execution costs nothing in context overhead for well-known stable tools. But that same training data is the source of hallucinated flag syntax on less common tools or tools that have changed their interface since the training cutoff. Dynamic tool injection via RAG costs tokens on every turn, but it ensures the agent is working from your actual current spec rather than a model’s best guess at it. For internal tooling with no training signal at all, the RAG cost is not a tradeoff. It’s the only way to give the agent accurate tool knowledge.

The distributed vs. centralized failure mode tradeoff is the one teams underestimate most. Fragmented per-team integrations accumulate quietly. The harness breaks loudly, all at once, when something goes wrong. That’s actually the better failure mode for a platform you’re operating at scale, but it requires a different kind of operational discipline than most teams have built before they need it.

The Multi-Tenancy Problem

Most architectural writing on harnesses assumes a single team owns the full tool catalog. At enterprise scale that assumption breaks quickly. When multiple product teams share the same harness, you have three problems that don’t exist in the single-tenant case.

Tool catalog isolation: Team A’s internal Salesforce integration should not be retrievable by Team B’s agent. The retrieval index needs tenant-scoped partitioning with strict, tenant-isolated vector namespaces. Naming conventions drift and get violated; partitioning has to be enforced at the infrastructure layer, not in a shared index where a poorly named tool from one tenant can pollute retrieval results for another. Before tools hit the model’s context window, the translation layer should prepend deterministic tenant prefixes to the tool names themselves (for example, tenantA_get_user). Without explicit namespacing, cross-tenant semantic drift will cause embedding collapse and your dynamic tool RAG will pull the wrong endpoint silently.

Auth boundary enforcement: The least-privilege token exchange described earlier works cleanly in a single-tenant model where one enterprise identity maps to one set of permitted endpoints. In a multi-tenant harness, the identity layer has to enforce that Team A’s agent can only exchange tokens for endpoints registered under Team A’s tenant scope. Cross-tenant tool calls need to be impossible by construction, not by policy documentation that someone has to remember to follow.

The same principle extends to execution isolation. Pre-warmed sandbox pools should remain generic at the infrastructure layer, with tenant identity, volumes, and cryptographic keys hot-swapped in at the moment of execution. This keeps operational readiness decoupled from idle compute costs, and ensures that no tenant’s execution context bleeds into another’s even under concurrent load.
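The catalog and auth boundaries above can be sketched together, since both come down to keeping every lookup inside a tenant’s own partition. TenantCatalog and its methods are hypothetical, the overlap score again stands in for vector similarity, and the returned token is a placeholder dictionary, not a real credential.

```python
import re
from typing import Dict, List, Set


class CrossTenantAccess(Exception):
    """Raised when a tenant requests a tool outside its own partition."""


def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TenantCatalog:
    def __init__(self) -> None:
        # One partition per tenant; no code path searches across partitions.
        self._partitions: Dict[str, Dict[str, str]] = {}

    def register(self, tenant: str, tool: str, description: str) -> str:
        qualified = f"{tenant}_{tool}"   # deterministic tenant prefix
        self._partitions.setdefault(tenant, {})[qualified] = description
        return qualified

    def search(self, tenant: str, query: str, k: int = 3) -> List[str]:
        partition = self._partitions.get(tenant, {})
        q = _tokens(query)
        scored = sorted(((len(q & _tokens(desc)), name)
                         for name, desc in partition.items()),
                        key=lambda pair: (-pair[0], pair[1]))
        return [name for score, name in scored[:k] if score > 0]

    def exchange_token(self, tenant: str, qualified_tool: str) -> Dict[str, object]:
        if qualified_tool not in self._partitions.get(tenant, {}):
            raise CrossTenantAccess(f"{tenant} cannot call {qualified_tool}")
        return {"tenant": tenant, "tool": qualified_tool, "ttl_seconds": 60}


catalog = TenantCatalog()
catalog.register("tenantA", "get_user", "look up a user record in salesforce")
catalog.register("tenantB", "get_user", "look up a user record in the billing system")
```

Two tenants can register a tool with the same base name and nearly identical descriptions without ever competing in retrieval, and a token request for another tenant’s tool fails in code, not in a policy document.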

I’ll be direct: multi-tenancy is where the complexity centralization problem compounds most sharply. The patterns above are tractable for a single team. Adding tenancy boundaries across retrieval, auth, execution isolation, and observability is a meaningful additional engineering surface. Teams building toward enterprise platform use should treat multi-tenancy as a first-class design constraint from day one, not a feature to add later.

What Comes Next

The broad adoption of MCP as a packaging standard was a genuinely useful development. Getting the industry to agree on a common way to describe and expose agent tools is valuable infrastructure. But the protocol was never the destination. It’s a packaging format. The destination is an execution environment sophisticated enough to treat packaging as an implementation detail.

The teams I’ve seen pull ahead in agent infrastructure aren’t the ones who made the right protocol call in 2024. Most of them made the same messy choices everyone else did. What separated them was how quickly they made the transport question boring. Once the harness exists, the protocol debate becomes a one-afternoon integration task instead of a six-month architectural commitment. That’s the actual leverage: not the architecture itself, but the organizational freedom it creates. The engineers who were arguing about MCP vs. CLI are now building agent logic. That’s where the compounding starts.

Published by Vijay Vijayasankar
