From Copilot to Autonomous: The Trust Infrastructure Required at Each Level

J Nicolas

Many AI agent deployments fail the same way: the team gives the agent too much freedom too early, something goes wrong, and they pull it back to a glorified autocomplete. That failure mode has a name: trust calibration. It's solvable if you treat it as an infrastructure problem rather than a model problem.

The question isn't whether your agent is "smart enough" to act autonomously. The question is whether you've built the accountability layer that makes autonomous action safe to allow. Different levels of autonomy require different infrastructure. Get the infrastructure wrong for the level you're operating at, and you'll either be bottlenecked by human review or burned by unchecked agent mistakes.

Here's a practical breakdown of what each level actually requires, what's missing in most implementations, and where the field is heading.

L0: Copilot Mode (Human Approves Every Action)

At L0, the agent suggests and the human executes. Think GitHub Copilot completing your code, or an AI drafting an email you then send manually. The agent has zero authority to cause side effects in the world.

The trust infrastructure required here is minimal. You need output formatted so the human can review it quickly, and perhaps a confidence score so that high-confidence suggestions get approved faster. That's it. The human is the trust layer.

The cost of L0 is attention. Every action requires a human in the loop, which means agent throughput is capped at human throughput. If your agent is suggesting 200 customer service replies per hour and your team can review 40, you have a backlog problem that no amount of model improvement will fix.
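The backlog arithmetic is worth making explicit. Here is a minimal sketch using the figures above; the function name and the eight-hour shift are assumptions for illustration.

```python
def review_backlog(suggested_per_hour: int, reviewed_per_hour: int, hours: int) -> int:
    """Suggestions still waiting for human review after `hours` of operation."""
    surplus_per_hour = max(0, suggested_per_hour - reviewed_per_hour)
    return surplus_per_hour * hours


# Figures from the text: 200 suggestions per hour against 40 reviews per hour.
# The 8-hour shift is an assumed value, not from the text.
print(review_backlog(200, 40, hours=8))  # 1280
```

The queue grows by 160 replies every hour no matter how accurate the suggestions are, which is why the fix is a change of autonomy level rather than a better model.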

L0 is appropriate when actions are irreversible, stakes are high, or the domain is new enough that you don't yet have a reliable signal for when the agent is wrong. It's a starting point, not a destination.

L1: Semi-Autonomous (Agent Acts, Human Reviews Results)

At L1, the agent executes actions but within a narrow, pre-approved envelope. A customer service agent might send templated replies autonomously but flag anything requiring a refund for human review. An email agent might send outreach but log every message for daily audit.

The infrastructure you need here is logging and rollback. Every action the agent takes must be recorded with enough context to reconstruct why it made that decision. Rollback means you can undo the action (or at least compensate for it) when the audit catches a mistake.

Most teams underinvest in L1 logging. They treat it as a debugging tool rather than a trust mechanism. The difference matters: debugging logs capture what happened, trust logs capture what the agent believed and why it was authorized to act. You need the second kind.

A minimal L1 trust record looks like this:

{
  "agent_id": "support-agent-v2",
  "action": "send_email",
  "timestamp": "2025-01-14T09:23:11Z",
  "authorization_scope": "templated_reply_only",
  "trigger": "ticket_id_88241",
  "confidence": 0.91,
  "template_used": "refund_denied_escalation",
  "human_review_required": false,
  "review_window_hours": 24,
  "rollback_possible": true,
  "rollback_method": "send_correction_email"
}

Notice the authorization_scope field. This is what makes L1 auditable: every action references the rule that permitted it. When something goes wrong, you can trace back whether the agent violated its scope or operated within it and the scope was wrong. Those are different problems with different fixes.
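That trace-back can be automated during audit. The sketch below is one way it might look; the SCOPES table, the function name, and the two labels are hypothetical, not part of any particular logging library.

```python
# Hypothetical scope table: scope name -> actions that scope permits.
SCOPES = {
    "templated_reply_only": {"send_email"},
}


def classify_incident(record: dict) -> str:
    """Decide which kind of failure a flagged trust record represents."""
    permitted = SCOPES.get(record["authorization_scope"], set())
    if record["action"] not in permitted:
        # The agent acted outside its rule: fix enforcement.
        return "scope_violation"
    # The agent obeyed its rule and the result was still bad: fix the rule.
    return "scope_too_broad"


flagged = {"action": "send_email", "authorization_scope": "templated_reply_only"}
print(classify_incident(flagged))  # scope_too_broad
```

A record that lacks the scope field cannot be classified at all, which is the practical argument for logging it on every action.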

L2: Autonomous With Guardrails (Agent Transacts Independently)

L2 is where things get interesting and where most production agent deployments actually live today. The agent can initiate real transactions: send emails, make calls, spend money, modify data. It operates within a defined scope but doesn't wait for human approval on individual actions.

The infrastructure jump from L1 to L2 is significant. You now need payment rails, attribution, and per-action accountability that doesn't require a human reviewer.

Payment rails matter because L2 agents consume paid services. An agent that makes phone calls, sends SMS messages, and does web research spends real money on every task. If you're billing customers based on outcomes (like commission on recovered sales), you need to track exactly which agent actions contributed to which outcomes. That's attribution.

This is the problem the x402 protocol solves. Instead of pre-loading credits or managing API keys across a dozen services, an L2 agent can pay per action in USDC at the moment of execution. The payment is the authorization record. If the agent called a number, the payment receipt proves it. No separate logging layer required for the financial accountability piece.
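To make the idea of "the payment is the authorization record" concrete, here is an in-memory simulation of a pay-per-action loop. It is a sketch of the pattern only: it does not reproduce x402's wire format or any OneShot API, and PaidTool, Agent, and the micro-USDC amounts are all hypothetical. The one borrowed fact is HTTP status 402, "Payment Required".

```python
from dataclasses import dataclass, field
from typing import Optional

PAYMENT_REQUIRED = 402  # the standard HTTP status code for "Payment Required"


@dataclass
class PaidTool:
    """Stand-in for a paid service; quotes a price until payment is attached."""
    name: str
    price_micro_usdc: int  # integer micro-units avoid float rounding

    def call(self, payload: str, payment: Optional[int] = None):
        if payment is None or payment < self.price_micro_usdc:
            return PAYMENT_REQUIRED, {"price_micro_usdc": self.price_micro_usdc}
        receipt = {"tool": self.name, "paid_micro_usdc": payment, "payload": payload}
        return 200, receipt


@dataclass
class Agent:
    budget_micro_usdc: int
    receipts: list = field(default_factory=list)

    def use(self, tool: PaidTool, payload: str) -> dict:
        status, body = tool.call(payload)
        if status == PAYMENT_REQUIRED:
            price = body["price_micro_usdc"]
            if price > self.budget_micro_usdc:
                raise RuntimeError("budget exhausted")
            self.budget_micro_usdc -= price
            status, body = tool.call(payload, payment=price)
        self.receipts.append(body)  # the receipt doubles as the audit record
        return body


agent = Agent(budget_micro_usdc=1_000_000)
agent.use(PaidTool("voice_call", 250_000), "call ticket 88241")
print(agent.budget_micro_usdc, len(agent.receipts))  # 750000 1
```

The budget check is the only gate in the loop: the agent never needs a stored credential for the tool, and every successful call leaves a receipt behind.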

Freway's checkout agent Janine operates at L2. When a buyer hesitates in a Shopify checkout, Janine detects the hesitation signal, decides which channel to use (in-checkout chat, email, SMS, WhatsApp, voice, or AI shopping assistant), initiates contact, answers product questions, modifies the cart if needed, and guides the transaction to completion. No human approves each intervention. Janine acts.

The guardrails are scope constraints: Janine only acts on active checkout sessions, only contacts buyers who triggered a hesitation signal, and only modifies carts within defined parameters. The pricing model reinforces the constraint: Freway charges commission only on recovered sales. If Janine spams buyers with irrelevant outreach, she doesn't just annoy people, she destroys the metric she's being paid on. Outcome-based payment is itself a governance mechanism.
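Scope constraints like these are simple to express as code that runs before every action. The following is an illustrative sketch, not Freway's implementation: the CheckoutSession fields and the discount cap are assumptions standing in for whatever "defined parameters" a real deployment uses.

```python
from dataclasses import dataclass

MAX_DISCOUNT_PCT = 10  # hypothetical cart-modification limit


@dataclass(frozen=True)
class CheckoutSession:
    active: bool
    hesitation_signal: bool


def may_contact(session: CheckoutSession) -> bool:
    """Outreach is allowed only on an active session with a hesitation signal."""
    return session.active and session.hesitation_signal


def may_modify_cart(session: CheckoutSession, discount_pct: int) -> bool:
    """Cart changes must also stay within the configured discount limit."""
    return may_contact(session) and 0 <= discount_pct <= MAX_DISCOUNT_PCT


hesitating = CheckoutSession(active=True, hesitation_signal=True)
print(may_contact(hesitating), may_modify_cart(hesitating, 25))  # True False
```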

Under the hood, Janine's multi-channel outreach runs on OneShot's voice, email, and SMS tools. Each action is a paid API call via x402. The payment trail is the audit trail.

L3: Fully Autonomous (Agent Discovers, Negotiates, Executes)

L3 agents don't just execute within a predefined scope. They discover what services exist, evaluate options, negotiate terms, and build multi-step campaigns without a human specifying the approach. The agent is given a goal and a budget, and it figures out the path.

The infrastructure requirements at L3 go beyond payment rails. You need reputation systems so agents can evaluate which services are reliable before committing budget. You need outcome verification so the agent can confirm whether its actions actually worked. And you need governance that operates at machine speed, because a human can't review decisions fast enough to be useful.

Freebot operates at L2-L3. A user tells Freebot they want a refund from their airline. Freebot researches the company's customer service structure, calls the right number, navigates the phone tree, waits on hold, and negotiates with the representative. The user doesn't specify which number to call or what script to follow. Freebot determines the approach. It charges only when it wins, which means its economic incentive is exactly aligned with the user's goal.

What makes this possible isn't just the model quality. It's that Freebot has access to OneShot's tool suite (research, voice, email, SMS, verification) as paid, accountable actions. Each tool call is logged, attributed, and paid for. When Freebot loses (fails to get the refund), it absorbs the tool costs. When it wins, it charges the user. That economic structure is the governance layer.
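The incentive is easier to see with numbers. This sketch settles one task under a win-only pricing model; the costs and the fee are invented for illustration and are not Freebot's actual prices.

```python
def settle_task(tool_costs_cents: list, won: bool, success_fee_cents: int) -> dict:
    """Net position of a win-only agent after one task, in integer cents."""
    spent = sum(tool_costs_cents)
    revenue = success_fee_cents if won else 0
    return {"user_charged": revenue, "agent_net": revenue - spent}


costs = [120, 35, 10]  # hypothetical voice, research, and SMS costs
print(settle_task(costs, won=True, success_fee_cents=2500))
# {'user_charged': 2500, 'agent_net': 2335}
print(settle_task(costs, won=False, success_fee_cents=2500))
# {'user_charged': 0, 'agent_net': -165}
```

Every wasted tool call lowers agent_net on a win and deepens the loss on a failure, so the agent pays directly for careless actions.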

The missing infrastructure at L3, which nobody has fully solved yet, is inter-agent reputation. When an L3 agent wants to hire another agent to do part of a task (say, a research agent to find the right contact, then a negotiation agent to make the call), it needs a way to evaluate that agent's reliability before committing. This is what Soul.Markets is building: a marketplace where agents publish their capabilities and track records, and other agents can evaluate them before transacting.
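As a rough illustration of evaluating a track record before transacting, the sketch below scores candidates with a Laplace-smoothed success rate, so an agent with two perfect jobs does not outrank one with a long, mostly successful history. This is a generic statistical technique chosen for the example, not a description of how Soul.Markets scores agents; the agent names and records are hypothetical.

```python
from typing import Dict, Optional, Tuple


def smoothed_success_rate(successes: int, attempts: int) -> float:
    """Laplace-smoothed success rate; a short record is pulled toward 0.5."""
    return (successes + 1) / (attempts + 2)


def pick_agent(track_records: Dict[str, Tuple[int, int]], min_score: float) -> Optional[str]:
    """Return the best-scoring agent, or None if nobody clears the bar."""
    if not track_records:
        return None
    scores = {name: smoothed_success_rate(s, n) for name, (s, n) in track_records.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


# Hypothetical records: (successful tasks, attempted tasks).
records = {"research-a": (2, 2), "research-b": (90, 100)}
print(pick_agent(records, min_score=0.8))  # research-b
```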

The Management Principle: Directed Autonomy

There's a useful framing here that applies across all four levels. You cannot micromanage intelligence. If you're reviewing every action an L2 agent takes, you've built an expensive L0 system. The goal isn't to eliminate human judgment but to position it correctly.

Directed autonomy means: you define the constraints, you specify the outcome metric, you set the budget, and you let the agent operate. You review at the boundary conditions (when the agent hits an edge case it can't resolve) and at the outcome level (did it work?).

The three things you need to make directed autonomy safe are:

  • Constraints that are machine-enforceable, not just documented. "Only contact buyers in active checkout sessions" is enforceable. "Be professional" is not.
  • A feedback signal the agent can optimize for. Commission on recovered sales is a clean signal. "Customer satisfaction" without a measurement mechanism is not.
  • Accountability that doesn't require human review of every action. Payment receipts, cryptographic logs, and outcome-based pricing all create accountability at machine speed.
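Put together, a directive of this kind can be a small data structure that the runtime checks before each action. The sketch below is a simplified assumption of how that might look: the Directive class, the "execute" and "escalate" labels, and the session_active check are all illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Directive:
    budget_cents: int
    constraints: List[Callable[[dict], bool]]  # machine-enforceable checks


def decide(directive: Directive, action: dict) -> str:
    """Execute inside the boundary; escalate to a human at the boundary."""
    if not all(check(action) for check in directive.constraints):
        return "escalate"
    if action["cost_cents"] > directive.budget_cents:
        return "escalate"
    directive.budget_cents -= action["cost_cents"]
    return "execute"


directive = Directive(
    budget_cents=500,
    constraints=[lambda a: a["session_active"]],  # hypothetical constraint
)
print(decide(directive, {"cost_cents": 200, "session_active": True}))   # execute
print(decide(directive, {"cost_cents": 200, "session_active": False}))  # escalate
print(decide(directive, {"cost_cents": 400, "session_active": True}))   # escalate
```

The third call escalates because only 300 cents of budget remain after the first. Human attention is spent only on the two boundary cases.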

The shift from effort-based to outcome-based payment is one of the clearest indicators that a domain is ready for L2+ autonomy. When Freebot charges per resolution and Freway charges per recovered sale, they're not just choosing a pricing model. They're proving that the accountability layer exists. You can only charge for outcomes if you can reliably attribute outcomes to actions.

What's Still Missing at Each Level

L0 and L1 infrastructure is largely solved. Logging libraries, audit trails, and review queues are commodity tooling. The gap is adoption, not invention.

L2 infrastructure is mostly solved for single-agent, single-domain scenarios. Payment rails via x402 work. Attribution within a defined scope works. The gap is cross-service attribution: when an agent uses five different tools across three vendors to accomplish a task, who gets credit for the outcome? That accounting problem doesn't have a standard answer yet.
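To show why the accounting is contested, here is one naive convention: split the outcome's value in proportion to what each tool cost. It is a sketch of a single possible rule, not a standard; the tool names and amounts are hypothetical.

```python
from typing import Dict


def split_credit(outcome_cents: int, tool_costs_cents: Dict[str, int]) -> Dict[str, int]:
    """Split an outcome's value across tools in proportion to their cost."""
    total_cost = sum(tool_costs_cents.values())
    if total_cost <= 0:
        raise ValueError("need at least one tool with a positive cost")
    shares = {
        tool: outcome_cents * cost // total_cost
        for tool, cost in tool_costs_cents.items()
    }
    # Integer division can leave a remainder; give it to the costliest tool.
    remainder = outcome_cents - sum(shares.values())
    shares[max(tool_costs_cents, key=tool_costs_cents.get)] += remainder
    return shares


# Hypothetical task: a 10,000-cent recovered sale using three paid tools.
print(split_credit(10_000, {"research": 50, "voice": 300, "sms": 150}))
# {'research': 1000, 'voice': 6000, 'sms': 3000}
```

Cost-proportional credit rewards the expensive call even if the cheap research step was what made it succeed. A different rule, such as crediting the last action before the outcome, would produce a different payout from the same log.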

L3 infrastructure has two unsolved pieces. The first is inter-agent reputation, as described above. The second is outcome verification at scale. When an agent claims it resolved a customer complaint, how do you verify that without a human checking? Cryptographic verification of phone call outcomes, email replies, and purchase completions is an open engineering problem. Partial solutions exist (webhook confirmations, payment receipt verification) but a general-purpose outcome verification layer doesn't.
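Webhook confirmation, one of the partial solutions mentioned, typically relies on a signed payload. The sketch below uses HMAC-SHA256 from Python's standard library; the shared secret and the payload fields are hypothetical, and a real service defines its own signing scheme.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 signature the sending service would attach to a confirmation."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_authentic(secret: bytes, body: bytes, signature: str) -> bool:
    """True only if the confirmation body matches its signature."""
    return hmac.compare_digest(sign(secret, body), signature)


secret = b"shared-secret"  # hypothetical key agreed with the service
body = b'{"order_id": "A1", "status": "completed"}'  # hypothetical payload
signature = sign(secret, body)

print(is_authentic(secret, body, signature))                        # True
print(is_authentic(secret, body.replace(b"A1", b"B2"), signature))  # False
```

A valid signature shows who sent the confirmation and that it was not altered. It does not show that the underlying claim is true, which is the part that remains open.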

The teams building in this space right now are making bets on which of these gaps becomes the bottleneck. A reasonable prediction for 2026: cross-service attribution and inter-agent reputation become the competitive moat, because whoever establishes the standard for how agents evaluate and pay each other sets the default for the next decade of agent commerce. The race isn't to build the smartest agent. It's to build the scoreboard everyone else uses.

If you're building an agent today and trying to figure out which level to target, start with the accountability question: can you specify a machine-enforceable outcome metric and a machine-readable constraint set? If yes, you're ready for L2. If not, build those first. The model will be ready before your governance layer is.

The OneShot documentation covers the payment and action layer for L2-L3 implementations. The SDK on GitHub is the fastest way to see how per-action payment works in practice.