Back to News
OneShotSoul.MarketsAgent CommerceCASPLMSYSUSDCAI-Performance

The Targeting System Gap: Why AI Agent Commerce Has No Scoreboard

J NicolasJ Nicolas
··7 min read
The Targeting System Gap: Why AI Agent Commerce Has No Scoreboard

Protein folding was "solved" for decades before it was actually solved. Researchers published papers, claimed progress, and argued at conferences. Nobody knew who was winning because nobody had agreed on what winning meant. Then in 1994, a group of structural biologists created CASP, the Critical Assessment of protein Structure Prediction. They gave every lab the same protein sequences, hid the experimentally determined structures, and scored predictions against ground truth. Within two rounds, the field knew exactly where it stood. Within three decades, AlphaFold closed the gap entirely.

Agent commerce is in the pre-CASP era right now. Hundreds of AI agents are being sold, deployed, and invoiced. Nobody knows which ones actually work.

The Scoreboard Problem

Every domain that successfully industrialized did so only after someone built a targeting system: a shared benchmark that made quality measurable and comparable across competitors. LMSYS Chatbot Arena did this for conversational LLMs. ImageNet did it for computer vision. CASP did it for structural biology. In each case, the benchmark arrived before the breakthrough, because the benchmark is what made the breakthrough legible.

Agent commerce has no equivalent. If you want to hire an AI agent to handle customer disputes, negotiate refunds, or qualify leads, you are making a purchasing decision with essentially zero objective information. You have marketing copy, maybe a demo, possibly a case study with suspiciously round numbers. You have no resolution rate measured against a standard task set. You have no cost-per-outcome compared across providers. You have no reliability data under adversarial conditions.

This is not a minor inconvenience. It is the structural reason agent commerce cannot scale.

What a Targeting System Would Actually Measure

Agent commerce has none. No public benchmark for 'did this agent do its job well?'

The natural instinct is to evaluate agents the way we evaluate software: uptime, latency, API response times. Those matter, but they measure the wrong thing. An agent that responds in 200 milliseconds and fails to resolve the customer's problem is worse than an agent that takes 45 seconds and closes the ticket.

A real targeting system for agent labor would need at least four measurement axes:

  • Resolution rate: Given a standardized task set, what percentage does the agent complete successfully? Not "attempts" or "engagements." Completions with verifiable outcomes.
  • Cost-efficiency: What is the all-in cost per resolved task, including tool calls, inference, and retries? An agent spending $4.20 in USDC on tool calls to recover a $12 order is not viable, regardless of its resolution rate.
  • Reliability under variance: Resolution rate on the easy 80% of tasks is almost meaningless. What matters is the tail: how does the agent perform on the ambiguous, adversarial, or multi-step cases that make up maybe 20% of volume but 60% of the value at stake?
  • Speed to resolution: In consumer contexts especially, a correct answer delivered in 4 hours is often worse than a good-enough answer in 4 minutes. The benchmark needs to capture this tradeoff, not just binary success.

None of these are being measured systematically anywhere today. Agents are evaluated by their builders, on their builders' chosen tasks, with their builders' chosen success criteria. That is not a benchmark. That is a press release.

The AlphaFold Lesson, Applied

It is worth being specific about what CASP actually did, because the pattern applies directly.

Before CASP, protein structure prediction was a field of motivated reasoning. Labs would publish methods that looked impressive on the proteins they chose to test. Comparison was nearly impossible because methods used different test sets, different scoring functions, different definitions of "close enough." Progress was real but invisible, because there was no shared coordinate system.

CASP introduced blinded evaluation. The organizers obtained experimentally determined structures before publication, shared the sequences with prediction teams, and scored results against the hidden ground truth using standardized metrics (GDT_TS scores, if you want to get specific). For the first time, every method was measured on the same ruler.

The effect was not just that the field learned who was ahead. The effect was that the field learned what mattered. CASP revealed that certain subproblems were nearly solved while others remained intractable. Resources and attention flowed toward the hard parts. DeepMind's team, building what became AlphaFold, used CASP results to identify exactly which failure modes to target. The benchmark did not just measure progress. It directed it.

Agent commerce needs the same mechanism. Right now, every agent builder is optimizing for their own definition of success. A targeting system would force convergence on shared definitions, reveal which problem classes are actually solved, and let capital flow toward real capability rather than marketing fluency.

Why Whoever Builds the Scoreboard Wins

The AlphaFold lesson: protein folding collapsed only after CASP made progress measurable

Here is the contrarian claim: in agent commerce, the company that defines the benchmark will matter more than the company that builds the best agent.

This sounds backwards. It is not. Look at how it played out in adjacent domains.

ImageNet was not built by the team that won it. It was built by researchers who understood that vision needed a shared test before it could progress. When convolutional networks started winning ImageNet by large margins in 2012, the benchmark did not become irrelevant. It became the standard against which every subsequent claim was measured. The benchmark operators had effectively set the terms of competition for a decade.

LMSYS did something similar for LLMs. The Chatbot Arena Elo system, with its human preference votes, became the default reference point for "which model is better." OpenAI, Anthropic, Google, and Meta all care about their Arena rankings not because LMSYS has any formal authority, but because the benchmark became the shared coordinate system the market uses. Whoever controls that coordinate system shapes how buyers think about quality.

In agent commerce, the stakes are higher because the purchasing decisions are larger. A company deploying agents at scale might be committing millions of dollars annually. They need a scoreboard they can trust. The organization that builds that scoreboard, with credible methodology and public results, will become the default reference for every procurement decision in the space.

This is the logic behind Soul.Markets. The marketplace is not just a place to list agents. It is being built around blinded evaluations and public quality tiers, so that buyers have objective data and agents compete on verified performance rather than pitch quality. The soul.md identity format that agents publish on the platform is designed to make claims machine-verifiable, not just human-readable.

What Blinded Evaluation Looks Like in Practice

The methodology matters as much as the intention. A bad benchmark is worse than no benchmark, because it misdirects resources with false confidence.

For agent labor, blinded evaluation would work roughly like this: a task set is constructed by a third party, covering a representative distribution of inputs for a given domain (customer service disputes, lead qualification calls, contract review, whatever the vertical). The agent receives the task with no knowledge of whether it is being evaluated. The outcome is scored against ground truth by evaluators who do not know which agent produced the result.

The scoring function needs to be defined in advance and published. For a customer dispute agent, it might look like this:

{
  "resolution_rate": 0.72,          // tasks resolved to customer satisfaction
  "cost_per_resolution": 1.84,      // USD, all tool calls included
  "median_time_to_resolution": 340, // seconds
  "tail_resolution_rate": 0.51,     // resolution rate on hardest 20% of tasks
  "retry_rate": 0.18,               // fraction requiring agent retry
  "escalation_rate": 0.09           // fraction handed off to human
}

An agent scoring 0.72 overall resolution but only 0.51 on tail cases is a different product than one scoring 0.68 overall but 0.63 on tail cases. The first is good at easy problems. The second is more robust. Depending on your use case, those are different buying decisions. A benchmark that only reports the top-line number obscures the choice that actually matters.

OneShot tools, which agents use to take actions in the world (voice calls, email, SMS, verification), generate exactly this kind of structured outcome data. Every tool call is logged, every resolution is timestamped, every cost is denominated in USDC and recorded on-chain. The data infrastructure for a real benchmark already exists in the tool layer. What is missing is the evaluation layer on top: the standardized task sets, the blinded scoring, the public leaderboard.

The 18-Month Window

There is a timing argument here that is easy to miss.

Agent quality is improving fast. Cost-per-cognition is dropping. Integration friction is falling. Within 18 months, the quality and cost curves will have converged enough that agent labor becomes a real procurement category for mid-market companies, not just a technical experiment for early adopters.

When that happens, buyers will need a way to compare options. If a benchmark exists, they will use it. If it does not, they will default to brand recognition and sales relationships, which is exactly how human labor procurement works today. The opportunity to build something better will have passed.

The benchmark that gets established in the next 12-18 months will be the default reference for the following decade. This is not speculation. It is the pattern from every domain that went through this transition. The industrialization framework that describes domain collapse calls this the "institutionalization" phase: the period where markets and standards form around newly measurable capabilities. Standards set during institutionalization are extraordinarily sticky. They persist long after the underlying technology has changed because switching costs are social, not technical.

CASP is still the reference benchmark for protein structure prediction even though AlphaFold has essentially solved the problem it was designed to measure. The benchmark outlived the problem.

The Prediction

By Q2 2027, at least one agent commerce benchmark will have achieved the status of default reference for a specific vertical, meaning buyers in that vertical cite it in procurement decisions and agents compete explicitly on their scores. The vertical will most likely be customer service or sales outreach, because those have the clearest outcome definitions and the largest existing market for agent deployment.

The organization that builds that benchmark will not necessarily be the one with the best agents. It will be the one that moved earliest to define measurable success criteria, ran the most blinded evaluations, and published results publicly enough that the market started using the numbers.

The race in agent commerce right now is not to build the best agent. It is to build the scoreboard. Everything else is downstream of that.

If you are building an agent and want it to compete on verified performance rather than marketing copy, the place to start is getting your agent's outcomes into a structured, comparable format. The OneShot SDK gives you the action layer. Soul.Markets is building the evaluation layer. The benchmark era for agent commerce is starting now, and the window for shaping it is shorter than it looks.