
Can AI Read Trade Confirmations? We Tested It.

AI Alpha Labs·Jul 1, 2026

We ran GPT-4o, Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4.6 on the same 250-case trade confirmation benchmark. Here's what we found, and what we didn't expect.

Every day, operations teams at banks, hedge funds, and asset managers spend hours matching trade confirmations against internal records. A counterparty sends a confirmation. Your system has a record. Do they agree? If not, what's wrong, how bad is it, and who needs to know?

It's repetitive, high-stakes, and exactly the kind of task people assume AI should be able to handle. We decided to find out if it actually can.

What We Built

AAL-D-001 is a 250-case benchmark designed to measure how well large language models perform trade confirmation exception identification — the specific task of comparing a counterparty confirmation against an internal record, spotting discrepancies, classifying them, and deciding what to do.

The dataset covers seven asset classes (equities, fixed income, FX forwards, listed futures, interest rate swaps, options, and credit), three difficulty levels, and 16 exception types ranging from price mismatches to booking entity errors to duplicate submissions. 76 of the 250 cases have no exception at all; they are included to measure whether models fabricate problems that aren't there.

We evaluated four models across three labs: GPT-4o (OpenAI), Gemini 2.5 Pro and Flash (Google DeepMind), and Claude Sonnet 4.6 (Anthropic). Each case was run three times to account for model non-determinism, giving us 3,000 scored observations in total (250 cases × 3 runs × 4 models). All results are reported with Wilson 95% confidence intervals.
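For readers who have not used it, the Wilson interval mentioned above can be sketched in a few lines. This is a minimal illustration using the standard Wilson score formula with z = 1.96; the function name and the example counts are our own and are not taken from the benchmark's evaluation code.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Illustrative counts only, not figures from the benchmark.
low, high = wilson_interval(successes=50, n=100)
print(f"{low:.4f} to {high:.4f}")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when accuracy is close to 100%, which is where most of the results here sit.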

The Headline Numbers

All four models are remarkably good at the core detection task.

Gemini 2.5 Pro — 99.6% · Gemini 2.5 Flash — 99.2% · Claude Sonnet 4.6 — 98.8% · GPT-4o — 97.9%

Three labs. Four models. All within two points of each other at or near ceiling. False positive rates are near zero across all conditions — the models aren't hallucinating exceptions that aren't there. That's a strong result. But how we got there is the more interesting story.

The Finding That Almost Wasn't

Our initial results showed something puzzling: Gemini Pro and Flash were nearly identical on detection, but Pro's escalation accuracy was 75.7% versus Flash's 94.8% — a 19-point gap.

That seemed significant. Was Pro systematically under-escalating? Was it more conservative? Was this a meaningful difference in how the two models reason about operational risk?

We dug in. It turned out to be none of those things.

Looking at the dataset, the ground truth escalation rule is almost entirely mechanical: every exception type except commission discrepancies and likely duplicate trades should be escalated. EXC-PRICE? Always escalate. EXC-SDATE? Always escalate. EXC-BOOK? Always escalate. The rule is nearly categorical.
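Written as code, the rule described above is only a few lines. This is a sketch under assumptions: "EXC-COMM" and "EXC-DUP" are hypothetical codes standing in for commission discrepancies and likely duplicate trades (the actual identifiers are not given here), and the sketch ignores whatever edge cases make the real rule "nearly" rather than fully categorical.

```python
# Exception codes that do NOT require escalation under the rule described above.
# "EXC-COMM" and "EXC-DUP" are hypothetical placeholders for the commission
# discrepancy and likely-duplicate codes.
NON_ESCALATING = frozenset({"EXC-COMM", "EXC-DUP"})


def escalation_required(exception_codes: list[str]) -> bool:
    """Escalate if any detected exception falls outside the exempt set."""
    return any(code not in NON_ESCALATING for code in exception_codes)


print(escalation_required(["EXC-SDATE"]))             # settlement date break
print(escalation_required(["EXC-COMM"]))              # exempt type only
print(escalation_required(["EXC-COMM", "EXC-BOOK"]))  # exempt plus non-exempt
print(escalation_required([]))                        # clean case
```

A rule this mechanical cannot be recovered from severity or dollar impact alone, which is why a model that was never told about it has to guess.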

But that rule was nowhere in our prompt. We just asked for escalation_required as a freeform boolean with no criteria.

Pro, being more conservative about inferring unstated rules, tended to look at apparent severity and dollar impact and reason: this is a 1-day settlement date mismatch with no dollar exposure and severity 3 — probably doesn't need escalation. That's a reasonable inference from the information it had. It was just wrong about the convention we were using.

Flash happened to guess closer to the ground truth rule, which made it look better — but that was luck, not capability.

We added a single paragraph to the prompt explaining the escalation policy. Pro jumped from 75.7% to 99.0%. Flash went from 94.8% to 98.1%. The gap closed almost entirely.

The lesson: a 19-point model gap disappeared with one paragraph. Any benchmark that doesn't test prompt sensitivity risks mistaking specification gaps for capability differences.
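One simple way to make this check routine is to report the model gap per prompt variant rather than as a single number. The sketch below uses the escalation accuracies quoted above; the variant labels and function name are our own and purely illustrative.

```python
# Escalation accuracy (%) as reported above, keyed by prompt variant and model.
escalation_accuracy = {
    "no_policy": {"pro": 75.7, "flash": 94.8},
    "with_policy": {"pro": 99.0, "flash": 98.1},
}


def model_gap(scores: dict[str, float]) -> float:
    """Flash minus Pro, in percentage points, rounded to one decimal."""
    return round(scores["flash"] - scores["pro"], 1)


gaps = {variant: model_gap(scores) for variant, scores in escalation_accuracy.items()}
for variant, gap in gaps.items():
    print(f"{variant}: {gap:+.1f} points")
```

A gap that shrinks from about 19 points to under one point when the prompt changes is a signal that the original difference came from the specification, not the models.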

The Claude Surprise

When we ran Claude Sonnet 4.6 on the same prompt, we got a 67.5% false positive rate on the first pass: Claude was flagging exceptions on roughly two out of every three clean cases.

Every single false positive was the same thing: EXC-STAT. Claude was treating the internal status field (showing 'Unconfirmed') as an exception because it appeared in the internal record but not the counterparty confirmation.

This is an expected asymmetry by design. Internal workflow fields like status, book, account, and broker are intentionally absent from counterparty confirmations — they're internal system fields, not confirmation fields. Gemini models handled this intuitively. Claude needed to be told explicitly.

One sentence fixed it entirely: 'Fields that appear only in the internal record and are absent from the counterparty confirmation are expected asymmetries, not exceptions.' After that, Claude's false positive rate dropped to 0.0%.
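The same convention can be expressed deterministically: compare only the fields that both records carry. This is not how we fixed the issue (the fix was the prompt sentence above); it is a sketch of the rule itself, and the field names and values are hypothetical.

```python
def field_mismatches(internal: dict, confirmation: dict) -> dict:
    """Return differing values for fields present in BOTH records.

    Fields that appear on only one side (for example internal workflow
    fields such as status or book) are skipped, not reported as breaks.
    """
    shared = internal.keys() & confirmation.keys()
    return {
        field: (internal[field], confirmation[field])
        for field in sorted(shared)
        if internal[field] != confirmation[field]
    }


# Hypothetical records: the internal side carries extra workflow fields.
internal = {"price": "101.25", "quantity": 5000, "status": "Unconfirmed", "book": "EQ-01"}
confirmation = {"price": "101.25", "quantity": 5000}
print(field_mismatches(internal, confirmation))
```

Note that this sketch also skips fields present only on the counterparty side, which may not be what a real workflow wants.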

This is a real difference in model behavior — Claude has a stronger prior toward treating field-presence asymmetry as a discrepancy. Whether that's a weakness or a feature depends on the deployment context.

The Finding That Stuck

One weakness survived every prompt fix: exposure arithmetic.

For FX forward price breaks, the correct exposure is notional × absolute rate difference. A USD/KRW NDF with $8M notional and a rate break of 0.05 has 8,000,000 × 0.05 = $400,000 of exposure.
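The arithmetic itself is a single multiplication, which is what makes the failure notable. A minimal sketch of the formula as stated above: the two rates are hypothetical values chosen only so that they differ by 0.05, and the result follows the convention used here of reporting the product directly as the exposure figure.

```python
from decimal import Decimal


def fx_forward_exposure(notional: Decimal, internal_rate: Decimal, confirmed_rate: Decimal) -> Decimal:
    """Exposure = notional x |rate difference|, per the formula stated above."""
    return notional * abs(internal_rate - confirmed_rate)


# Hypothetical rates; only their 0.05 difference matters for the example.
exposure = fx_forward_exposure(
    notional=Decimal("8000000"),
    internal_rate=Decimal("1350.00"),
    confirmed_rate=Decimal("1350.05"),
)
print(exposure)
```

Using `Decimal` rather than floats avoids binary rounding noise when the computed figure is compared against a reference value.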

Both Gemini models returned $290.28 — across all three runs, independently, on the same case. Claude made similar errors on a wider range of cases. Even after we added the exact formula to the prompt, accuracy improved by only 2–3 points for Gemini and didn't move for Claude.

This is not a specification gap. It's a genuine limitation in financial arithmetic reasoning that survives explicit instruction. Gemini reaches ~76% exposure accuracy; Claude reaches 63.5%. The gap between them is real, and neither number is good enough for production exposure reporting.

The Dual-Exception Gap

One result surprised us: dual-exception recall.

Both Gemini models catch both exceptions in dual-exception cases 100% of the time. Claude catches both 87.4% of the time — meaning in roughly 1 in 8 complex cases, Claude identifies the primary exception but misses a second simultaneous break.

In operations, a missed secondary exception means incomplete exception reporting. A counterparty break may get partially resolved while a second material discrepancy goes unnoticed. This is operationally significant for complex derivative cases.
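To make the metric concrete, here is one way dual-exception recall could be scored. The definition is an assumption for illustration (a case counts only if every true exception appears in the model's output), and the example cases are invented, not benchmark data.

```python
def dual_exception_recall(cases: list[tuple[set[str], set[str]]]) -> float:
    """Share of dual-exception cases where every true exception was predicted.

    Each case is (true_codes, predicted_codes); only cases with exactly two
    true exceptions are counted.
    """
    dual = [(truth, pred) for truth, pred in cases if len(truth) == 2]
    if not dual:
        raise ValueError("no dual-exception cases supplied")
    caught_both = sum(1 for truth, pred in dual if truth <= pred)
    return caught_both / len(dual)


# Illustrative cases, not benchmark data.
cases = [
    ({"EXC-PRICE", "EXC-SDATE"}, {"EXC-PRICE", "EXC-SDATE"}),  # both caught
    ({"EXC-PRICE", "EXC-BOOK"}, {"EXC-PRICE"}),                # second break missed
    ({"EXC-SDATE"}, {"EXC-SDATE"}),                            # single exception, ignored
]
print(dual_exception_recall(cases))
```

Under this definition a model gets no partial credit for finding the primary break alone, which matches the operational concern: a half-reported case is still an incomplete exception report.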

What This Means for Practitioners

If you're thinking about using LLMs in your trade confirmation workflow, here's what this benchmark suggests:

Where AI is ready now: exception detection and classification. All four models identify whether a problem exists and what kind of problem it is with high reliability. This is the highest-value step — the one that currently requires a human to read every confirmation.

Where to be careful: exposure calculations, particularly for FX instruments. Treat model-generated exposure figures as a starting point, not a final answer. And if you're running Claude specifically, test carefully on clean cases with asymmetric internal fields.

Gemini vs Claude for this task: Gemini shows higher dual-exception recall (100% vs 87.4%) and higher exposure accuracy (~76% vs 63.5%). For production exception triage, Gemini Flash is the practical choice: its detection accuracy is statistically indistinguishable from Pro's at lower cost, and it beats Claude on dual-exception recall and exposure accuracy on this specific task.

Pro vs Flash: After proper prompt specification, there's no statistically significant accuracy difference between them on this task. Flash is the cost-efficient choice for production deployment.

What's Next

AAL-D-001 is the first benchmark in the AI Alpha Labs evaluation suite. We're building toward a broader set of capital markets operations tasks — margin call processing, reconciliation break analysis, settlement failure prediction — with the same methodology: multi-run evaluation, Wilson confidence intervals, and explicit separation of prompt specification effects from model capability.

The full dataset, evaluation code, scorecards, and methodology are published at the AI Alpha Labs research portal.
