Trade Confirmation Exception Identification
Trade confirmation matching is a high-volume, high-consequence operations workflow: a missed discrepancy between a counterparty confirmation and an internal record can become a settlement failure, a misbooked position, or a regulatory exception. This study evaluates whether frontier language models can perform first-pass exception identification under production conditions: detecting whether a discrepancy exists, classifying it within a controlled taxonomy, and quantifying the resulting exposure. Model outputs are graded by a deterministic scoring engine rather than by model judgment. Two prompt versions (v1.0 and v1.1) were evaluated on Gemini 2.5 Pro and Gemini 2.5 Flash. The comparison revealed both a benchmark specification artifact in escalation scoring and a genuine, persistent model weakness in financial exposure arithmetic.
Motivation
General AI leaderboards measure capability on academic tasks. They do not tell an operations leader whether a model can be trusted on the specific work a middle-office desk performs every day. AAL-R-2026-001 is built to answer that narrower, more useful question for one concrete workflow.
Method
Each case pairs a counterparty confirmation with an internal record. Ground truth is constructed before any model is evaluated. Models are scored blind to ground truth by a deterministic engine that grades detection, category, field, and exposure arithmetic against per-case tolerances — with order-independent matching for dual-exception cases and explicit false-positive accounting on clean cases. Each model is run three times to quantify non-determinism; all accuracy figures carry Wilson 95% confidence intervals.
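To make the scoring description concrete, the following is a minimal sketch of two pieces mentioned above: order-independent matching for multi-exception cases, and the Wilson 95% interval. It is not the study's scoring engine. The function names, the dictionary keys (category, field, exposure), and the example category labels are assumptions made for illustration, and the sketch collapses the separately graded dimensions into a single pass/fail per case for brevity.

```python
import math
from itertools import permutations


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def exception_matches(pred, truth, tolerance):
    """Hypothetical per-exception check: exact category and field,
    exposure within an absolute per-case tolerance."""
    return (
        pred["category"] == truth["category"]
        and pred["field"] == truth["field"]
        and abs(pred["exposure"] - truth["exposure"]) <= tolerance
    )


def grade_case(predicted, expected, tolerance):
    """Pass if the predicted exceptions match the expected ones in any order.

    A clean case has expected == []; any prediction on it is a false
    positive and fails the length check.
    """
    if len(predicted) != len(expected):
        return False
    return any(
        all(exception_matches(p, t, tolerance) for p, t in zip(perm, expected))
        for perm in permutations(predicted)
    )


if __name__ == "__main__":
    # Illustrative data only; labels and amounts are made up.
    truth = [
        {"category": "PRICE", "field": "price", "exposure": 1250.0},
        {"category": "QUANTITY", "field": "quantity", "exposure": 400.0},
    ]
    pred = [
        {"category": "QUANTITY", "field": "quantity", "exposure": 400.0},
        {"category": "PRICE", "field": "price", "exposure": 1250.4},
    ]
    print(grade_case(pred, truth, tolerance=0.5))
    print(wilson_interval(199, 200))
```

Trying every permutation is only reasonable because cases carry at most a couple of exceptions; a larger exception set would call for a proper assignment step. The Wilson interval is a sensible choice for accuracies near 100%, where the simpler normal-approximation interval can extend past 1.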
Key findings
Both Gemini 2.5 Pro and Gemini 2.5 Flash achieve ~99.5% exception detection accuracy under well-specified prompts. A 19-point gap in escalation accuracy between Pro and Flash under v1.0 was traced to a missing operational policy in the prompt, not to model capability, and closed to near zero after a one-paragraph fix. Exposure arithmetic, at ~75% accuracy, is a genuine weakness shared by both models. It persists despite explicit formula guidance, which points to a real limitation in financial P&L reasoning across currency pairs.