AAL-RS-002 · Published

Gemini 2.5 Pro on AAL-D-001

Evaluations · Jul 2026 · gemini-2.5-pro

Detection (95% CI): 99.6% [98.8%, 99.9%]

Category accuracy: 92.2%

Field accuracy: 91.6%

Exposure accuracy: 74.0%

Dual-exception recall: 100.0%

Escalation accuracy: 75.7% (v1.0) → 99.0% (v1.1)

False-positive rate: 0.0%

Runs: 3 · Errors: 0 / 750

This is the third complete model evaluation on AAL-D-001 and the first to use a multi-run methodology (3 runs × 250 cases = 750 scored observations). Gemini 2.5 Pro achieves 99.6% detection accuracy, with perfect dual-exception recall and zero false positives in the clean v1.0 run. A follow-on v1.1 evaluation, which adds explicit operational convention guidance, shows that v1.0 severely underestimated escalation accuracy because of a missing policy specification, and that exposure arithmetic is a genuine, persistent weakness.

Protocol

Each case was run three times independently at temperature 0 through the Gemini API. All accuracy figures carry Wilson 95% confidence intervals. A resume-safe checkpoint system ensured that no data was lost across API quota interruptions. Slots that errored because of billing exhaustion were re-run after the account was topped up, producing a fully clean 750-observation dataset.
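As a rough illustration of the Wilson interval mentioned above, the sketch below computes a 95% interval for a binomial proportion. The count of 747 correct detections out of 750 is an assumption inferred from the reported 99.6% and the 750 observations. It is not a figure stated in the report, and the function name is illustrative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumption: 747 of 750 observations scored as correct detections (99.6%).
low, high = wilson_interval(747, 750)
print(f"{low:.1%} to {high:.1%}")  # prints "98.8% to 99.9%"
```

Under that assumed count, the result matches the interval shown for detection. Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and remains usable for proportions close to 100%.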

The escalation gap was a benchmark artifact

v1.0 showed 75.7% escalation accuracy, a 19-point gap versus Flash's 94.8%. Dataset analysis revealed that ground-truth escalation is effectively categorical by exception type: all price, quantity, settlement date, counterparty, SSI, booking, and allocation exceptions require escalation; only commission and most duplicate exceptions do not. This rule was absent from the v1.0 prompt. Adding one paragraph of escalation policy guidance in v1.1 raised Pro's escalation accuracy to 99.0%, closing the gap to Flash entirely. The v1.0 escalation figure should therefore not be read as a measure of model capability.
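The categorical rule described above can be written as a simple lookup, which also works as a baseline to compare model escalation decisions against. This is a minimal sketch: the label strings and the function name are hypothetical, and treating every duplicate as non-escalating is an approximation, since the report says only "most" duplicates are exempt.

```python
# Hypothetical exception-type labels; the dataset's actual strings may differ.
ESCALATE_TYPES = {
    "price", "quantity", "settlement_date", "counterparty",
    "ssi", "booking", "allocation",
}
# Approximation: the report says "most" duplicates do not escalate.
NO_ESCALATE_TYPES = {"commission", "duplicate"}


def baseline_escalation(exception_type: str) -> bool:
    """Return True if the exception type requires escalation under the rule."""
    key = exception_type.strip().lower()
    if key in ESCALATE_TYPES:
        return True
    if key in NO_ESCALATE_TYPES:
        return False
    raise ValueError(f"unknown exception type: {exception_type!r}")


print(baseline_escalation("SSI"))         # True
print(baseline_escalation("commission"))  # False
```

A rule this short is consistent with the observation that one paragraph of policy guidance was enough to close the gap.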

Exposure arithmetic is a genuine weakness

Although v1.1 added an explicit calculation formula (notional × absolute rate difference for price breaks; notional difference for quantity breaks), exposure accuracy improved only from 74.0% to 76.6%. The same FX forward cases that failed in v1.0 continue to fail, with both models returning wrong numbers reproducibly across all three runs. This points to a genuine limitation in financial P&L arithmetic reasoning rather than a specification gap, and it warrants treating model-generated exposure figures as indicative rather than authoritative.
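For reference, the two formulas quoted above can be sketched deterministically as follows. The function names and the example trade values are hypothetical and not taken from the dataset, and taking the absolute value of the notional difference for quantity breaks is an assumption. The report also does not say which currency the exposure is expressed in, so none is implied here.

```python
from decimal import Decimal


def price_break_exposure(notional: Decimal, rate_a: Decimal, rate_b: Decimal) -> Decimal:
    """Price break: notional × absolute rate difference."""
    return abs(notional) * abs(rate_a - rate_b)


def quantity_break_exposure(notional_a: Decimal, notional_b: Decimal) -> Decimal:
    """Quantity break: difference between the two notionals (absolute, by assumption)."""
    return abs(notional_a - notional_b)


# Hypothetical FX forward: 1,000,000 notional booked at 1.0850 vs 1.0835.
print(price_break_exposure(Decimal("1000000"), Decimal("1.0850"), Decimal("1.0835")))
# Hypothetical quantity break: 1,000,000 vs 950,000.
print(quantity_break_exposure(Decimal("1000000"), Decimal("950000")))
```

Because the arithmetic is this mechanical, one plausible mitigation is to have the model extract the inputs and compute the exposure in code. `Decimal` is used here to avoid binary floating-point rounding in monetary values.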
