AAL-RS-005 · Published

Claude Sonnet 4.6 on AAL-D-001

Evaluations·Jul 2026·claude-sonnet-4-6

Detection (95% CI): 98.8% [97.7%, 99.4%]

Category accuracy: 90.8%

Field accuracy: 88.1%

Exposure accuracy: 63.5%

Dual-exception recall: 87.4%

Escalation accuracy: 97.7%

False-positive rate: 0.0%

Runs: 3 · Errors: 0 / 750

This is the fifth complete model evaluation on AAL-D-001 and the first of an Anthropic model. Claude Sonnet 4.6 was run three times across all 250 cases (750 case-runs), using the same multi-run methodology as the Gemini evaluations. An initial v1.1 run revealed a systematic false-positive pattern in which Claude flagged the internal status field as an EXC-STAT exception on clean cases; fixing it required a v1.2 prompt addition. After the correction, Claude achieves 98.8% detection with zero false positives, but shows meaningfully lower exposure accuracy (63.5%) and dual-exception recall (87.4%) than both Gemini models.
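The detection interval in the summary can be reproduced with a short calculation. The sketch below assumes a Wilson score interval and 741 correct detections out of 750 case-runs. Neither the interval method nor the raw count is stated in this report; both are assumptions, chosen only because they are consistent with the published 98.8% and [97.7%, 99.4%]. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - margin, center + margin


# Assumed counts: 741 of 750 case-runs (not stated in the report).
low, high = wilson_interval(741, 750)
print(f"{741 / 750:.1%} [{low:.1%}, {high:.1%}]")
```

Under these assumptions the script prints `98.8% [97.7%, 99.4%]`. A different count or interval method could round to the same figures, so this is a plausibility check and not a statement of how the interval was actually computed.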

The EXC-STAT false positive pattern

In the initial v1.1 run, Claude produced a 67.5% false-positive rate. All 154 false positives were EXC-STAT: Claude treated the internal status field as an exception because it appeared in the internal record but not in the counterparty confirmation. Internal workflow fields such as status, book, account, and broker are intentionally absent from counterparty confirmations by dataset design. A one-sentence clarification in the v1.2 prompt eliminated all false positives. The Gemini models did not exhibit this behavior, which suggests that Claude has a stronger prior toward treating field-presence asymmetry as a discrepancy.
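The distinction between a genuine break and a field that is absent by design can be sketched as a comparison that skips internal-only fields. This is a minimal illustration, not the evaluation harness or the v1.2 prompt wording: the four excluded field names come from the paragraph above, while `find_breaks`, the record shapes, and the other field names (`trade_id`, `notional`, `currency`) are hypothetical.

```python
# Internal workflow fields the report names as absent from confirmations by design.
INTERNAL_ONLY_FIELDS = {"status", "book", "account", "broker"}


def find_breaks(internal: dict, confirmation: dict) -> dict:
    """Return {field: (internal_value, confirmation_value)} for genuine mismatches.

    Fields in INTERNAL_ONLY_FIELDS are skipped, so their absence from the
    confirmation is never reported as a discrepancy.
    """
    breaks = {}
    for field in (internal.keys() | confirmation.keys()) - INTERNAL_ONLY_FIELDS:
        ours, theirs = internal.get(field), confirmation.get(field)
        if ours != theirs:
            breaks[field] = (ours, theirs)
    return breaks


# Hypothetical clean case: shared fields agree; status and book exist only internally.
internal = {"trade_id": "T-1", "notional": 1_000_000, "currency": "USD",
            "status": "CONFIRMED", "book": "FX-01"}
confirmation = {"trade_id": "T-1", "notional": 1_000_000, "currency": "USD"}
print(find_breaks(internal, confirmation))  # {}
```

Without the exclusion set, the same comparison would report `status` and `book` as mismatches on this clean case, which is the shape of the false positive described above.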

Exposure accuracy gap

Claude's exposure accuracy (63.5%) is 12 to 13 percentage points below that of both Gemini models (~76%). The FX forward cases that trip up Gemini also trip up Claude, but Claude makes additional errors on other instrument types. Weakness in financial P&L arithmetic is shared across all models but is more pronounced in Claude.

Dual-exception recall gap

Claude catches both exceptions in dual-exception cases 87.4% of the time, versus 100% for both Gemini models. In roughly 1 in 8 dual-exception cases, Claude identifies the primary exception but misses the secondary one, so exception reports may be incomplete on complex multi-break cases.
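One way to read this metric is as an all-or-nothing score per dual-exception case: a case counts only if every expected exception code appears in the model's output. The sketch below follows that reading, which is an assumption about the scoring rule and not a description of the actual grader. The function name and the codes `EXC-A`, `EXC-B`, and `EXC-C` are hypothetical placeholders.

```python
def dual_exception_recall(cases: list[tuple[set[str], set[str]]]) -> float:
    """Fraction of dual-exception cases in which both expected codes were predicted.

    Each case is (expected_codes, predicted_codes). Cases that do not have
    exactly two expected codes are ignored.
    """
    dual = [(expected, predicted) for expected, predicted in cases if len(expected) == 2]
    if not dual:
        raise ValueError("no dual-exception cases")
    caught_both = sum(1 for expected, predicted in dual if expected <= predicted)
    return caught_both / len(dual)


# Hypothetical exception codes and model outputs, for illustration only.
cases = [
    ({"EXC-A", "EXC-B"}, {"EXC-A", "EXC-B"}),  # both caught
    ({"EXC-A", "EXC-C"}, {"EXC-A"}),           # secondary exception missed
    ({"EXC-B"}, {"EXC-B"}),                    # single-exception case, ignored
]
print(dual_exception_recall(cases))  # 0.5
```

The "1 in 8" figure follows from the reported recall: 1 − 0.874 = 0.126, which is close to 1/8 = 0.125.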
