AAL-F-005 · Provisional

Dual-exception recall and exposure accuracy vary meaningfully across model families

Findings · Jul 2026
Dual recall · GPT-4o + Gemini: 100.0%
Dual recall · Claude: 87.4%
Exposure · Gemini: ~76%
Exposure · GPT-4o: 68.2%
Exposure · Claude: 63.5%
Evidence level: Provisional

Detection accuracy is near-equivalent across all evaluated models, but two sub-metrics diverge meaningfully by model family. On dual-exception recall, GPT-4o and both Gemini models catch every secondary exception (100.0%), while Claude Sonnet misses the second exception in roughly 1 in 8 dual-exception cases (87.4%). Exposure accuracy splits by lab: Gemini ~76%, GPT-4o 68.2%, Claude 63.5%. These gaps survive prompt correction and have now been observed in five evaluations from three labs.
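As a rough illustration of the dual-exception recall metric, the sketch below scores the share of dual-exception cases in which both exceptions were flagged. The case structure, the function name `dual_exception_recall`, and the exception labels are assumptions made for the example. The data is illustrative only and is not taken from the evaluations.

```python
from typing import Iterable, Set, Tuple

# Hypothetical case structure: (exceptions actually present, exceptions the model flagged).
Case = Tuple[Set[str], Set[str]]


def dual_exception_recall(cases: Iterable[Case]) -> float:
    """Fraction of dual-exception cases in which both exceptions were flagged."""
    # Only cases with exactly two expected exceptions count toward this metric.
    dual_cases = [(expected, flagged) for expected, flagged in cases if len(expected) == 2]
    if not dual_cases:
        raise ValueError("no dual-exception cases to score")
    # A case is a hit only if every expected exception appears in the flagged set.
    caught_both = sum(1 for expected, flagged in dual_cases if expected <= flagged)
    return caught_both / len(dual_cases)


# Illustrative data only: 8 dual-exception cases, one with the second break missed.
both = {"price_break", "quantity_break"}  # hypothetical exception labels
example_cases = [(both, both)] * 7
example_cases.append((both, {"price_break"}))
example_cases.append(({"price_break"}, {"price_break"}))  # single-exception case, ignored

recall = dual_exception_recall(example_cases)
print(f"dual-exception recall: {recall:.1%}")
```

With one miss in eight dual-exception cases, the illustrative recall lands close to the figure reported for Claude Sonnet.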

Operational implication of dual-exception gap

Missing a secondary exception means the exception report is incomplete. The 12.6-percentage-point dual-exception recall gap between Gemini (100.0%) and Claude (87.4%) is operationally significant for complex multi-leg trades and derivative cases, where two simultaneous breaks are most common.
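The gap and the "roughly 1 in 8" miss rate follow directly from the reported recall figures. A minimal sketch of that arithmetic, with variable names chosen for illustration:

```python
# Dual-exception recall figures reported in this finding, in percent.
dual_recall = {"GPT-4o": 100.0, "Gemini": 100.0, "Claude": 87.4}

# Gap in percentage points between Gemini and Claude.
gap_points = round(dual_recall["Gemini"] - dual_recall["Claude"], 1)

# Claude's miss rate on dual-exception cases, as a fraction.
miss_rate = round(100.0 - dual_recall["Claude"], 1) / 100.0

# Average number of dual-exception cases per missed secondary exception.
cases_per_miss = 1.0 / miss_rate

print(f"gap: {gap_points} points, about 1 miss per {cases_per_miss:.1f} dual-exception cases")
```

A miss rate of 12.6% works out to a little under one miss per eight dual-exception cases.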

Why provisional

Two data points (Gemini family vs Claude) are insufficient to draw firm conclusions about model-family patterns. The finding advances to Confirmed if a third independent model family shows a consistent pattern on these dimensions.
