AAL-F-005 · Provisional

Dual-exception recall and exposure accuracy vary meaningfully across model families

Findings · Jul 2026

Dual recall · GPT-4o + Gemini: 100.0%

Dual recall · Claude: 87.4%

Exposure · Gemini: ~76%

Exposure · GPT-4o: 68.2%

Exposure · Claude: 63.5%

Evidence level: Provisional

While detection accuracy is near-equivalent across all evaluated models, two sub-metrics diverge meaningfully by model family. Dual-exception recall: GPT-4o and both Gemini models catch all secondary exceptions (100%), while Claude Sonnet misses the second exception in roughly 1 in 8 dual-exception cases (87.4% recall). Exposure accuracy splits by lab: Gemini ~76%, GPT-4o 68.2%, Claude 63.5%. These gaps survive prompt correction and hold across five evaluations from three labs.
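The arithmetic behind the "1 in 8" and gap figures can be checked with a minimal sketch like the one below. It uses only the percentages reported in this finding; the dictionary layout and the helper names `miss_rate` and `one_in_n` are illustrative assumptions, not part of the evaluation harness, and the Gemini exposure figure is approximate (~76%).

```python
# Percentages as reported in this finding. Names and layout are illustrative only.
dual_recall = {"GPT-4o": 100.0, "Gemini": 100.0, "Claude": 87.4}
exposure = {"Gemini": 76.0, "GPT-4o": 68.2, "Claude": 63.5}  # Gemini value is approximate


def miss_rate(recall_pct: float) -> float:
    """Share of dual-exception cases where the second exception is missed, in percent."""
    return round(100.0 - recall_pct, 1)


def one_in_n(miss_pct: float) -> float:
    """Express a miss rate as 'one in N cases'."""
    return 100.0 / miss_pct


claude_miss = miss_rate(dual_recall["Claude"])
recall_gap = round(max(dual_recall.values()) - min(dual_recall.values()), 1)
exposure_spread = round(max(exposure.values()) - min(exposure.values()), 1)

print(f"Claude miss rate: {claude_miss} points (about 1 in {one_in_n(claude_miss):.1f})")
print(f"Dual-recall gap: {recall_gap} points")
print(f"Exposure spread (approximate): {exposure_spread} points")
```

A miss rate of 12.6% works out to slightly under 1 in 8, which is why the summary rounds it that way. The exposure spread inherits the approximation in the Gemini figure and should be read as indicative rather than exact.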

Operational implication of dual-exception gap

Missing a secondary exception means the exception report is incomplete. The 12.6-point dual-exception recall gap between GPT-4o and Gemini (100%) and Claude (87.4%) is operationally significant for complex multi-leg trades and derivative cases, where two simultaneous breaks are most common.

Why provisional

Two data points (Gemini family vs Claude) are insufficient to draw firm conclusions about model-family patterns. The finding advances to Confirmed if a third independent model family shows a consistent pattern on these dimensions.
