AAL-RS-006 · Published

GPT-4o on AAL-D-001 (v1.2 multi-run)

Evaluations·Jul 2026·gpt-4o-2024-08-06

Detection (95% CI): 99.2% · [98.3%, 99.6%]

Category accuracy: 90.2%

Field accuracy: 91.4%

Exposure accuracy: 68.2%

Dual-exception recall: 100.0%

Escalation accuracy: 98.9%

False-positive rate: 0.0%

Runs: 3 · Errors: 0 / 750

A multi-run re-evaluation of GPT-4o on AAL-D-001 using the v1.2 prompt (v1.1 escalation policy plus the v1.2 internal-field asymmetry note). The original AAL-RS-001 evaluation was a single run under the v1.0 prompt and produced 97.9% detection without sub-metrics. This evaluation runs all 250 cases three times (750 case-runs) under the same conditions as the Gemini and Claude evaluations, enabling full cross-model comparison. GPT-4o achieves 99.2% detection with 100% dual-exception recall and zero false positives, matching Gemini on all primary dimensions except exposure accuracy (68.2% vs ~76% for Gemini).

Why re-evaluate

The original GPT-4o evaluation (AAL-RS-001) was a single-run v1.0 evaluation that predated the multi-run methodology and sub-metric scoring developed during the Gemini evaluations. It produced a detection figure (97.9%) but no category, field, exposure, dual-exception, or escalation scores. This re-evaluation closes that gap and enables the first complete apples-to-apples comparison across all four models.

Detection improvement: v1.0 to v1.2

GPT-4o detection improved from 97.9% (single run, v1.0) to 99.2% (three runs, v1.2), a gain of 1.3 percentage points. The improvement is attributable to two factors: the v1.2 prompt fixes (escalation policy and internal-field note) resolving edge cases that v1.0 mishandled, and the multi-run methodology averaging out single-run variance. Because both changed at the same time, this evaluation does not separate their individual contributions. The 99.2% figure is the definitive GPT-4o result for AAL-D-001.
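As a rough check on the reported interval, the sketch below computes a 95% Wilson score interval. The source does not state which interval method was used or give raw counts. The figure of 744 detections out of 750 case-runs is an assumption chosen because it reproduces 99.2%, and `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Assumed counts (not given in the source): 744 of 750 case-runs detected.
low, high = wilson_interval(744, 750)
print(f"{744 / 750:.1%} [{low:.1%}, {high:.1%}]")
```

Under these assumptions the result rounds to the published [98.3%, 99.6%]. Treating the three runs as 750 independent trials is itself a simplification, since the same 250 cases are repeated in each run.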

Exposure accuracy gap vs Gemini

GPT-4o exposure accuracy (68.2%) sits between Claude Sonnet (63.5%) and the Gemini models (~76%). The non-Gemini models show meaningfully lower exposure arithmetic accuracy than Gemini Pro and Flash. This pattern, in which Gemini outperforms on financial P&L arithmetic while all models converge on detection, is now confirmed across five evaluations. It is the most actionable finding in the AAL-D-001 corpus for practitioners choosing a model for production deployment.
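To make the size of the gap concrete, the following sketch expresses each model's exposure accuracy as a percentage-point difference from Gemini. It uses only the figures quoted above. The Gemini value is approximate ("~76%"), so the gaps are indicative rather than exact, and the dictionary keys are illustrative labels.

```python
# Exposure accuracy figures quoted in this report, in percent.
exposure_accuracy = {
    "Claude Sonnet": 63.5,
    "GPT-4o": 68.2,
}

# Approximate value: the report gives "~76%" for the Gemini models.
gemini_approx = 76.0

# Percentage-point gap of each non-Gemini model below the Gemini figure.
gaps = {
    model: round(gemini_approx - accuracy, 1)
    for model, accuracy in exposure_accuracy.items()
}

for model, gap in gaps.items():
    print(f"{model}: about {gap:.1f} points below Gemini")
```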

Dual-exception recall

GPT-4o achieves 100% dual-exception recall, matching both Gemini models and confirming that the 87.4% figure for Claude Sonnet is a model-family difference rather than a general LLM limitation. OpenAI and Google models both catch all secondary exceptions; Anthropic's Claude misses the second exception in roughly 1 in 8 dual-exception cases.
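The "roughly 1 in 8" figure follows from the recall value. A minimal sketch of the arithmetic, where `one_in_n` is a hypothetical helper name:

```python
def one_in_n(recall: float) -> float:
    """Convert a recall fraction into 'misses 1 in N' form."""
    miss_rate = 1.0 - recall
    if miss_rate <= 0:
        raise ValueError("recall of 100% has no misses to express as 1 in N")
    return 1.0 / miss_rate


claude_recall = 0.874  # dual-exception recall quoted for Claude Sonnet
print(f"miss rate {1 - claude_recall:.1%}, about 1 in {one_in_n(claude_recall):.1f}")
```

A recall of 87.4% corresponds to a 12.6% miss rate, which is just under 1 in 8.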
