AAL-RS-001 · Published

GPT-4o on AAL-D-001

Evaluations · Jun 2026 · gpt-4o-2024-08-06

Detection accuracy: 97.9%

False negatives: 1

Asset classes: 7 / 7

Scoring: Deterministic

This is the first complete model evaluation on AAL-D-001. GPT-4o was run blind to ground truth and scored by the deterministic engine across all 250 cases and seven asset classes.

Result

GPT-4o achieved 97.9% first-pass detection accuracy across the full set. The single false negative occurred on a complex credit booking exception, a case that required the model to reason about the booking entity rather than spot a surface field mismatch.
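To make the two headline numbers concrete, here is a minimal sketch of how a deterministic scorer could derive detection accuracy and the false-negative list from model flags and ground truth. It is not the AAL scoring engine. The `Case` record, the `score` function, the toy case IDs, the asset-class labels, and the choice to define detection accuracy as detected exceptions divided by true exceptions are all assumptions made for illustration. This summary does not state the metric's exact definition or denominator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical record; field names are illustrative, not the AAL-D-001 schema."""
    case_id: str
    asset_class: str
    has_exception: bool  # ground truth, hidden from the model during the run
    flagged: bool        # whether the model reported an exception


def score(cases):
    """Return (detection accuracy, false negatives) under the assumed definition.

    Assumption: detection accuracy = detected exceptions / true exceptions.
    The same inputs always produce the same outputs, so scoring is deterministic.
    """
    positives = [c for c in cases if c.has_exception]
    false_negatives = [c for c in positives if not c.flagged]
    detected = len(positives) - len(false_negatives)
    accuracy = detected / len(positives) if positives else float("nan")
    return accuracy, false_negatives


# Toy data only; these are not cases or results from the evaluation.
cases = [
    Case("toy-001", "credit", True, True),
    Case("toy-002", "rates", True, True),
    Case("toy-003", "credit", True, False),  # a missed exception
    Case("toy-004", "fx", False, False),     # clean case, correctly left unflagged
]

accuracy, misses = score(cases)
print(f"{accuracy:.1%}", [c.case_id for c in misses])
```

Returning the missed cases alongside the rate is a deliberate choice in this sketch: it keeps each failure inspectable rather than folding it into a single percentage.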

Why the failure case matters

We publish the failure rather than round it away. That the residual error is concentrated in a judgment-heavy booking exception, rather than in objective value mismatches, is itself a finding about where current models are and are not reliable for operations use.
