AAL-F-002 · Confirmed

Residual errors concentrate in judgment-heavy booking exceptions

Findings · Jul 2026

Evidence level: Confirmed (3 models)

Shared miss: All 3 models · case 008

Credit detection: 95.0% (vs 100% elsewhere)

Where current models fail on this workflow, the failures concentrate in exceptions that require reasoning about booking entity and product structure rather than direct value comparison. GPT-4o, Gemini 2.5 Pro, and Gemini 2.5 Flash, which come from two labs and use different architectures, all miss the same complex credit booking-entity case (case 008). Both Gemini models also show an elevated miss rate on complex credit cases in general: 95.0% detection, against 100% on all other asset classes. This finding is now Confirmed across three models.

Operational implication

Human review should be concentrated on booking, entity, and product-structure exceptions rather than spread uniformly across the queue. All three models clear the high-volume mechanical breaks; the residual risk sits in a small, dense set of reasoning-heavy derivative cases. AI relocates senior operations judgment rather than removing it.
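
The routing idea above can be illustrated with a minimal sketch. It assumes each break already carries a category label assigned upstream. The `BreakItem` type, the `route` function, the category names, and the sample queue are all hypothetical and are not taken from the workflow itself.

```python
from dataclasses import dataclass

# Hypothetical category labels. The finding names booking, entity, and
# product-structure exceptions but does not define a schema for them.
JUDGMENT_HEAVY = {"booking", "entity", "product_structure"}


@dataclass(frozen=True)
class BreakItem:
    break_id: str
    category: str  # assumed to be assigned by an upstream classifier


def route(items):
    """Split a queue into (human_review, auto_clear), preserving order."""
    human, auto = [], []
    for item in items:
        target = human if item.category in JUDGMENT_HEAVY else auto
        target.append(item)
    return human, auto


# Illustrative queue only; these are not cases from the evaluation.
queue = [
    BreakItem("B-1", "price_mismatch"),
    BreakItem("B-2", "booking"),
    BreakItem("B-3", "quantity_mismatch"),
    BreakItem("B-4", "product_structure"),
]

human_review, auto_clear = route(queue)
print([b.break_id for b in human_review])
print([b.break_id for b in auto_clear])
```

In practice the category label would itself come from a model or a rule set, so a real queue would probably also need a fallback that sends unlabelled or low-confidence items to human review.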
