Residual errors concentrate in judgment-heavy booking exceptions
Where current models fail on this workflow, the failures concentrate in exceptions that require reasoning about booking entity and product structure rather than direct value comparison. GPT-4o, Gemini 2.5 Pro, and Gemini 2.5 Flash, which span different labs and architectures, all miss the same complex credit booking-entity case. Both Gemini models additionally show elevated miss rates on complex credit cases generally (95.0% detection vs 100% on all other asset classes). This finding is now Confirmed across three independent models.
Operational implication
Human review should be concentrated on booking, entity, and product-structure exceptions rather than spread uniformly across the queue. All three models clear the high-volume mechanical breaks; the residual risk sits in a small, dense set of reasoning-heavy derivative cases. AI relocates senior operations judgment rather than removing it.
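As a rough illustration of the routing described above, the sketch below sends judgment-heavy exceptions to a senior review queue and everything else to a lighter sampling queue. It is a minimal sketch under stated assumptions, not a description of any production system: the category labels, queue names, and the `ReconException`, `route`, and `triage` names are all hypothetical and would need to match whatever taxonomy the operations team actually uses.

```python
from dataclasses import dataclass

# Hypothetical category labels. The text names booking, entity, and
# product-structure exceptions as the judgment-heavy group; the exact
# strings used here are assumptions.
JUDGMENT_HEAVY = {"booking", "entity", "product_structure"}


@dataclass(frozen=True)
class ReconException:
    exception_id: str
    category: str  # assumed labels, e.g. "booking", "price", "quantity"


def route(exc: ReconException) -> str:
    """Return the name of the review queue for a single exception."""
    if exc.category in JUDGMENT_HEAVY:
        return "senior_review"
    # Mechanical breaks that the models already clear reliably.
    return "model_cleared_sampling"


def triage(exceptions):
    """Group exception IDs by review queue, preserving input order."""
    queues = {"senior_review": [], "model_cleared_sampling": []}
    for exc in exceptions:
        queues[route(exc)].append(exc.exception_id)
    return queues
```

The point of this design is that reviewer attention is allocated by exception type rather than spread evenly by volume. A real deployment would likely also sample the cleared queue to check that detection rates on mechanical breaks hold up.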