AAL-F-003 · Confirmed

Models detect exceptions far more reliably than they quantify them

Findings · Jul 2026

Detection: ~99.5% (both models)
Exposure accuracy: ~73–77% (both models)
Gap: ~22–25 pts
Evidence level: Confirmed

Across Gemini 2.5 Pro and Flash, in both v1.0 and v1.1 prompt configurations, the gap between detecting an exception and correctly quantifying its dollar exposure is large and stable: ~99.5% detection against ~73–77% exposure accuracy. Providing the exact calculation formula in v1.1 improved exposure accuracy by only 2–3 points. Both models return the same wrong numbers on the same FX forward cases, reproducibly across multiple runs. This finding is Confirmed: the pattern holds across independent models and survives prompt improvement.

Why the formula did not fix it

The v1.1 prompt added an explicit rule: for price breaks, multiply notional by the absolute rate difference. For a USD/KRW NDF with $8M notional and a 0.05 pip break, the rule gives 8,000,000 × 0.05 = $400,000, which is the correct exposure. Both models returned $290.28 across all runs in both versions. This is not a missing instruction; it is a failure to apply the instruction correctly to FX forward P&L arithmetic. The weakness appears to lie in the models' ability to reason about currency pair conventions and rate-to-dollar conversion, not in their understanding of the general formula.
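The arithmetic in this example can be reproduced with a minimal sketch. It assumes the rule is exactly as described above (notional times absolute rate difference) and takes the 0.05 break at face value as the rate difference. The function name `price_break_exposure` and the variable names are hypothetical, not part of the evaluated prompt or any system of record.

```python
from decimal import Decimal


def price_break_exposure(notional, rate_difference):
    """Sketch of the v1.1 rule as described: notional x |rate difference|.

    Assumption: both inputs are passed as strings or Decimals so the
    result is exact rather than subject to binary floating-point error.
    """
    return Decimal(notional) * abs(Decimal(rate_difference))


# Figures taken from the USD/KRW NDF case in the text.
expected = price_break_exposure("8000000", "0.05")

# Value both models returned across all runs, per the finding.
model_output = Decimal("290.28")

# How far the model figure is from the expected one, as a multiple.
ratio = expected / model_output

print(f"expected exposure: {expected}")
print(f"model output:      {model_output}")
print(f"expected / model:  {ratio:.1f}x")
```

The ratio makes the size of the error concrete: the model figure is smaller than the expected one by a factor of more than a thousand, which is not a rounding-level discrepancy.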

Operational implication

Model-generated exposure figures should be treated as indicative rather than authoritative. For triage and screening, detecting the right breaks at 99.5% accuracy is high-value. For prioritization by materiality and regulatory exposure reporting, exposure figures require validation against system-of-record calculations.
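One way to apply this in practice is a tolerance check that accepts a model exposure figure only when it agrees with the system-of-record value. The sketch below is illustrative: the function name `needs_validation_review` and the 1% relative tolerance are assumptions, not values from the evaluation.

```python
from decimal import Decimal

# Assumed tolerance for illustration only; a real threshold would be
# set by the team that owns the exposure reporting.
DEFAULT_REL_TOLERANCE = Decimal("0.01")


def needs_validation_review(model_exposure, sor_exposure,
                            rel_tolerance=DEFAULT_REL_TOLERANCE):
    """Return True when the model figure disagrees with the system of record.

    The comparison is relative to the system-of-record value. When that
    value is zero, any non-zero model figure is flagged.
    """
    model = Decimal(model_exposure)
    sor = Decimal(sor_exposure)
    if sor == 0:
        return model != 0
    return abs(model - sor) / abs(sor) > rel_tolerance


# The USD/KRW NDF case from this finding: model figure vs. expected figure.
print(needs_validation_review("290.28", "400000"))   # flagged for review
print(needs_validation_review("400000", "400000"))   # agrees, not flagged
```

Under this design the detection signal can still drive triage on its own, while any exposure figure used for materiality ranking or regulatory reporting passes through the check first.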
