Every benchmark, evaluation, and finding — live.
The complete AI Alpha Labs corpus. Each entry carries its status, its evidence level, and a link to the materials that make it reproducible. Nothing is published until it is defensible.
Trade Confirmation Exception Identification
A production benchmark for whether frontier models can detect, classify, and quantify settlement exceptions across seven asset classes.
Trade Confirmation Exception Identification — Dataset
250 validated cases pairing counterparty confirmations against internal records across cash and derivative products, with ground truth and per-case scoring criteria.
GPT-4o on AAL-D-001
First published evaluation: 97.9% first-pass detection across all 250 cases, with the single failure case documented rather than hidden.
Gemini 2.5 Pro on AAL-D-001
99.6% detection, 100% dual-exception recall, 0 errors across 750 scored observations — and a persistent exposure arithmetic weakness that survives prompt improvement.
Gemini 2.5 Flash on AAL-D-001
99.2% detection, 100% dual-exception recall, zero errors across 750 observations — statistically indistinguishable from Pro on every metric in v1.1.
Prompt Sensitivity Analysis: v1.0 vs v1.1 on AAL-D-001
A 19-point escalation gap between Pro and Flash vanishes after one added paragraph of policy guidance. Exposure arithmetic stays broken. Benchmark iteration separates specification gaps from genuine model limitations.
Claude Sonnet 4.6 on AAL-D-001
98.8% detection, 0.0% false positive rate, 0 errors across 750 observations, but lower exposure accuracy (63.5%) and dual-exception recall (87.4%) than Gemini, plus a novel EXC-STAT false positive pattern that requires a v1.2 prompt fix.
GPT-4o on AAL-D-001 (v1.2 multi-run)
99.2% detection, 100% dual-exception recall, 0 errors across 750 observations — closing the comparison table with full sub-metrics and confirming GPT-4o matches Gemini on all dimensions except exposure arithmetic.
Detection accuracy clusters near ceiling on objective exceptions
In four independent model evaluations from three labs, first-pass detection on objective exceptions sits at 97.9–99.6%. The pattern now holds across models, architectures, and labs.
Residual errors concentrate in judgment-heavy booking exceptions
Three independent frontier models miss the same judgment-heavy derivative cases while clearing 99%+ of everything else — residual risk lives in entity- and structure-level reasoning.
Models detect exceptions far more reliably than they quantify them
Detection sits at ~99.5% while exposure-arithmetic accuracy plateaus at ~75% across both Gemini models and both prompt versions, a roughly 25-point gap that does not close with explicit formula guidance.
Benchmark specification gaps can masquerade as model capability differences
A 19-point escalation gap between Pro and Flash in v1.0 vanished entirely in v1.1 after adding one paragraph of policy guidance — a case study in how underspecified benchmarks produce misleading model comparisons.
Dual-exception recall and exposure accuracy vary meaningfully across model families
GPT-4o and Gemini models achieve 100% dual-exception recall; Claude Sonnet achieves 87.4%. Exposure accuracy splits by lab: Gemini ~76%, GPT-4o 68.2%, Claude 63.5% — a confirmed model-family pattern across five evaluations.
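The detection-versus-exposure gap described in the findings above can be recomputed from the headline figures alone. The following is a minimal sketch, not the benchmark's own scoring code: the `RESULTS` table and the function names are hypothetical, the percentages are copied from the entries on this page, and the Gemini exposure figure is only given approximately (~76%) and is assumed here to apply to both Pro and Flash.

```python
# Headline percentages copied from the entries on this page.
# Assumption: the approximate Gemini exposure figure (~76%) is applied
# to both Pro and Flash; the page does not list it per model.
RESULTS = {
    "Gemini 2.5 Pro": {"detection": 99.6, "exposure": 76.0},
    "Gemini 2.5 Flash": {"detection": 99.2, "exposure": 76.0},
    "GPT-4o (v1.2)": {"detection": 99.2, "exposure": 68.2},
    "Claude Sonnet 4.6": {"detection": 98.8, "exposure": 63.5},
}


def detection_exposure_gap(metrics):
    """Return detection minus exposure accuracy, in percentage points."""
    return round(metrics["detection"] - metrics["exposure"], 1)


def gaps_by_model(results):
    """Map each model name to its detection-exposure gap."""
    return {model: detection_exposure_gap(m) for model, m in results.items()}


if __name__ == "__main__":
    for model, gap in gaps_by_model(RESULTS).items():
        print(f"{model}: {gap:.1f} points")
```

Under these figures every model shows a gap of more than 20 points, and the ordering of the gaps follows the exposure-accuracy split by lab rather than detection, which is nearly flat.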