Research Portal

Every benchmark, evaluation, and finding — live.

The complete AI Alpha Labs corpus. Each entry carries its status, its evidence level, and a link to the materials that make it reproducible. Nothing is published until it is defensible.

AAL-R-2026-001 · Publications

Trade Confirmation Exception Identification

A production benchmark for whether frontier models can detect, classify, and quantify settlement exceptions across seven asset classes.

Published
AAL-D-001 · Benchmarks

Trade Confirmation Exception Identification — Dataset

250 validated cases pairing counterparty confirmations against internal records across cash and derivative products, with ground truth and per-case scoring criteria.

Active · v1.0
AAL-RS-001 · Evaluations

GPT-4o on AAL-D-001

First published evaluation: 97.9% first-pass detection across all 250 cases, with the single failure case documented rather than hidden.

Published
AAL-RS-002 · Evaluations

Gemini 2.5 Pro on AAL-D-001

99.6% detection, 100% dual-exception recall, 0 errors across 750 scored observations — and a persistent exposure arithmetic weakness that survives prompt improvement.

Published
AAL-RS-003 · Evaluations

Gemini 2.5 Flash on AAL-D-001

99.2% detection, 100% dual-exception recall, zero errors across 750 observations — statistically indistinguishable from Pro on every metric in v1.1.

Published
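As a rough illustration of what "statistically indistinguishable" can mean for the detection figures quoted for Pro (99.6%) and Flash (99.2%), the sketch below runs a pooled two-proportion z-test. It rests on assumptions not stated in this listing: that both detection rates are computed over the same 750 observations, and that those observations can be treated as independent. The function name is illustrative and this is not necessarily the test the evaluation itself used.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: the two proportions are equal."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# ASSUMPTION: detection is scored over 750 observations per model.
N = 750
pro_hits = round(0.996 * N)    # 99.6% detection (Gemini 2.5 Pro)
flash_hits = round(0.992 * N)  # 99.2% detection (Gemini 2.5 Flash)

z, p = two_proportion_z_test(pro_hits, N, flash_hits, N)
print(f"z = {z:.2f}, p = {p:.2f}")
```

Under these assumptions the p-value comes out well above the conventional 0.05 threshold, so the 0.4-point difference would not count as significant. With so few misses in each group the normal approximation is coarse, and an exact test would be a safer choice for a formal comparison.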
AAL-RS-004 · Evaluations

Prompt Sensitivity Analysis: v1.0 vs v1.1 on AAL-D-001

A 19-point escalation gap between Pro and Flash vanishes with one paragraph. Exposure arithmetic stays broken. Benchmark iteration separates specification gaps from genuine model limitations.

Published
AAL-RS-005 · Evaluations

Claude Sonnet 4.6 on AAL-D-001

98.8% detection, 0.0% false positive rate, and 0 errors across 750 observations, with lower exposure accuracy (63.5%) and dual-exception recall (87.4%) than Gemini, plus a novel EXC-STAT false positive pattern that requires a v1.2 prompt fix.

Published
AAL-RS-006 · Evaluations

GPT-4o on AAL-D-001 (v1.2 multi-run)

99.2% detection, 100% dual-exception recall, 0 errors across 750 observations — closing the comparison table with full sub-metrics and confirming GPT-4o matches Gemini on all dimensions except exposure arithmetic.

Published
AAL-F-001 · Findings

Detection accuracy clusters near ceiling on objective exceptions

Across four independent model evaluations spanning three labs, first-pass detection on objective exceptions sits at 97.9–99.6%. The pattern is now confirmed across models, architectures, and labs.

Confirmed
AAL-F-002 · Findings

Residual errors concentrate in judgment-heavy booking exceptions

Three independent frontier models miss the same judgment-heavy derivative cases while clearing 99%+ of everything else — residual risk lives in entity- and structure-level reasoning.

Confirmed
AAL-F-003 · Findings

Models detect exceptions far more reliably than they quantify them

Detection sits at ~99.5% while exposure-arithmetic accuracy plateaus at ~75% across both Gemini models and both prompt versions (v1.0 and v1.1): a gap of roughly 25 points that does not close with explicit formula guidance.

Confirmed
AAL-F-004 · Findings

Benchmark specification gaps can masquerade as model capability differences

A 19-point escalation gap between Pro and Flash in v1.0 vanished entirely in v1.1 after adding one paragraph of policy guidance — a case study in how underspecified benchmarks produce misleading model comparisons.

Confirmed
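The reasoning behind this finding can be sketched as a simple decision rule: a gap that disappears after a prompt revision points to an underspecified benchmark, while a gap that survives the revision points to the model. The function name, labels, and the 2-point tolerance below are hypothetical choices made for illustration. The inputs are the figures quoted in this listing: the 19-point escalation gap that vanished in v1.1, and the roughly 25-point detection-to-exposure gap that persisted.

```python
def classify_gap(gap_before, gap_after, tolerance=2.0):
    """Label a metric gap (in percentage points) by its response to a prompt revision.

    ASSUMPTION: `tolerance` is an illustrative threshold, not a value
    taken from the benchmark.
    """
    if abs(gap_before) <= tolerance:
        return "no gap"
    if abs(gap_after) <= tolerance:
        # The gap closed once the prompt was clarified.
        return "specification gap"
    # The gap survived the clarification.
    return "persistent gap"

# Pro vs Flash escalation gap: 19 points in v1.0, gone in v1.1.
escalation = classify_gap(19.0, 0.0)
# Detection vs exposure-arithmetic gap: about 25 points in both versions.
exposure = classify_gap(25.0, 25.0)

print(escalation)
print(exposure)
```

A rule like this only separates the two cases when the prompt revision targets the metric in question, so a persistent gap is evidence of a model limitation rather than proof of one.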
AAL-F-005 · Findings

Dual-exception recall and exposure accuracy vary meaningfully across model families

GPT-4o and Gemini models achieve 100% dual-exception recall; Claude Sonnet achieves 87.4%. Exposure accuracy splits by lab: Gemini ~76%, GPT-4o 68.2%, Claude 63.5%. The model-family pattern holds across five evaluations.

Provisional
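To make the detection-versus-quantification spread concrete, the short sketch below subtracts each model's quoted exposure accuracy from its quoted detection rate. All figures come from the entries above. One assumption is that the family-level "Gemini ~76%" figure applies to Gemini 2.5 Pro; the dictionary layout and function name are illustrative only.

```python
# Figures quoted in the listing above, in percent.
# ASSUMPTION: the family-level "Gemini ~76%" exposure figure is used for Pro.
results = {
    "Gemini 2.5 Pro": {"detection": 99.6, "exposure": 76.0},
    "GPT-4o (v1.2)": {"detection": 99.2, "exposure": 68.2},
    "Claude Sonnet 4.6": {"detection": 98.8, "exposure": 63.5},
}

def detection_exposure_gap(scores):
    """Percentage-point gap between detection and exposure accuracy."""
    return round(scores["detection"] - scores["exposure"], 1)

gaps = {model: detection_exposure_gap(s) for model, s in results.items()}

# Print models from smallest to largest gap.
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{model:<18} {gap:5.1f} points")
```

On these inputs the gap ranges from about 24 points for Gemini to about 35 points for Claude, which is the family-level split the finding describes.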