We independently evaluate AI for capital markets.
AI Alpha Labs tests frontier models on real trade-operations workflows — scored by a deterministic engine, published with full methodology, and reproducible from the materials we release.
Not a model vendor. An independent evaluator.
Independent
We don't build the models we score. No commercial incentive to inflate a result — the only product is the evidence.
Reproducible
Every benchmark ships with its dataset, prompts, and a deterministic scorer. Re-run it and get the same number.
Transparent
We publish confidence intervals, variance, and the cases models fail — not just a headline accuracy figure.
Financial-services focused
Benchmarks built on real trade-operations workflows — confirmations, exceptions, settlement — not academic tasks.
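As an illustration of what reporting a confidence interval alongside a headline rate can look like, here is a minimal sketch that computes a Wilson score interval for a detection rate. The function name and the counts in the usage line are hypothetical and are not taken from any published AAL result; the interval method actually used in the published methodology is not stated here, so Wilson is an assumption.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)
    )
    return centre - half_width, centre + half_width


# Hypothetical counts for illustration only: 240 correct detections in 250 cases.
low, high = wilson_interval(240, 250)
print(f"detection 96.0%, 95% CI [{low:.1%}, {high:.1%}]")
```

With a few hundred cases, the interval around a rate in the high nineties is still a couple of points wide, which is why the interval is worth publishing next to the headline figure.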
A growing body of evidence.
Trade Confirmation Exception Identification
The flagship study: can frontier models detect, classify, and quantify settlement exceptions across seven asset classes, scored deterministically?
Benchmark Overview & Dataset
250 validated cases pairing counterparty confirmations against internal records, with ground truth and per-case scoring criteria.
Benchmark Results — Four Models, Three Labs
Five published evaluations across GPT-4o, Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4.6. Detection range 97.9–99.6%. Failures documented, not hidden.
Governance Frameworks
Six framework documents — Standards through Findings — frozen at v1.0 so every published result is reproducible and audit-defensible.
How evidence is built — end to end.
01 Benchmark Engineering
Construct validated cases across asset classes with ground truth and scoring criteria.
02 Evaluation
Run models blind to ground truth; grade with a deterministic engine, not model judgment.
03 QA
Two-reviewer ground truth, arithmetic verification, and dataset validation before any score counts.
04 Evidence
Findings advance Beta → Provisional → Confirmed → Published as evidence accumulates.
05 Publications
Results released with full methodology, prompts, and rubrics, reproducible by anyone.
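The evaluation step above (deterministic grading against ground truth, with no model judgment) can be sketched as follows. This is an assumption-laden illustration, not the AAL engine: the class names, field names, and the three scored dimensions (detection, classification, quantification, taken from the study's "detect, classify, and quantify" wording) are hypothetical, as is the rule that later dimensions only count when detection is correct.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class GroundTruth:
    has_exception: bool
    exception_type: Optional[str]   # None when the case has no exception
    amount: Optional[Decimal]       # None when there is nothing to quantify
    tolerance: Decimal              # hypothetical per-case absolute tolerance


@dataclass(frozen=True)
class ModelAnswer:
    has_exception: bool
    exception_type: Optional[str]
    amount: Optional[Decimal]


def score_case(truth: GroundTruth, answer: ModelAnswer) -> dict[str, bool]:
    """Grade one case with fixed rules: same inputs always give the same scores."""
    detected = answer.has_exception == truth.has_exception
    classified = detected and answer.exception_type == truth.exception_type
    if truth.amount is None:
        quantified = detected and answer.amount is None
    else:
        quantified = (
            detected
            and answer.amount is not None
            and abs(answer.amount - truth.amount) <= truth.tolerance
        )
    return {
        "detection": detected,
        "classification": classified,
        "quantification": quantified,
    }


# Hypothetical case: the model finds the right exception within tolerance.
truth = GroundTruth(True, "price_mismatch", Decimal("1250.00"), Decimal("0.01"))
answer = ModelAnswer(True, "price_mismatch", Decimal("1250.005"))
print(score_case(truth, answer))
```

Using `Decimal` rather than floats keeps the tolerance comparison exact, so a re-run on the same answers cannot flip a score through rounding.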
Trade Confirmation Exception Identification.
| # | Model | Detection (%) | False neg. | Asset classes | Scoring | Status |
|---|---|---|---|---|---|---|
| 01 | GPT-4o (gpt-4o-2024-08-06) | 97.9 | 1 | 7 / 7 | Deterministic | Published |
| 02 | Gemini 2.5 Pro (gemini-2.5-pro) | 99.6 | 1 | 7 / 7 | Deterministic | Published |
| 03 | Gemini 2.5 Flash (gemini-2.5-flash) | 99.2 | 1 | 7 / 7 | Deterministic | Published |
| 04 | Claude Sonnet 4.6 (claude-sonnet-4-6) | 98.8 | 3 | 7 / 7 | Deterministic | Published |
| 05 | GPT-4o v1.2 (gpt-4o-2024-08-06) | 99.2 | 1 | 7 / 7 | Deterministic | Published |
AAL-D-001 · n=250 · seven asset classes · scored by deterministic engine with per-case tolerances · 3,750 scored observations across five evaluations · detection range 98.8–99.6% (v1.2 prompt) · full per-dimension results in the research portal.
We don't just benchmark models. We benchmark them on the work your desk actually does.
General leaderboards measure academic capability. AAL-D-001 measures trade-confirmation exception handling under production conditions.
Eight principles.
Typography is the brand.
Information presented with precision builds more trust than decoration. We let the work speak.
Data before decoration.
Every element earns its place by carrying information. Visual complexity that adds no meaning is removed.
Whitespace is confidence.
Density signals anxiety. Clarity signals command. We optimize for the reader, not the page.
Motion is subtle.
Animation that calls attention to itself is a distraction. Interfaces move only when movement carries meaning.
Every page is printable.
If content can't stand without interactive chrome, we reconsider the content.
Components earn their place.
We don't add UI because it looks standard. We add it because the reader needs it.
Consistency builds trust.
Predictable patterns lower cognitive load. Every result is read the same way as the last.
Simplicity scales.
Simple principles outlast clever systems. We optimize for reproducibility and extension.
Custom benchmarks and private briefings for institutional operations and risk teams.
Contact for access