Independent AI evaluation · Capital markets

We independently evaluate AI for capital markets.

AI Alpha Labs tests frontier models on real trade-operations workflows — scored by a deterministic engine, published with full methodology, and reproducible from the materials we release.

See the AAL-D-001 results · Why independent evaluation

250

Validated benchmark cases (AAL-D-001)

Five published evaluations: GPT-4o, Gemini 2.5 Pro, Gemini 2.5 Flash, Claude Sonnet, and GPT-4o v1.2

97.9%

GPT-4o first-pass detection accuracy

Seven asset classes, cash through derivatives

AAL-D-001: Trade Confirmation Exception Identification · 250 cases

AAL-R-2026-001: Published · v1.1 · Jul 2026

5 EVALUATIONS PUBLISHED: GPT-4o · Gemini Pro · Gemini Flash · Claude Sonnet · GPT-4o v1.2

OPSCORE-AI: capital markets ops copilot

GOVERNANCE: six frameworks frozen at v1.0

EVIDENCE: Beta → Provisional → Confirmed → Published

Why AI Alpha Labs

Not a model vendor. An independent evaluator.

Independent

We don't build the models we score. No commercial incentive to inflate a result — the only product is the evidence.

Reproducible

Every benchmark ships with its dataset, prompts, and a deterministic scorer. Re-run it and get the same number.

Transparent

We publish confidence intervals, variance, and the cases models fail — not just a headline accuracy figure.

Financial-services focused

Benchmarks built on real trade-operations workflows — confirmations, exceptions, settlement — not academic tasks.
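
The Transparent principle above promises confidence intervals alongside accuracy. As a minimal sketch of what such an interval can look like, the Python below computes a Wilson score interval for a detection rate. The choice of interval method, the function name `wilson_interval`, and the example count of 249 correct detections out of 250 cases are assumptions made for illustration; the published methodology may use a different estimator.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return a Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage. Illustrative helper,
    not AAL's published method.
    """
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 249 of 250 cases detected correctly.
low, high = wilson_interval(249, 250)
print(f"detection 95% interval: {low:.3f} to {high:.3f}")
```

With only 250 cases, an interval like this is wide enough that models a fraction of a point apart may not be distinguishable, which is why a headline figure alone can mislead.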

Latest Research

A growing body of evidence.

All research →

AAL-R-2026-001 · Published

Trade Confirmation Exception Identification

The flagship study: can frontier models detect, classify, and quantify settlement exceptions across seven asset classes, scored deterministically?

v1.1 · Jul 2026 · Read →

AAL-D-001 · Dataset

Benchmark Overview & Dataset

250 validated cases pairing counterparty confirmations against internal records, with ground truth and per-case scoring criteria.

250 cases · v1.0 · Explore →

AAL-RS-001 · Result

Benchmark Results — Four Models, Three Labs

Four published evaluations across GPT-4o, Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet. Detection range 97.9–99.6%. Failures documented, not hidden.

4 models · Jul 2026 · View →

AAL-GOV · Frozen

Governance Frameworks

Six framework documents — Standards through Findings — frozen at v1.0 so every published result is reproducible and audit-defensible.

6 documents · v1.0 · Read →

The AAL Operating System

How evidence is built — end to end.

01

Benchmark Engineering

Construct validated cases across asset classes with ground truth and scoring criteria.

Open →

02

Evaluation

Run models blind to ground truth; grade with a deterministic engine, not model judgment.

Open →

03

QA

Two-reviewer ground truth, arithmetic verification, and dataset validation before any score counts.

Open →

04

Evidence

Findings advance Beta → Provisional → Confirmed → Published as evidence accumulates.

Open →

05

Publications

Results released with full methodology, prompts, and rubrics — reproducible by anyone.

Open →
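
The Evidence step describes findings moving through four stages in a fixed order. One way to picture that ladder is an ordered enumeration with a single promotion function. The sketch below is only an illustration: the names `Stage` and `promote` are hypothetical, and the criteria that justify a promotion are not stated on this page, so they are left to the caller.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Evidence stages, in the order given on this page."""
    BETA = 1
    PROVISIONAL = 2
    CONFIRMED = 3
    PUBLISHED = 4


def promote(stage: Stage) -> Stage:
    """Advance a finding exactly one stage; stages cannot be skipped."""
    if stage is Stage.PUBLISHED:
        raise ValueError("Published is the final stage")
    return Stage(stage + 1)


# Walk a hypothetical finding from Beta to Published, one step at a time.
stage = Stage.BETA
while stage is not Stage.PUBLISHED:
    stage = promote(stage)
    print(stage.name.title())
```

Modelling the stages as an ordered type makes it impossible to skip a stage by accident, which fits the idea that a finding advances only as evidence accumulates.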

Benchmark · AAL-D-001

Trade Confirmation Exception Identification.

Dataset & methodology →

#	Model	Model ID	Detection (%)	False neg.	Asset classes	Scoring	Status
01	GPT-4o	gpt-4o-2024-08-06	97.9	1	7 / 7	Deterministic	Published
02	Gemini 2.5 Pro	gemini-2.5-pro	99.6	1	7 / 7	Deterministic	Published
03	Gemini 2.5 Flash	gemini-2.5-flash	99.2	1	7 / 7	Deterministic	Published
04	Claude Sonnet 4.6	claude-sonnet-4-6	98.8	3	7 / 7	Deterministic	Published
05	GPT-4o v1.2	gpt-4o-2024-08-06	99.2	1	7 / 7	Deterministic	Published

AAL-D-001 · n=250 · seven asset classes · scored by deterministic engine with per-case tolerances · 3,750 scored observations across five evaluations · detection range 98.8–99.6% (v1.2 prompt) · full per-dimension results in the research portal.
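
The note above mentions a deterministic engine with per-case tolerances. A minimal sketch of that idea follows: each case carries its own ground truth and tolerance, and grading is a pure comparison, so re-running it gives the same score. All names here (`Case`, `Answer`, `score_case`), the three scored dimensions, and the sample values are assumptions for illustration, not the released scorer. The three dimensions echo the detect, classify, quantify framing of the flagship study; 250 cases × 5 evaluations × 3 dimensions would give the 3,750 observations cited, although the page does not spell out that breakdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    has_exception: bool   # ground truth: does the confirmation break?
    exception_type: str   # ground-truth category, "" if none
    amount: float         # ground-truth discrepancy amount
    tolerance: float      # per-case absolute tolerance on the amount


@dataclass(frozen=True)
class Answer:
    has_exception: bool
    exception_type: str
    amount: float


def score_case(case: Case, answer: Answer) -> dict[str, bool]:
    """Grade one model answer by rule-based comparison only (no model judgment)."""
    return {
        "detect": answer.has_exception == case.has_exception,
        "classify": answer.exception_type == case.exception_type,
        "quantify": abs(answer.amount - case.amount) <= case.tolerance,
    }


# Hypothetical case and answer; the values are made up.
case = Case("demo-001", True, "price_mismatch", 1250.00, 0.01)
answer = Answer(True, "price_mismatch", 1250.004)
print(score_case(case, answer))

# Observation count under the assumed three-dimension breakdown.
print(250 * 5 * 3)
```

Because `score_case` depends only on its inputs, two runs over the same dataset and the same model outputs cannot disagree, which is the property the reproducibility claim rests on.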

Evaluation principle

We don't benchmark models in the abstract. We benchmark them on the work your desk actually does.

General leaderboards measure academic capability. AAL-D-001 measures trade-confirmation exception handling under production conditions.

Philosophy

Eight principles.

Typography is the brand.

Information presented with precision builds more trust than decoration. We let the work speak.

Data before decoration.

Every element earns its place by carrying information. Visual complexity that adds no meaning is removed.

Whitespace is confidence.

Density signals anxiety. Clarity signals command. We optimize for the reader, not the page.

Motion is subtle.

Animation that calls attention to itself is a distraction. Interfaces move only when movement carries meaning.

Every page is printable.

If content can't stand without interactive chrome, we reconsider the content.

Components earn their place.

We don't add UI because it looks standard. We add it because the reader needs it.

Consistency builds trust.

Predictable patterns lower cognitive load. Every result is read the same way as the last.

Simplicity scales.

Simple principles outlast clever systems. We optimize for reproducibility and extension.

Custom benchmarks and private briefings for institutional operations and risk teams.

Contact for access