AAL-RS-003 · Published

Gemini 2.5 Flash on AAL-D-001

Evaluations · Jul 2026 · gemini-2.5-flash

Detection (95% CI): 99.2% [98.3%, 99.6%]
Category accuracy: 90.2%
Field accuracy: 91.4%
Exposure accuracy: 72.9%
Dual-exception recall: 100.0%
Escalation accuracy: 94.8% (v1.0) → 98.1% (v1.1)
False-positive rate: 0.0%
Runs: 3
Errors: 0 / 750

This is the fourth complete model evaluation on AAL-D-001. Gemini 2.5 Flash was run three times across all 250 cases (750 calls in total) using the same multi-run methodology as the Pro evaluation. In v1.0, Flash matches Pro on detection, classification, and dual-exception recall and shows higher escalation accuracy; that difference was subsequently traced to prompt specification rather than model capability. In v1.1 both models converge to near-identical performance across all metrics.

Protocol

Three independent runs per case at temperature 0, with a 4-second sleep between cases to manage rate limits. The dataset, prompt template, and deterministic scorer are identical to those used in all prior evaluations. Wilson 95% confidence intervals are reported on all pooled figures.
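The protocol can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation's actual harness: `run_protocol`, `ask`, and `score` are hypothetical names standing in for the real model call and deterministic scorer, and `wilson_interval` is the standard Wilson score formula with z = 1.96. The count of 744 successes out of 750 pooled trials in the demo is an assumption; the report does not state the raw counts, but that pair is consistent with the reported 99.2% detection figure and its interval.

```python
import math
import time
from typing import Callable, Sequence, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def run_protocol(
    cases: Sequence,
    ask: Callable,      # hypothetical: sends one case to the model at temperature 0
    score: Callable,    # hypothetical: deterministic scorer, returns True/False
    runs: int = 3,
    sleep_s: float = 4.0,
) -> Tuple[int, int]:
    """Run every case `runs` times and return pooled (successes, trials)."""
    successes = 0
    trials = 0
    for _ in range(runs):
        for case in cases:
            answer = ask(case)
            successes += bool(score(case, answer))
            trials += 1
            time.sleep(sleep_s)  # inter-case pause to stay under rate limits
    return successes, trials


# Demo with assumed counts (744 of 750 is not stated in the report).
low, high = wilson_interval(744, 750)
print(f"{744 / 750:.1%} [{low:.1%}, {high:.1%}]")
```

Pooling the three runs before computing the interval treats all 750 trials as independent. Repeated runs of the same case at temperature 0 are likely to be correlated, so the interval is best read as a lower bound on the true uncertainty.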

Flash vs Pro: not a meaningful capability gap

In v1.0, Flash showed 94.8% escalation accuracy versus Pro's 75.7%, which looked like a substantial advantage. The v1.1 prompt revision revealed this to be a specification artifact: Flash happened to guess closer to the ground-truth escalation convention, while Pro reasoned more conservatively from the available information. With explicit policy guidance, both models reach roughly 98 to 99% escalation accuracy (98.1% for Flash) and are statistically indistinguishable across all seven primary metrics.

Practical implication

Flash delivers accuracy equivalent to Pro's on this task at substantially lower cost. For production screening and triage use cases, there is no accuracy-based argument for preferring Pro. The weakness in exposure arithmetic (~73% accuracy) is shared by both models and is the primary remaining gap.
