AAL-RS-004 · Published

Prompt Sensitivity Analysis: v1.0 vs v1.1 on AAL-D-001

Evaluations·Jul 2026·v1.0 → v1.1

Escalation gap (v1.0): Pro 75.7% vs Flash 94.8%

Escalation gap (v1.1): Pro 99.0% vs Flash 98.1%

Exposure accuracy (v1.0): ~73% both models

Exposure accuracy (v1.1): ~76% both models

Prompt fix size: 1 paragraph

Cases: 250 × 3 runs × 2 models × 2 versions

A comparative analysis of Gemini 2.5 Pro and Flash performance across two prompt versions: v1.0 (baseline) and v1.1 (with explicit exposure calculation rules and escalation policy). The analysis was motivated by an unexpectedly large escalation accuracy gap in the v1.0 results. The findings show that benchmark specification quality matters as much as model selection: a missing policy paragraph created an artificial 19-point gap. They also confirm that exposure arithmetic is a genuine, unresolved model weakness.

What changed between v1.0 and v1.1

Two additions: an explicit exposure calculation rule (notional × absolute rate difference for price breaks; notional difference for quantity breaks) and an escalation policy paragraph clarifying that nearly all exception types require escalation, with commission discrepancies and likely duplicates as the primary exceptions. Nothing else in the prompt template changed.

Finding 1: The escalation gap was entirely a specification artifact

Pro's v1.0 escalation accuracy was 75.7% against Flash's 94.8%, a 19-point gap. Dataset analysis showed the ground truth rule was nearly categorical by exception type, but this rule was absent from the prompt. Pro, being more conservative about inferring unstated institutional conventions, under-escalated; Flash happened to guess closer to the implied rule. With explicit policy guidance, Pro jumped to 99.0% and Flash to 98.1%, with confidence intervals overlapping completely. The v1.0 gap measured prompt underspecification, not model capability.
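The report does not say how the confidence intervals were computed. As a rough sketch, the overlap can be checked with 95% Wilson score intervals, assuming (as a simplification) that the 250 cases × 3 runs behave like 750 independent trials per model and version. The function names and the choice of Wilson intervals are illustrative assumptions, not the AAL method.

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion p_hat observed over n trials."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def overlaps(a, b):
    """True when two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

N = 750  # assumption: 250 cases x 3 runs treated as independent trials

# Escalation accuracies as reported in the summary above.
v10_pro, v10_flash = wilson_interval(0.757, N), wilson_interval(0.948, N)
v11_pro, v11_flash = wilson_interval(0.990, N), wilson_interval(0.981, N)

print("v1.0 intervals overlap:", overlaps(v10_pro, v10_flash))
print("v1.1 intervals overlap:", overlaps(v11_pro, v11_flash))
```

Under these assumptions the v1.0 intervals are disjoint and the v1.1 intervals overlap. Repeated runs on the same case are unlikely to be fully independent, so the true intervals are probably wider than this sketch suggests.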

Finding 2: Exposure arithmetic is a genuine, persistent limitation

Despite adding the exact calculation formula, exposure accuracy improved by only 2–3 points for both models, settling at ~75–77%. Case inspection reveals systematic wrong answers on FX forward rate breaks. For a USD/KRW NDF with $8M notional and a 0.05 pip rate difference, the v1.1 rule gives 8,000,000 × 0.05 = $400,000, yet both models returned $290.28 across all three runs. The formula was in the prompt; the models still computed incorrectly. This is not a specification gap. It is a genuine limitation in financial arithmetic reasoning that survives explicit instruction.
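For reference, the v1.1 exposure rule (notional × absolute rate difference for price breaks; notional difference for quantity breaks) can be sketched in a few lines and applied to the NDF case above. The function names and the use of `Decimal` are assumptions made for illustration; this is not the benchmark's scoring code.

```python
from decimal import Decimal

def price_break_exposure(notional, rate_diff):
    """Price break: notional times the absolute rate difference."""
    return notional * abs(rate_diff)

def quantity_break_exposure(notional_a, notional_b):
    """Quantity break: absolute difference between the two notionals."""
    return abs(notional_a - notional_b)

# USD/KRW NDF case from the text: $8M notional, 0.05 rate difference.
expected = price_break_exposure(Decimal("8000000"), Decimal("0.05"))

# Value both models returned across all three runs, per the text.
model_answer = Decimal("290.28")

# Relative error of the returned value against the rule's result.
relative_error = abs(model_answer - expected) / expected

print("expected exposure:", expected)
print("relative error of model answer:", relative_error)
```

The returned value is off by more than 99.9% of the correct exposure, which points to a wrong computation and not to a rounding or tolerance issue.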

Methodological implication

Any benchmark that does not test prompt sensitivity risks attributing specification gaps to model capability differences. The AAL methodology of iterating prompts and publishing both versions is designed to make this distinction explicit. v1.0 and v1.1 results are not directly comparable and are published separately to preserve the methodological record.
