Benchmark specification gaps can masquerade as model capability differences
The v1.0 evaluation showed Gemini 2.5 Pro at 75.7% escalation accuracy versus Flash at 94.8%, a 19-point gap that looked like a meaningful capability difference. The gap traced entirely to a missing escalation policy in the prompt: ground truth treated nearly all exception types as requiring escalation, but the prompt gave no guidance on this convention. Pro, reasoning conservatively from the available information, under-escalated; Flash happened to infer a closer approximation of the unstated rule. One paragraph of explicit policy guidance in v1.1 closed the gap to 0.9 points (99.0% vs 98.1%), with confidence intervals fully overlapping. The finding is methodological, and its implications extend beyond AAL benchmarks: a benchmark whose ground truth encodes an unstated convention can produce score gaps of this kind.
The mechanism
The ground-truth field escalation_required was effectively categorical by exception type. This is a rule that any experienced operations analyst would know implicitly, but it was never stated in the benchmark prompt, so models had to infer it from context. Pro inferred conservatively (low apparent severity → no escalation); Flash inferred more aggressively. Neither behavior was wrong given the information provided. The benchmark was measuring inference about unstated rules, not escalation judgment.
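The mechanism can be illustrated with a toy sketch. Everything here is hypothetical: the exception type names, the severity field, the five cases, and both policies are invented for illustration and do not reproduce the AAL data or either model's actual behavior. The sketch only shows how a severity-based policy loses accuracy against a ground truth that is categorical by exception type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    exception_type: str  # hypothetical type label
    severity: str        # hypothetical apparent severity: "low" or "high"


# Assumption: the unstated convention is "these types always escalate".
ESCALATE_TYPES = {"TYPE_A", "TYPE_B", "TYPE_C"}


def ground_truth(case: Case) -> bool:
    """Categorical rule: escalation depends only on the exception type."""
    return case.exception_type in ESCALATE_TYPES


def severity_policy(case: Case) -> bool:
    """Conservative inference: escalate only when severity looks high."""
    return case.severity == "high"


def type_policy(case: Case) -> bool:
    """Policy that happens to match the unstated categorical rule."""
    return case.exception_type in ESCALATE_TYPES


def accuracy(policy, cases) -> float:
    """Fraction of cases where the policy agrees with ground truth."""
    hits = sum(policy(c) == ground_truth(c) for c in cases)
    return hits / len(cases)


# Toy cases, invented for illustration only.
CASES = [
    Case("TYPE_A", "high"),
    Case("TYPE_A", "low"),
    Case("TYPE_B", "low"),
    Case("TYPE_C", "low"),
    Case("TYPE_D", "low"),  # the only type that does not escalate
]

severity_acc = accuracy(severity_policy, CASES)  # 0.4
type_acc = accuracy(type_policy, CASES)          # 1.0
```

In this toy setup the severity-based policy is penalized on every low-severity case of an always-escalate type, even though its reasoning is defensible when the rule is not stated.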
Implication for benchmark design
Published benchmark scores should include prompt sensitivity testing rather than a single-prompt result. A model that scores poorly on an underspecified benchmark may be more cautious and less prone to hallucinating conventions, a property that looks like a weakness in evaluation but is a strength in deployment. AAL will test prompt sensitivity as a standard part of its evaluation methodology going forward.
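One way prompt sensitivity testing could be reported is sketched below. This is a minimal sketch under stated assumptions, not AAL's methodology: the variant names, the toy outcomes, and the choice of a Wilson score interval at z = 1.96 are all assumptions made for illustration. It computes accuracy and a confidence interval per prompt variant, plus the spread between the best and worst variant.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a, b) -> bool:
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


def sensitivity_report(results):
    """Per-variant accuracy and interval, plus the max-min accuracy spread.

    `results` maps a prompt-variant name to a list of booleans, one per
    benchmark item, where True means the model's answer matched ground truth.
    """
    report = {}
    for variant, outcomes in results.items():
        n = len(outcomes)
        k = sum(outcomes)
        report[variant] = {"accuracy": k / n, "ci": wilson_interval(k, n)}
    accuracies = [entry["accuracy"] for entry in report.values()]
    return report, max(accuracies) - min(accuracies)


# Toy outcomes, invented for illustration; variant names are hypothetical.
TOY_RESULTS = {
    "no_policy": [True, True, True, False],
    "explicit_policy": [True, True, True, True],
}

report, spread = sensitivity_report(TOY_RESULTS)
```

A large spread across variants for one model suggests the score depends on prompt wording rather than on the capability being measured. Overlapping intervals between two models under a fully specified prompt are consistent with no detectable difference, though interval overlap is a coarse check and not a formal significance test.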