Apr 2025 · AI in product · 4 min read

Why Your AI Accuracy Benchmark Is Dishonest

Models always score higher on curated test sets than on production data. If you're not testing on production-shaped inputs, your benchmark is a press release.

Every AI demo has a number attached. 92% accuracy. 95% precision. 4.5/5 user satisfaction. The number sells the demo. The number is almost always wrong.

Not wrong because the team is lying. Wrong because the test set was curated. The model performed well on the inputs the team selected to evaluate it. The team didn't intentionally cherry-pick — but the inputs they chose were 'representative' in the sense that they were clean, well-formed examples of the use case. Production inputs are messier in a hundred ways nobody documents.

Where the gap comes from

Test sets are filtered through human attention. The team running the eval picked inputs that exercised the feature in interesting ways. Edge cases got included, but only the edge cases the team could imagine. The unimaginable edge cases — the typos, the partial inputs, the user behaviors nobody anticipated — aren't in the eval set because the team didn't think to put them there.

Production has all of those. Users paste half a paragraph and a stray URL. They submit before they're done. They type in two languages. They use the feature in ways the team never imagined. The model wasn't tested on any of this. The model's behavior on this traffic is unknown, even if the curated benchmark says 92%.

92% on a curated test set tells you almost nothing about production behavior. It tells you the team built a benchmark.

The fix: production-shaped evals

Two structural moves.

First, sample evals from real production data, not from curated examples. Once a week, randomly sample 50 production inputs that hit your feature. Run them through the model. Score the outputs. That number is your real accuracy. It will be lower than the curated benchmark, usually by 10-20 percentage points. That gap is the lie your curated benchmark was telling you.
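Here's a minimal sketch of that weekly loop, assuming you can pull raw inputs from your logs. The names are illustrative: `run_model` and `score_output` stand in for whatever your stack uses to call the model and grade an output.

```python
import random

def weekly_production_eval(production_inputs, run_model, score_output,
                           sample_size=50, seed=None):
    """Score the model on a random sample of real production inputs.

    production_inputs: raw inputs captured from the live feature this week
    run_model:         callable returning the model's output for one input
    score_output:      callable returning a score (0/1 or 0..1) for one output
    """
    rng = random.Random(seed)
    sample = rng.sample(production_inputs, min(sample_size, len(production_inputs)))
    scores = [score_output(inp, run_model(inp)) for inp in sample]
    # This, not the curated benchmark, is your real accuracy.
    return sum(scores) / len(scores)
```

How you capture inputs and score outputs depends on your stack; the point is that the sample is random and comes from live traffic, not from anyone's judgment about what's representative.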

Second, structure your eval set to include the categories of failure you actually see in support tickets. If 30% of tickets mention the model misreading short inputs, your eval set should have 30% short inputs. Match the eval distribution to the production distribution. If you don't, you're optimizing for a use case that doesn't exist.
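A sketch of what that distribution matching could look like, assuming a `categorize` function that labels inputs the same way you bucket support tickets (short, multi-language, pasted fragment, and so on). Everything here is a placeholder, not a prescription:

```python
import random

def build_matched_eval_set(production_inputs, categorize, eval_size=200, seed=None):
    """Build an eval set whose category mix mirrors production traffic.

    categorize: callable mapping one input to a label such as
                "short", "multi_language", "pasted_fragment", "clean"
    """
    rng = random.Random(seed)
    by_category = {}
    for inp in production_inputs:
        by_category.setdefault(categorize(inp), []).append(inp)

    total = len(production_inputs)
    eval_set = []
    for category, items in by_category.items():
        # Allocate eval slots in proportion to how often this
        # category actually shows up in production.
        quota = round(eval_size * len(items) / total)
        eval_set.extend(rng.sample(items, min(quota, len(items))))
    return eval_set
```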

Doing both lifts your real accuracy over time. Doing neither lets the gap between the press release number and the user-experienced number compound. Eventually, the user-experienced number is so bad that retention craters and no internal report explains why.

When the benchmark gap is OK

There are cases where a curated benchmark is informative. Pre-release evals for new model versions, for example — comparing model A to model B on the same fixed test set is the right thing to do. The fixed test set is doing the job of normalizing the comparison.

What's not OK is reporting that fixed-test-set number as if it represents production performance. They're different measurements. The team should be doing both: model-vs-model comparison on a fixed set, AND ongoing tracking of production-sampled accuracy. The first guides which model to ship. The second tells you whether shipping it is going well.
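One way to keep the two numbers from blurring together is to report them side by side but never under the same label. A rough sketch, with `run_model` and `score_output` again standing in for your own eval harness:

```python
def release_report(model_a, model_b, fixed_test_set, production_sample,
                   run_model, score_output):
    """Report both measurements separately; never conflate them.

    fixed_test_set:    the frozen set used to choose between model versions
    production_sample: this week's random sample of real traffic
    """
    def accuracy(model, inputs):
        scores = [score_output(inp, run_model(model, inp)) for inp in inputs]
        return sum(scores) / len(scores)

    return {
        # Which model to ship: same fixed set, apples to apples.
        "fixed_set_model_a": accuracy(model_a, fixed_test_set),
        "fixed_set_model_b": accuracy(model_b, fixed_test_set),
        # Whether shipping it is going well: production-shaped traffic.
        "production_sampled": accuracy(model_b, production_sample),
    }
```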