Feb 2025 · AI in product · 4 min read

Evals Matter When You Have Stakes

Build evals when wrong outputs cost money, customers, or trust. Skip them when you can fix a bad output in five minutes.

There's a fashion in 2025 of building elaborate eval pipelines for every AI feature. 71 evals. CI gates on every push. LLM-as-judge for fuzzy quality scoring. Dashboards tracking score drift over time.

For a production feature serving 100,000 users per day, this is appropriate. Wrong outputs at that scale cost real money and damage trust at a pace no team can manually fix.

For a side project, a portfolio chatbot, or a feature serving 50 users a day, the same eval infrastructure is theater. The maintenance cost of the eval pipeline exceeds the cost of bad outputs. The eval pipeline is an artifact that signals seriousness, not a tool that prevents harm.

When evals are worth it

Three conditions, ranked by importance.

**Stakes.** A bad output costs the business something real — revenue lost, customer churned, lawsuit triggered, brand damaged. If your worst-case bad output is 'a user sees a slightly weird answer and moves on,' evals are overkill. If your worst-case is 'a doctor follows wrong AI guidance,' evals are necessary.

**Volume.** Above 1,000 users per day, you cannot manually monitor every output. Evals fill the role manual review can't play at that volume. Below 100 users per day, manual review by the PM or engineer is faster and cheaper than an eval pipeline.

**Iteration speed.** If you change the system prompt or model frequently, evals catch regressions before they reach users. If you change the prompt once a quarter, you can manually re-test on a small set each time without needing a CI pipeline.

Meet all three? Build the eval pipeline. Meet none? Don't.

Evals are not virtue. Evals are a response to scale and stakes. If you have neither, you're cargo-culting.

The eval starter set

If you meet the conditions and need evals, the starter set is much smaller than people think. You don't need 71 evals on day one. You need 10.

Five of the 10 are 'must produce X' assertions. These are the boring ones: the model must mention a specific entity, must stay under a word count, must avoid a forbidden phrase. Cheap to write, cheap to run.
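For illustration, here's a minimal sketch of what these look like as plain predicate functions. The entity name, word limit, and forbidden phrase are all hypothetical placeholders, not recommendations:

```python
# A 'must produce X' eval is just a predicate over the model's output.
# "AcmeBot", the 150-word limit, and the forbidden phrase are made up --
# swap in your own feature's requirements.

def must_mention_entity(output: str) -> bool:
    # The model must mention the product by name.
    return "AcmeBot" in output

def must_stay_under_word_count(output: str) -> bool:
    # The model must stay under 150 words.
    return len(output.split()) <= 150

def must_avoid_forbidden_phrase(output: str) -> bool:
    # The model must never make this claim.
    return "guaranteed results" not in output.lower()
```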

Three of the 10 are 'must not produce Y' assertions. The model must not refuse a legitimate query, must not hallucinate a metric, must not break persona.
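Same shape, assertion inverted. A sketch, with hypothetical refusal markers and a deliberately crude regex standing in for hallucination detection; persona breaks are harder to assert directly, which is what the judge evals below are for:

```python
import re

# Hypothetical refusal phrases -- tune these to your model's actual
# refusal style.
REFUSAL_MARKERS = ("i can't help", "i'm unable to", "i cannot assist")

def must_not_refuse(output: str) -> bool:
    # The model must not refuse a legitimate query.
    lowered = output.lower()
    return not any(marker in lowered for marker in REFUSAL_MARKERS)

def must_not_invent_metric(output: str) -> bool:
    # Crude check: this hypothetical feature never reports dollar
    # figures, so any "$<digit>" in the output is a hallucinated metric.
    return not re.search(r"\$\d", output)
```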

Two of the 10 are LLM-as-judge. The model's output is scored by another model on a single dimension — usually 'persona consistency' or 'factual grounding.' These are expensive to run; that's why you only have two.
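A sketch of what a judge eval can look like, assuming the OpenAI Python SDK as the stack; the judge prompt, model choice, and pass threshold are all illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative judge prompt: score one dimension, reply with one integer.
JUDGE_PROMPT = """Score the ASSISTANT OUTPUT from 1 to 5 on persona
consistency with this spec: friendly, concise, never salesy.
Reply with a single integer and nothing else.

ASSISTANT OUTPUT:
{output}"""

def judge_persona(output: str, threshold: int = 4) -> bool:
    # One extra model call per eval case -- this is the expensive part.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(output=output)}],
        temperature=0,
    )
    score = int(response.choices[0].message.content.strip())
    return score >= threshold
```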

This 10-eval set will catch roughly 80% of the regressions the full 71-eval set would catch. The other 61 evals exist for completeness and brand signaling, not for marginal harm reduction. You can always add more later.

The danger of premature evals

Building evals before you have stakes wastes time twice. First, you spend the time building them. Second, you keep maintaining them forever, even after the underlying feature has changed.

Maintained evals that no longer match the feature's behavior produce false alarms. The team sees 'eval failed' and rolls back, but the eval was the thing that broke (the spec changed and the eval didn't). After a few of these, the team stops trusting the evals. Now you have a maintained pipeline that nobody trusts. Worst of all worlds.

Start small. Start late. Evals are a response to a problem you don't yet have. Build them when you have the problem. Resist the temptation to build them as a sign of seriousness.