Mar 2025 · AI in product · 4 min read

Three Questions Before You Greenlight an AI Feature

Most AI features fail in production not because the model was wrong, but because nobody asked the right questions at scoping.

There's a pattern to AI features that fail in production. The demo looked clean. The model performed. The UX was smooth enough. Then it shipped, and within two months someone was writing a postmortem.

When you read back through the spec, the failure was always there. Not in the model choice or the prompt engineering or the data pipeline. In the questions nobody asked at scoping. The questions that felt premature at the time, or that got deferred to 'we'll figure it out in iteration.'

Here are three questions I now treat as mandatory gates before a greenlight. If you can't answer all three before writing a single line of the spec, the feature isn't ready to be scoped.

Does it remove a step the user is already doing?

This sounds obvious, but most AI features don't pass it. A feature that removes a step the user already performs manually has a ready-made user: they already do the thing, just slowly or inconsistently. A feature that creates a new behavior — something the user wasn't already doing — is asking the user to learn a new action and trust an AI with it simultaneously. That's a much steeper adoption curve.

Here's a worked example. Imagine a GenAI summarizer that turns raw, messy feedback into review-ready prose. It works because managers were already doing the condensation step in their head before typing — the AI just makes the artifact match what they'd have written anyway. Same output, less time. Adoption is fast because the value proposition is felt immediately on first use.
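
To make that concrete, here's a minimal sketch of such a summarizer, assuming an OpenAI-style chat API. The model name, prompt wording, and summarize_feedback helper are all illustrative assumptions, not a reference implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_feedback(raw_notes: str) -> str:
    """Condense raw feedback notes into review-ready prose.

    This mirrors the condensation step the manager already does in
    their head: same output, less typing.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any capable model works
        messages=[
            {
                "role": "system",
                "content": (
                    "Condense the user's raw feedback notes into two or "
                    "three sentences of neutral, review-ready prose. Keep "
                    "every concrete fact; drop filler and repetition."
                ),
            },
            {"role": "user", "content": raw_notes},
        ],
    )
    return response.choices[0].message.content
```

The interesting part isn't the call. It's that the function maps one existing artifact (raw notes) to one existing artifact (the prose the manager would have written anyway). No new behavior is asked of anyone.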

Contrast that with an AI feature that suggests 'things you didn't know to look for' in your data. That's adding a new step. You now have to review AI suggestions for patterns you had no prior intent to find, decide which matter, and figure out what to do with them. The feature isn't saving you work — it's creating work while calling it intelligence. Users who weren't already doing that analysis don't suddenly start because an AI surfaced it.

If the AI feature doesn't shorten a path the user is already on, it's a demo, not a feature.

What's the accuracy threshold — and does your model hit it?

Different flows have different tolerance for error, and those tolerances vary by an order of magnitude. Financial reconciliation needs 99%+. Legal document drafting probably needs 95%+. A first-draft email or a content summarization tool can survive 75% with a good correction loop. The number matters before the model matters.

What usually happens: the PM picks a model based on the demo. The demo uses clean, curated data. The threshold question is never explicitly asked. Engineering builds the integration. QA tests against similarly curated data. The feature ships. Production data is messier than demo data — it always is. Accuracy drops. Now there's a quality debate that should have been a scoping decision.

The sequence should be: define the accuracy threshold that makes the feature genuinely useful versus just technically present, test the candidate model against production-representative data before greenlighting, and document the threshold in the spec so that accuracy regressions have a clear reference point. If the model doesn't hit the threshold on production-like data, that's not a launch problem — that's a go/no-go gate at scoping.
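
Here's a minimal sketch of what that gate could look like, reusing the earlier threshold numbers as illustrative placeholders; labeled_samples and predict_fn stand in for whatever your evaluation stack actually provides.

```python
import random

# Illustrative thresholds (see above) -- the real ones belong in the spec.
ACCURACY_THRESHOLDS = {
    "financial_reconciliation": 0.99,
    "legal_drafting": 0.95,
    "draft_email": 0.75,
}

def greenlight(flow: str, labeled_samples: list, predict_fn, sample_size: int = 500) -> bool:
    """Go/no-go gate at scoping: test the candidate model against
    production-representative data and compare to the documented threshold.

    labeled_samples: (input, expected_output) pairs drawn from real
    production traffic, not the curated demo set.
    predict_fn: wraps the candidate model.
    """
    threshold = ACCURACY_THRESHOLDS[flow]
    sample = random.sample(labeled_samples, min(sample_size, len(labeled_samples)))
    correct = sum(1 for x, expected in sample if predict_fn(x) == expected)
    accuracy = correct / len(sample)
    print(f"{flow}: accuracy {accuracy:.1%} against threshold {threshold:.0%}")
    return accuracy >= threshold  # False means no greenlight, not "ship and iterate"
```

Exact-match scoring is the crudest possible metric; for generative output you'd swap in a rubric or a judge model. The gate logic is the point: no threshold met, no greenlight.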

What does the user see when it's wrong?

The fallback is a UX question, not an engineering question, and it belongs in the feature spec — not the post-launch bug backlog. Most AI specs ignore it entirely. The spec shows what happens when the AI is correct. The fallback is an afterthought.

A bad fallback looks like this: the model outputs something wrong, an error state fires, the user sees a toast that says 'Something went wrong. Please try again.' The flow breaks. The user loses context. They either retry, give up, or go manual. None of those outcomes are designed — they just happen.

A good fallback is designed, not stumbled into. The model exposes its confidence alongside its output. The user can see what the AI generated, understand how confident the model is, and correct it with low friction. The correction is captured. If your system learns from corrections, the correction actually improves future outputs. If it doesn't, at least the user has a smooth path back to manual.
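
As a sketch of the data shape this implies, assuming the model (or a calibration layer in front of it) exposes a usable confidence score; the field names and the 0.7 floor are illustrative assumptions.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.7  # illustrative; below this, show a draft, not an answer

@dataclass
class AISuggestion:
    """What the UI receives: never a bare string, always output plus confidence."""
    output: str
    confidence: float  # assumes the model or a calibration layer provides this

@dataclass
class CorrectionEvent:
    """Captured when the user edits a wrong output, without leaving the flow."""
    original: str
    corrected: str
    confidence: float  # confidence at the time of the error, useful for eval

def render(suggestion: AISuggestion) -> str:
    # Low confidence renders as an editable draft, not an accepted answer.
    if suggestion.confidence < CONFIDENCE_FLOOR:
        return f"[draft, please review] {suggestion.output}"
    return suggestion.output

def on_user_edit(suggestion: AISuggestion, corrected_text: str) -> CorrectionEvent:
    # The correction is captured, not discarded: route it to your eval set,
    # fine-tuning queue, or at minimum a log someone reads.
    return CorrectionEvent(suggestion.output, corrected_text, suggestion.confidence)
```

The specific structures don't matter. What matters is that the wrong-output path has first-class types at all, instead of a generic error toast.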

The test I use: can a user who encounters a wrong output recover without leaving the flow? If the answer is no, the fallback isn't ready, and the feature isn't ready. Design the fallback before you design anything else.

If the fallback is an apologetic toast and a broken flow, the feature isn't ready.