Production AI Demands a Degradation Strategy
Models drift, providers throttle, prompts break. If your feature has no plan for what to do when the AI is unavailable, you don't have a production feature.
The first time your AI feature breaks in production, three things happen at once. The provider throttles. A prompt change in last week's deploy didn't propagate to a config file. A new edge case starts hitting the model and produces nonsense. Each individually is a small problem. Together, they wedge the feature.
If you didn't plan for any of this, your feature now shows users an error state designed for a transient 500 — except this isn't transient. It's been broken for two hours. Customers are complaining on Twitter. Your engineer is paged at midnight.
A degradation strategy is what you wrote before this happened. It says: when the AI is unavailable, here's exactly what the user sees, what they can do, and how long they wait. The presence of this document is the difference between a one-hour incident and a four-day fire.
Plan for three levels. Most teams skip all three.
**Level 1: Provider rate-limited or slow.** The AI is up, but responses are taking 8x the normal time. Users wait. The right behavior here depends on the feature. Sometimes you fall back to a smaller/cheaper model. Sometimes you queue and tell the user 'higher than normal load, this will take ~30 seconds.' Never just show a spinner forever.
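Here is a minimal sketch of that tiered behavior, assuming a hypothetical async `client` whose `complete` and `enqueue` methods stand in for your provider SDK; the timeout and model names are illustrative, not recommendations:

```python
import asyncio

PRIMARY_MODEL = "primary-large"    # hypothetical model names
FALLBACK_MODEL = "fallback-small"

async def complete_with_fallback(prompt: str, client) -> dict:
    """Try the primary model under a hard deadline, then a cheaper
    model, then queue the job. Never leave the user on a spinner."""
    for model, status in ((PRIMARY_MODEL, "ok"), (FALLBACK_MODEL, "degraded")):
        try:
            # A real client would also catch its rate-limit exception here.
            text = await asyncio.wait_for(
                client.complete(model=model, prompt=prompt),
                timeout=5.0,
            )
            return {"status": status, "text": text}
        except asyncio.TimeoutError:
            continue
    # Last resort for Level 1: queue the job and set expectations.
    job_id = await client.enqueue(prompt)
    return {
        "status": "queued",
        "job_id": job_id,
        "message": "Higher than normal load, this will take ~30 seconds.",
    }
```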
**Level 2: Provider down for hours.** You can't get any AI output. Now the feature has to gracefully reveal what it is when the AI is removed. For a summarizer, maybe the user sees the raw input and an 'AI summary is currently unavailable' notice with a prompt for manual review. For an autocomplete, maybe it just turns off invisibly. The user shouldn't see broken UI. They should see a version of the feature that works without AI, or a clear notice that the AI part is paused.
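One way to make that switch mechanical rather than ad hoc is a small circuit breaker in front of the AI path. This is a sketch, and the threshold and cooldown values are assumptions to tune:

```python
import time

class AiCircuitBreaker:
    """Trips after repeated provider failures. While open, callers
    render the manual version of the feature instead of broken UI."""

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 300.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker tripped

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def ai_available(self) -> bool:
        if self.opened_at is None:
            return True
        # After the cooldown, let one probe request through to test recovery.
        return time.monotonic() - self.opened_at > self.cooldown_s
```

The summarizer's view code then becomes a single branch: if `ai_available()` is false, render the raw input with the 'unavailable' notice instead of calling the provider at all.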
**Level 3: AI is up but giving nonsense.** Model degradation isn't always a clean outage. Sometimes the model starts hallucinating, looping, or producing offensive content. You need automated detection — a moderation layer or a confidence floor — that pulls bad outputs before they reach users. This is the hardest level to design because you have to detect the problem yourself. The provider won't tell you.
Level 3 is the level that destroys trust. The model is 'up' but wrong, and you didn't catch it.
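Detection here means gating every output before it ships. A minimal sketch of such a gate, with heuristics and thresholds that are purely illustrative and would need tuning against your real traffic:

```python
import re

# Hypothetical examples of canned phrases that should never reach users.
REFUSAL_PATTERNS = [
    r"as an ai language model",
    r"i (cannot|can't) help with",
]

def passes_output_gate(text: str, mean_logprob: float | None = None) -> bool:
    """Reject outputs that look degenerate before they reach users."""
    # Confidence floor: a very low mean token logprob often signals junk.
    if mean_logprob is not None and mean_logprob < -2.5:
        return False
    # Degenerate looping: long output built from very few distinct words.
    words = text.lower().split()
    if len(words) > 20 and len(set(words)) / len(words) < 0.2:
        return False
    # Refusal phrases leaking into a product surface.
    if any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS):
        return False
    return True
```

When the gate rejects an output, you fall through to the Level 1 or Level 2 behavior you already built: retry, fall back, or show the non-AI version.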
When the feature degrades, who finds out first? Three possibilities.
1. Your monitoring catches it. Best case. You page yourself. You fix or roll back before users notice.
2. Customer success catches it. They see a wave of tickets and tell you. You've already had hundreds of bad user interactions, but you fix it before it goes viral.
3. Twitter catches it. Worst case. The screenshot is going around. Now you're playing defense.
The gap between case 1 and case 3 is purely about whether you instrumented the feature with the right alarms. Most teams instrument latency and error rate. Almost none instrument output-quality signals: confidence drift, response length distribution, semantic similarity to expected output, presence of refusal phrases. These are the early indicators that something is wrong with the model itself. Build them.
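As a starting point, a rolling monitor over two of those signals (response length drift and refusal rate) might look like the sketch below; the window size and alert thresholds are assumptions, not recommendations:

```python
from collections import deque
import statistics

class OutputQualityMonitor:
    """Rolling windows over output-quality signals, so you can alarm
    on model drift rather than only on HTTP errors."""

    def __init__(self, window: int = 500):
        self.lengths = deque(maxlen=window)
        self.refusals = deque(maxlen=window)

    def observe(self, text: str, is_refusal: bool) -> None:
        self.lengths.append(len(text))
        self.refusals.append(1 if is_refusal else 0)

    def alarms(self, baseline_mean: float, baseline_stdev: float) -> list[str]:
        fired = []
        if len(self.lengths) >= 100:
            mean = statistics.mean(self.lengths)
            # Length distribution drifting more than 3 sigma off baseline.
            if abs(mean - baseline_mean) > 3 * baseline_stdev:
                fired.append("response_length_drift")
        if self.refusals and sum(self.refusals) / len(self.refusals) > 0.05:
            fired.append("refusal_rate_high")
        return fired
```

Wire `alarms()` into whatever pages you today for error rate; the point is that these signals put you in case 1 instead of letting case 3 find you.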
The deepest layer of a degradation strategy is the contract you implicitly make with the user. Are you promising AI as a feature, or AI as an enhancement?
If you promise AI as a feature: when the AI is down, the feature is down. The user expects this. You commit to a strict SLA, you eat the cost of running fallback models, and you make sure outages are short. The price of this contract is operational cost.
If you promise AI as an enhancement: when the AI is down, the manual flow still works. The user can still complete their job, just without the AI's help. The product is more resilient. The price is that the AI version has to feel like an enhancement, not the core experience.
Most products start out promising the first contract and operationally fulfilling the second. That mismatch is what produces angry support tickets. Pick one. Build to it. Communicate it clearly.