Most chatbot pilots stall because there is no scope, no success criteria, and no kill switch. This is the four-week plan that fixes all three.
TL;DR:
- Most AI pilots fail because of scope creep, no success criteria, wrong stakeholders, and no rollback plan. Around 80% of AI projects fail to deliver against goals, according to RAND's 2024 study on AI project failure.
- The four-week plan: Week 1 scope and metrics, Week 2 build and offline test, Week 3 soft launch to 5 to 10% of traffic, Week 4 measure and decide.
- Pick the narrowest possible use case. "Order status questions only" is a real pilot. "AI support" is not.
- Define a kill criterion before you start. If the bot misses its containment target by more than 15 points, you shut it down. You do not redesign it mid-pilot.
A "pilot" can be a real test or a way to look busy. The difference is whether you set up the conditions for an honest answer at the end. Most pilot programs do not, which is why the industry quietly produces a long tail of half-deployed chatbots that nobody owns.
This is the four-week plan that gets to a yes or no. It is opinionated. The point of an opinionated plan is that you can finish it.
RAND's 2024 study on AI project failure put the failure rate around 80%, roughly twice the rate for IT projects that do not involve AI. The pattern across post-mortems is consistent. Pilots fail because of one of four problems:
Scope creep. The pilot starts as "answer order status questions" and ends as "be a full support agent across five channels in three languages." The team runs out of time before they finish anything.
No success criteria. The pilot launches without a specific metric to hit. At the end, the team argues about whether 32% containment is good or bad because nobody wrote down the target.
Wrong stakeholders. Engineering builds the bot. Support never agrees to its handoff rules. Marketing never agrees to its tone. By Week 4 there is no internal champion who will defend the result.
No rollback plan. The bot underperforms, but it is already on the main support page and nobody owns turning it off. So it stays, half-broken, eroding trust in the team that built it.
Every step of this plan exists to neutralize one of those failure modes.
| Week | Goals | Specific actions | Decision point |
|---|---|---|---|
| 1 | Scope and metrics | Pick one use case, write success criteria, build 25-question test set, pick deployment surface | Sign-off from support lead + exec sponsor |
| 2 | Build and offline test | Configure the bot, load the knowledge base, run the test set, set escalation rules | Pass rate on test set above target |
| 3 | Soft launch | Deploy to 5 to 10% of traffic, monitor daily, fix top 3 failure modes | Containment trending toward target |
| 4 | Measure and decide | Compare actuals vs criteria, score CSAT, decide scale or kill | Go / iterate / kill decision in writing |
Each week has a hard end-of-week artifact. If you cannot produce the artifact, you do not move on. This is how the plan resists scope creep.
Week 1 is the most important week. Most pilots fail in Week 1 even though nobody notices until Week 4.
Write down three numbers before you write any code.
Write these in a doc. Get sign-off from the support lead and the exec sponsor. This is the document you will read on the Friday of Week 4.
This is where pilots die. The instinct is to build "an AI support agent." The discipline is to build "an order status bot."
A good Week 1 use case looks like this:
Bad Week 1 use cases:
A narrow scope is not a humble scope. It is the only scope that finishes on time.
Pull 25 real questions from your support inbox that match the scope. Anonymize them. Include 10 common questions, 5 edge cases, 5 policy-adherence questions, 3 questions where the answer is genuinely not available (the bot should refuse), and 2 with a hostile or confused tone.
You will use this same test set every week for the rest of the pilot.
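The mix described above can be checked mechanically before sign-off. The following is a minimal sketch, assuming each question is stored as a dict with a `category` field; the category labels and the `check_test_set` helper are illustrative names, not part of any platform.

```python
from collections import Counter

# Category quotas taken from the text: 10 + 5 + 5 + 3 + 2 = 25.
# The label strings themselves are hypothetical.
REQUIRED_MIX = {
    "common": 10,
    "edge_case": 5,
    "policy": 5,
    "unanswerable": 3,
    "hostile_or_confused": 2,
}


def check_test_set(questions):
    """Return a list of problems; an empty list means the mix matches."""
    counts = Counter(q["category"] for q in questions)
    problems = []
    for category, required in REQUIRED_MIX.items():
        actual = counts.get(category, 0)
        if actual != required:
            problems.append(f"{category}: expected {required}, got {actual}")
    # Flag labels that are not part of the agreed mix (typos, scope drift).
    for category in sorted(set(counts) - set(REQUIRED_MIX)):
        problems.append(f"unknown category: {category}")
    return problems
```

Running the check whenever the test set changes keeps the set from drifting toward easy questions as the pilot goes on.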
Decide where the bot will live in Week 3 before you build it. Options, ranked from lowest risk to highest risk:
Most pilots should pick option 1 or 3. The main support page is for after the pilot.
End-of-week-1 artifact: A one-page brief listing the use case, the three success metrics with targets, the 25-question test set, the deployment surface, and the kill criterion. Signed by the support lead and the exec sponsor.
In Week 2, you build. The constraint is that the bot has to work on the test set before it sees a real customer.
Configure the platform, load the knowledge base, write the system prompt. For a narrow use case this is two to four days of work, not two weeks.
Resist the urge to add features. A pilot bot does not need 15 integrations. It needs to be correct on the use case you scoped.
Score every question on four dimensions: accuracy, tone, safety, handoff quality. (For the full eval methodology, see our chatbot evaluation guide.)
Pass criteria before Week 3:
If you do not pass, you iterate. You do not move to Week 3 with a failing offline score and hope production fixes it. It will not.
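One way to keep this gate honest is to compute a pass rate per dimension and compare it to targets written down in advance. The following is a minimal sketch, assuming each test question is graded pass or fail on the four dimensions; the function names are illustrative, and any target values you pass in are your own, not thresholds recommended here.

```python
# The four scoring dimensions named in the text.
DIMENSIONS = ("accuracy", "tone", "safety", "handoff")


def pass_rates(scores):
    """scores: one dict per question, mapping each dimension to True/False."""
    total = len(scores)
    if total == 0:
        raise ValueError("no scored questions")
    return {d: sum(1 for s in scores if s[d]) / total for d in DIMENSIONS}


def ready_for_week_3(scores, targets):
    """True only if every dimension meets or beats its written target."""
    rates = pass_rates(scores)
    return all(rates[d] >= targets[d] for d in DIMENSIONS)
```

Requiring every dimension to pass, rather than averaging them, prevents a strong accuracy score from hiding a safety or handoff problem.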
Define when the bot escalates and to whom. The defaults that usually work:
Make sure agents know the pilot is happening, what scope the bot covers, and how to spot escalations.
End-of-week-2 artifact: Test set scores, the system prompt and KB sources documented, the escalation rules written down, an agent briefing complete.
Week 3 is the real test. Move from theory to production carefully.
Do not deploy to 100% of the surface on Day 1 of Week 3. Use URL targeting (specific pages only), session-based sampling (every tenth session), or a percentage rollout (most chatbot platforms have this built in).
The point of the limited rollout is that you can roll back instantly when something goes wrong. And something will go wrong.
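If your platform does not offer a percentage rollout, session-based sampling can be approximated with a deterministic hash. The following is a minimal sketch, assuming you have a stable session identifier; `in_pilot` and the `salt` value are hypothetical names for illustration.

```python
import hashlib


def in_pilot(session_id: str, rollout_percent: float, salt: str = "pilot-v1") -> bool:
    """Deterministically assign a session to the pilot.

    The same session_id always gets the same answer, so a visitor does not
    flip between the bot and the old experience on page reloads.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < rollout_percent * 100  # 10% -> buckets 0..999
```

Setting `rollout_percent` to 0 acts as the instant rollback: no session falls into the pilot, and nothing else has to be redeployed.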
Every morning, read 20 to 30 conversations from the previous day. Look for:
The first week of real traffic always surprises you. The phrasings you did not test, the typos, the questions that pattern-match to your scope but are actually different. Track every surprise in a doc.
By Wednesday of Week 3, you should have a list of failure modes ranked by frequency. Fix the top three. Do not try to fix all of them. The fixes are usually:
Re-run the 25-question test set after each fix to make sure you have not regressed.
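Ranking failure modes by frequency is a simple tally over the daily review notes. The following is a minimal sketch, assuming each reviewed conversation is logged as a dict with an optional `failure_mode` label; the field name and labels are illustrative.

```python
from collections import Counter


def top_failure_modes(review_log, n=3):
    """Return the n most frequent failure modes as (label, count) pairs."""
    counts = Counter(
        entry["failure_mode"]
        for entry in review_log
        if entry.get("failure_mode")  # skip conversations with no failure
    )
    return counts.most_common(n)
```

Everything below the top three stays in the doc as a backlog rather than becoming Week 3 work.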
End-of-week-3 artifact: A list of every conversation reviewed, the top failure modes ranked, the fixes shipped, and the latest test set scores.
Week 4 is the week the pilot exists for. Do not let it become "Week 3 part 2."
On Monday of Week 4, pull the numbers:
Put them next to the targets you wrote down in Week 1. Do not negotiate with the targets. The point of writing them down was to remove the temptation to move the goalposts.
Three outcomes are possible.
Scale. You hit or exceeded all three primary metrics. Write the rollout plan: which surfaces next, which scopes next, who owns it after the pilot. Get the rollout plan signed off the same week.
Iterate. You hit one or two of the three metrics and missed the rest, but not by enough to trigger a kill. Iterate is the most dangerous outcome because it has no natural end. Set a hard date (no more than four additional weeks) and a new gate. If you miss again, kill.
Kill. You missed two or three primary metrics by significant margins, or you failed the kill criterion. Turn the bot off the same day. Write the post-mortem the same week.
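The three outcomes can be written as a rule so the Week 4 meeting applies it rather than debates it. The following is a minimal sketch, assuming every metric is "higher is better" and that you defined in Week 1 what counts as a significant miss for each one; the `decide` function, its parameters, and the example metric names are hypothetical.

```python
def decide(results, targets, significant_miss, kill_metric, kill_margin):
    """Return "scale", "iterate", or "kill" from Week 4 numbers.

    results, targets: metric name -> value (higher is better).
    significant_miss: metric name -> shortfall that counts as a big miss.
    kill_metric, kill_margin: the kill criterion written down in Week 1.
    """
    shortfalls = {m: targets[m] - results[m] for m in targets}
    missed = [m for m, gap in shortfalls.items() if gap > 0]
    badly_missed = [m for m in missed if shortfalls[m] >= significant_miss[m]]

    if shortfalls[kill_metric] > kill_margin or len(badly_missed) >= 2:
        return "kill"
    if not missed:
        return "scale"
    return "iterate"


# Example: a 70% containment target with 52% actual is an 18-point miss,
# which exceeds a 15-point kill margin.
example = decide(
    results={"containment": 52, "csat": 4.3},
    targets={"containment": 70, "csat": 4.0},
    significant_miss={"containment": 10, "csat": 0.5},
    kill_metric="containment",
    kill_margin=15,
)
```

The sketch uses two metrics for brevity; the same function works with all three primary metrics from the Week 1 brief.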
The post-mortem is not punishment. It is the most valuable artifact of a failed pilot. Most second pilots succeed because the first failure taught the team what to scope differently.
End-of-week-4 artifact: A written decision (scale, iterate, or kill) with the supporting numbers, the rollout or shutdown plan, and the next pilot scope if applicable.
Pilot costs are different from production costs. In Week 3, with limited traffic, your monthly bill might be $50 to $500 depending on the platform. At full scale, the same bot might cost $2,000 to $10,000 per month. Run the unit-economics math during Week 4, not in production.
A rough rule for support chatbots: target cost per resolved conversation below 20% of the cost of resolving the same conversation with a human. For most teams, a human-resolved support ticket runs $5 to $15 in fully-loaded agent time. That gives a budget of roughly $1 to $3 per AI conversation, well above what most modern platforms charge.
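The rule above is simple arithmetic and worth running on your own numbers. The following is a minimal sketch, assuming you know your fully loaded human cost per ticket and can estimate the full-scale monthly bill and resolved volume; the function names and the figures in the usage lines are illustrative.

```python
def ai_cost_budget(human_cost_per_ticket, target_ratio=0.20):
    """Maximum acceptable AI cost per resolved conversation."""
    return human_cost_per_ticket * target_ratio


def cost_per_resolved(monthly_bill, resolved_conversations):
    """Actual AI cost per conversation the bot resolved on its own."""
    if resolved_conversations <= 0:
        raise ValueError("need at least one resolved conversation")
    return monthly_bill / resolved_conversations


def within_budget(monthly_bill, resolved_conversations, human_cost_per_ticket):
    budget = ai_cost_budget(human_cost_per_ticket)
    return cost_per_resolved(monthly_bill, resolved_conversations) <= budget


# The article's range: a $5 to $15 human ticket gives a $1 to $3 budget.
low_budget = ai_cost_budget(5)
high_budget = ai_cost_budget(15)
```

Divide by resolved conversations, not total conversations: a bot that handles many chats but escalates most of them looks cheap per chat and expensive per resolution.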
There are two pilots that should not happen.
The vanity pilot. Leadership wants to say "we have AI." Nobody owns the metrics. The bot launches to a quiet corner of the site and nobody mentions it again. This is not a pilot, it is a press release. Kill it now and save the budget.
The delay tactic. "Let's pilot it for a quarter" usually means "let's not commit yet." If you find your pilot is being repeatedly extended without new evidence, the answer is not another extension. The answer is to call the question.
A real pilot has a scope, a metric, a deadline, and a kill switch. If yours does not, fix that before you spend another week on it.
Q: Why do around 80% of AI projects fail?
The most-cited number, around 80% according to RAND's 2024 research, captures projects that do not deliver against their stated goals. The root causes cluster around the ones in this article: scope creep, missing success criteria, no exec sponsorship, and no honest decision point.
Q: Four weeks seems short. Can we do an 8-week pilot?
Yes, but the structure is the same. Add the weeks to Week 2 (build) or Week 3 (soft launch), not to Week 1 or Week 4. Longer scoping does not improve outcomes. Longer measurement usually does not either.
Q: What if our success criteria turn out to be wrong?
Acknowledge it in writing during Week 4. "We targeted 70% containment, hit 52%, and we now think 50% containment with 4.3 CSAT is a better target than 70% containment with 3.8." Then run a second pilot against the new criteria. Do not silently move the targets mid-pilot.
Q: Who owns the pilot?
One named person, full stop. Usually a support lead with engineering and product as supporting players. If three people share ownership, nobody owns it.
A pilot is not a marketing exercise. It is a structured way to find out whether a chatbot is worth scaling at your company. Done well, it takes four weeks and produces a clear yes or no. Done poorly, it produces a permanent half-deployment that costs the team trust.
Chatsy supports the soft-launch step natively: URL targeting, percentage rollouts, conversation export for offline review. Start a free trial or see pricing and run your pilot this month.