AI Chatbot Pilot Program: A 4-Week Plan That

TL;DR:

Most AI pilots fail because of scope creep, no success criteria, wrong stakeholders, and no rollback plan. Around 80% of AI projects fail to deliver against goals, according to RAND's 2024 study on AI project failure.

The four-week plan: Week 1 scope and metrics, Week 2 build and offline test, Week 3 soft launch to 5 to 10% of traffic, Week 4 measure and decide.

Pick the narrowest possible use case. "Order status questions only" is a real pilot. "AI support" is not.

Define a kill criterion before you start. If the bot misses containment by more than 15 points, you shut it down. You do not redesign it mid-pilot.

A "pilot" can be a real test or a way to look busy. The difference is whether you set up the conditions for an honest answer at the end. Most pilot programs do not, which is why the industry quietly produces a long tail of half-deployed chatbots that nobody owns.

This is the four-week plan that gets to a yes or no. It is opinionated. The point of an opinionated plan is that you can finish it.

Why pilots fail

RAND's 2024 study on AI project failure put the failure rate around 80%, roughly twice the rate for IT projects that do not involve AI. The pattern across post-mortems is consistent. Pilots fail because of at least one of four problems:

Scope creep. The pilot starts as "answer order status questions" and ends as "be a full support agent across five channels in three languages." The team runs out of time before they finish anything.

No success criteria. The pilot launches without a specific metric to hit. At the end, the team argues about whether 32% containment is good or bad because nobody wrote down the target.

Wrong stakeholders. Engineering builds the bot. Support never agrees to its handoff rules. Marketing never agrees to its tone. By Week 4 there is no internal champion who will defend the result.

No rollback plan. The bot underperforms, but it is already on the main support page and nobody owns turning it off. So it stays, half-broken, eroding trust in the team that built it.

Every step of this plan exists to neutralize one of those failure modes.

The 4-week pilot at a glance

Week	Goals	Specific actions	Decision point
1	Scope and metrics	Pick one use case, write success criteria, build 25-question test set, pick deployment surface	Sign-off from support lead + exec sponsor
2	Build and offline test	Ship the bot, train on KB, run test set, set escalation rules	Pass rate on test set above target
3	Soft launch	Deploy to 5 to 10% of traffic, monitor daily, fix top 3 failure modes	Containment trending toward target
4	Measure and decide	Compare actuals vs criteria, score CSAT, decide scale or kill	Go / iterate / kill decision in writing

Each week has a hard end-of-week artifact. If you cannot produce the artifact, you do not move on. This is how the plan resists scope creep.

Week 1: Scope and metrics

The most important week. Most pilots fail in Week 1 even though nobody notices until Week 4.

Define success metrics

Write down three numbers before you write any code.

Containment rate target. What percentage of conversations should the bot resolve without human handoff? Be honest. For a narrow use case (order status only), 60 to 75% is realistic. For a broad scope, expect 30 to 50% in the first month.
CSAT target. Most teams target a CSAT of 4.0 or higher on a 5-point scale, or 80% positive on thumbs up/down. Below 3.5 is a fail regardless of containment.
Time-to-resolution target. How long should the average resolved conversation take? Two minutes is a reasonable starting target for transactional use cases.

Write these in a doc. Get sign-off from the support lead and the exec sponsor. This is the document you will read on the Friday of Week 4.

Pick the narrowest possible use case

This is where pilots die. The instinct is to build "an AI support agent." The discipline is to build "an order status bot."

A good Week 1 use case looks like this:

One question type or one closely-related cluster (returns, order status, account login resets)
Pulling from one or two knowledge sources (a help center section, a status API)
Resolvable in under four turns
A clear handoff target when the bot cannot resolve

Bad Week 1 use cases:

"Replace tier 1 support"
"Answer anything on the website"
"Handle billing, returns, and technical questions"

A narrow scope is not a humble scope. It is the only scope that finishes on time.

Identify the 25-question golden test set

Pull 25 real questions from your support inbox that match the scope. Anonymize them. Include 10 common questions, 5 edge cases, 5 policy-adherence questions, 3 questions where the answer is genuinely not available (the bot should refuse), and 2 with a hostile or confused tone.
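If you keep the test set in a spreadsheet or JSON file, a small check like the one below can confirm the mix matches the split described above. This is a minimal sketch: the category labels and the `category` field are illustrative names, not part of any platform's format.

```python
from collections import Counter

# Quotas from the 25-question split above. The labels are hypothetical names.
QUOTAS = {
    "common": 10,
    "edge_case": 5,
    "policy": 5,
    "unanswerable": 3,        # the bot should refuse these
    "hostile_or_confused": 2,
}


def composition_gaps(test_set: list[dict]) -> dict[str, int]:
    """Return {category: actual - expected} for every category that is off quota."""
    counts = Counter(question["category"] for question in test_set)
    gaps = {}
    for category in set(QUOTAS) | set(counts):
        diff = counts.get(category, 0) - QUOTAS.get(category, 0)
        if diff != 0:
            gaps[category] = diff
    return gaps
```

An empty result means the set has exactly 25 questions in the intended proportions. A negative number means a category is short, and a positive one means it is over quota or not a known category.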

You will use this same test set every week for the rest of the pilot.

Pick the deployment surface

Decide where the bot will live in Week 3 before you build it. Options, ranked from lower risk to higher risk:

1. Help center page only
2. Internal-only test (your own team is the user)
3. Specific product pages (e.g., checkout)
4. Site-wide widget on 5 to 10% of sessions
5. Main support page

Most pilots should pick option 1 or 3. The main support page is for after the pilot.

End-of-week-1 artifact: A one-page brief listing the use case, the three success metrics with targets, the 25-question test set, the deployment surface, and the kill criterion. Signed by the support lead and the exec sponsor.

Week 2: Build and offline test

Now you build. The constraint is that the bot has to work on the test set before it sees a real customer.

Build the bot

Configure the platform, load the knowledge base, write the system prompt. For a narrow use case this is two to four days of work, not two weeks.

Resist the urge to add features. A pilot bot does not need 15 integrations. It needs to be correct on the use case you scoped.

Run the 25-question test set

Score every question on four dimensions: accuracy, tone, safety, handoff quality. (For the full eval methodology, see our chatbot evaluation guide.)

Pass criteria before Week 3:

Accuracy: at least 90%
Tone: at least 85%
Safety: 100% (a bot that fails a safety test does not ship)
Handoff quality: at least 80%

If you do not pass, you iterate. You do not move to Week 3 with a failing offline score and hope production fixes it. It will not.
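The gate can be checked mechanically once every question has been scored. The sketch below assumes each question is graded pass or fail on each of the four dimensions, which is one possible scoring scheme and not necessarily the one your eval process uses. The dictionary keys are illustrative.

```python
# Thresholds from the pass criteria above, expressed as fractions.
THRESHOLDS = {"accuracy": 0.90, "tone": 0.85, "safety": 1.00, "handoff": 0.80}


def pass_rates(results: list[dict[str, bool]]) -> dict[str, float]:
    """Fraction of questions that passed on each dimension."""
    total = len(results)
    return {dim: sum(r[dim] for r in results) / total for dim in THRESHOLDS}


def failing_dimensions(results: list[dict[str, bool]]) -> list[str]:
    """Names of the dimensions below threshold; an empty list means the gate is passed."""
    rates = pass_rates(results)
    return sorted(dim for dim, rate in rates.items() if rate < THRESHOLDS[dim])
```

With 25 questions, the thresholds translate into whole-question minimums: 23 correct for accuracy (22 of 25 is only 88%), 22 for tone, all 25 for safety, and 20 for handoff quality.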

Set up handoff rules

Define when the bot escalates and to whom. The defaults that usually work:

Escalate on any explicit user request ("I want to talk to a human")
Escalate after two failed retrieval attempts
Escalate on detected frustration signals
Escalate on anything outside the scoped use case

Make sure agents know the pilot is happening, what scope the bot covers, and how to spot escalations.
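Written as code, the four default rules might look like the sketch below. Everything here is an assumption for illustration: the phrase list, the field names, and the idea that frustration and scope are detected upstream by whatever signals your platform provides.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative phrases only. A real deployment would need a broader list or a classifier.
HUMAN_REQUEST_PHRASES = ("talk to a human", "speak to a person", "real agent")


@dataclass
class TurnState:
    user_message: str
    failed_retrievals: int       # retrieval attempts so far that found nothing usable
    frustration_detected: bool   # output of a hypothetical frustration signal
    in_scope: bool               # output of a hypothetical scope check


def escalation_reason(state: TurnState) -> Optional[str]:
    """Return why the bot should hand off, or None if it should keep going."""
    text = state.user_message.lower()
    if any(phrase in text for phrase in HUMAN_REQUEST_PHRASES):
        return "user_request"
    if not state.in_scope:
        return "out_of_scope"
    if state.failed_retrievals >= 2:
        return "retrieval_failed"
    if state.frustration_detected:
        return "frustration"
    return None
```

Returning a reason instead of a bare true or false makes the daily review in Week 3 easier, because each escalation can be grouped by the rule that triggered it.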

End-of-week-2 artifact: Test set scores, the system prompt and KB sources documented, the escalation rules written down, an agent briefing complete.

Week 3: Soft launch

The real test. Move from theory to production carefully.

Deploy to 5 to 10% of traffic

Do not deploy to 100% of the surface on Day 1 of Week 3. Use URL targeting (specific pages only), session-based sampling (every tenth session), or a percentage rollout (most chatbot platforms have this built in).

The point of the limited rollout is that you can roll back instantly when something goes wrong. And something will go wrong.
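If your platform does not offer a percentage rollout, session-based sampling is simple to approximate yourself. The sketch below hashes a session ID into one of 100 buckets. The function name, the salt, and the assumption that you have a stable session ID are all illustrative.

```python
import hashlib


def in_pilot(session_id: str, rollout_percent: int, salt: str = "pilot-v1") -> bool:
    """Deterministically decide whether a session sees the bot.

    The same session always lands in the same bucket, so a visitor does not
    flip between bot and no bot on every page load.
    """
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # a bucket from 0 to 99
    return bucket < rollout_percent
```

Setting `rollout_percent` to 0 is the instant rollback: no session qualifies. Raising it from 5 to 10 keeps every session that was already in the pilot and adds new ones, because the buckets do not change.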

Monitor real conversations daily

Every morning, read 20 to 30 conversations from the previous day. Look for:

Wrong answers (the bot said something inaccurate)
Bad escalations (the bot escalated when it should have answered, or kept going when it should have escalated)
Tone misses (the bot was breezy with a frustrated customer)
Unscoped questions (customers asking about things outside the pilot scope)

The first week of real traffic always surprises you. The phrasings you did not test, the typos, the questions that pattern-match to your scope but are actually different. Track every surprise in a doc.

Fix the top 3 failure modes

By Wednesday of Week 3, you should have a list of failure modes ranked by frequency. Fix the top three. Do not try to fix all of them. The fixes are usually:

Adding missing KB content
Adjusting the system prompt to handle a specific phrasing
Tightening or loosening an escalation rule

Re-run the 25-question test set after each fix to make sure you have not regressed.
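Ranking by frequency is a one-liner if the review doc is kept as structured rows. This sketch assumes each reviewed conversation is logged with a `failure_mode` label, left empty when nothing went wrong. The field name and the labels are hypothetical.

```python
from collections import Counter


def top_failure_modes(review_log: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent failure modes with their counts."""
    counts = Counter(
        entry["failure_mode"] for entry in review_log if entry.get("failure_mode")
    )
    return counts.most_common(n)
```

Anything outside the returned top three goes on a list for after the pilot, not into this week's fixes.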

End-of-week-3 artifact: A list of every conversation reviewed, the top failure modes ranked, the fixes shipped, and the latest test set scores.

Week 4: Measure and decide

This is the week the pilot exists for. Do not let it become "Week 3 part 2."

Compare actuals against criteria

On Monday of Week 4, pull the numbers:

Containment rate over the full Week 3
CSAT from post-conversation surveys
Time-to-resolution
Cost per conversation
Escalation rate and quality

Put them next to the targets you wrote down in Week 1. Do not negotiate with the targets. The point of writing them down was to remove the temptation to move the goalposts.
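A small script over the exported conversations keeps the Monday numbers reproducible. The sketch below is a simplification under stated assumptions: it treats any conversation without a human handoff as contained, which overstates containment if customers abandon the chat unresolved, and the record fields are illustrative, not a real export format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class Conversation:
    escalated: bool               # True if the conversation was handed to a human
    csat: Optional[int]           # 1 to 5 survey score, None if the survey was skipped
    resolution_seconds: float     # time from first message to close


def pilot_actuals(conversations: list[Conversation]) -> dict:
    """Compute the three primary Week 1 metrics from Week 3 conversations."""
    contained = [c for c in conversations if not c.escalated]
    rated = [c.csat for c in conversations if c.csat is not None]
    return {
        "containment_pct": 100 * len(contained) / len(conversations),
        "csat": mean(rated) if rated else None,
        "avg_resolution_seconds": (
            mean(c.resolution_seconds for c in contained) if contained else None
        ),
    }
```

Time-to-resolution is averaged over contained conversations only, matching the Week 1 definition of "the average resolved conversation."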

Decide: scale, iterate, or kill

Three outcomes are possible.

Scale. You hit or exceeded all three primary metrics. Write the rollout plan: which surfaces next, which scopes next, who owns it after the pilot. Get the rollout plan signed off the same week.

Iterate. You hit one or two metrics and missed the rest, but not badly enough to trigger a kill. Iterate is the most dangerous outcome because it has no natural end. Set a hard date (no more than four additional weeks) and a new gate. If you miss again, kill.

Kill. You missed two or three primary metrics by significant margins, or you failed the kill criterion. Turn the bot off the same day. Write the post-mortem the same week.

The post-mortem is not punishment, it is the most valuable artifact of a failed pilot. Most second pilots succeed because the first failure taught the team what to scope differently.
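Writing the decision rule down as code before Week 4 is one way to make it hard to argue with later. The sketch below is one possible encoding, not a standard. In particular, `significant_margin` is an assumption: the plan says "significant margins" without defining them, so you would set that number per metric in Week 1.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    target: float
    actual: float
    higher_is_better: bool = True
    significant_margin: float = 0.0  # assumption: a miss larger than this is "significant"

    @property
    def shortfall(self) -> float:
        """How far the actual fell short of target; zero or negative means the target was hit."""
        if self.higher_is_better:
            return self.target - self.actual
        return self.actual - self.target


def decide(metrics: list[Metric], kill_criterion_failed: bool) -> str:
    """Return 'scale', 'iterate', or 'kill' for the three primary metrics."""
    misses = [m for m in metrics if m.shortfall > 0]
    significant = [m for m in misses if m.shortfall > m.significant_margin]
    if kill_criterion_failed or len(significant) >= 2:
        return "kill"
    if not misses:
        return "scale"
    return "iterate"
```

For the kill criterion used in this plan, the flag would be `containment.shortfall > 15`, where `containment` is the containment `Metric` measured in percentage points.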

End-of-week-4 artifact: A written decision (scale, iterate, or kill) with the supporting numbers, the rollout or shutdown plan, and the next pilot scope if applicable.

Pilot vs production cost expectations

Pilot costs are different from production costs. In Week 3, with limited traffic, your monthly bill might be $50 to $500 depending on the platform. At full scale, the same bot might cost $2,000 to $10,000 per month. Run the unit-economics math during Week 4, not in production.

A rough rule for support chatbots: target cost per resolved conversation below 20% of the cost of resolving the same conversation with a human. For most teams, a human-resolved support ticket runs $5 to $15 in fully-loaded agent time. That gives a budget of roughly $1 to $3 per AI conversation, well above what most modern platforms charge.
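The arithmetic behind that rule is short enough to keep in the Week 4 doc. A minimal sketch, with hypothetical function names:

```python
def ai_cost_budget(human_cost_per_ticket: float, ratio: float = 0.20) -> float:
    """Maximum acceptable cost per resolved AI conversation under the 20% rule."""
    return human_cost_per_ticket * ratio


def cost_per_resolved(monthly_bill: float, resolved_conversations: int) -> float:
    """Actual platform cost per conversation the bot resolved without a human."""
    if resolved_conversations <= 0:
        raise ValueError("need at least one resolved conversation")
    return monthly_bill / resolved_conversations
```

At the $5 and $15 ends of the human-cost range, `ai_cost_budget` gives $1 and $3. Compare that against `cost_per_resolved` using your projected full-scale bill and volume, not the Week 3 numbers, since pilot traffic is too small to reflect production pricing.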

Not for you

There are two pilots that should not happen.

The vanity pilot. Leadership wants to say "we have AI." Nobody owns the metrics. The bot launches to a quiet corner of the site and nobody mentions it again. This is not a pilot, it is a press release. Kill it now and save the budget.

The delay tactic. "Let's pilot it for a quarter" usually means "let's not commit yet." If you find your pilot is being repeatedly extended without new evidence, the answer is not another extension. The answer is to call the question.

A real pilot has a scope, a metric, a deadline, and a kill switch. If yours does not, fix that before you spend another week on it.

FAQ

Q: Why do around 80% of AI projects fail?

The most-cited number, around 80% according to RAND's 2024 research, captures projects that do not deliver against their stated goals. The root causes cluster around the ones in this article: scope creep, missing success criteria, no exec sponsorship, and no honest decision point.

Q: Four weeks seems short. Can we do an 8-week pilot?

Yes, but the structure is the same. Add the weeks to Week 2 (build) or Week 3 (soft launch), not to Week 1 or Week 4. Longer scoping does not improve outcomes. Longer measurement usually does not either.

Q: What if our success criteria turn out to be wrong?

Acknowledge it in writing during Week 4. "We targeted 70% containment, hit 52%, and we now think 50% containment with 4.3 CSAT is a better target than 70% containment with 3.8." Then run a second pilot against the new criteria. Do not silently move the targets mid-pilot.

Q: Who owns the pilot?

One named person, full stop. Usually a support lead with engineering and product as supporting players. If three people share ownership, nobody owns it.

Run the four weeks and get an answer

A pilot is not a marketing exercise. It is a structured way to find out whether a chatbot is worth scaling at your company. Done well, it takes four weeks and produces a clear yes or no. Done poorly, it produces a permanent half-deployment that costs the team trust.

Chatsy supports the soft-launch step natively: URL targeting, percentage rollouts, conversation export for offline review. Start a free trial or see pricing and run your pilot this month.
