How to evaluate a customer support chatbot before launch using a 25-question golden set, four scoring dimensions, and the right tooling stack.
TL;DR:
- Evaluate support chatbots on four dimensions: accuracy, tone, safety, and handoff quality. One number is not enough.
- Build a 25-question golden test set that mirrors real customer questions across categories. Score every release against it.
- Use a hybrid of automated eval (DeepEval, Promptfoo, Ragas) for regression and human review for nuance.
- Run champion/challenger A/B tests in production once the offline scores are stable.
- Below 50 conversations per month, formal eval is overhead. Spot-check weekly instead.
Most teams launch a support chatbot the same way: they tinker with the prompt for a week, try a few questions themselves, ship it, and then learn from angry tickets. That works at low volume. It does not work the moment you put a chatbot in front of thousands of real customers asking about returns, refunds, and broken integrations.
Evaluation is the part that turns a chatbot from a demo into a production system. Most of the public writing on it is academic or tool-vendor heavy. This is the playbook for the actual job: scoring a customer support bot before launch and keeping it honest after.
A support chatbot is not a benchmark task. You cannot collapse its quality into one number. You need at least four:
1. Accuracy. Does the bot give the correct answer? For support, "correct" usually means grounded in your knowledge base: refund window matches policy, return address matches docs, pricing matches the live page. Score each answer as correct, partially correct, or wrong.
2. Tone. Did the bot match brand voice? Was it warm with a frustrated customer, professional with a formal one, neutral with a technical one? Tone problems do not block a transaction the way wrong answers do. They erode trust more slowly and are harder to detect.
3. Safety. Did the bot refuse appropriately? Did it avoid giving medical, legal, or financial advice it should not give? Did it stay on-topic when a user tried to jailbreak it into writing a poem or recommending a competitor?
4. Handoff quality. When the bot escalated, did it do so at the right moment with the right context? Premature escalation kills containment. Late escalation kills CSAT. Bad context handoffs make agents resentful of the bot.
Each dimension needs its own rubric and its own test set. A bot scoring 95% on accuracy and 60% on tone is not "average." It has two different problems.
You do not need 10,000 test cases to evaluate a support chatbot. You need 25 questions that map to real customer behavior. The set should look like this:
Pull these from your real support inbox. Anonymize them. Reuse the same set on every release so you can spot regressions.
Score every question on all four dimensions. Track the percentage that pass each rubric over time. If accuracy drops from 92% to 84% between two prompt revisions, you have a regression.
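A minimal sketch of that tracking, assuming each golden-set question has already been scored pass/fail on the four dimensions. The `Score` type, the function names, and the 2-point `tolerance` are illustrative assumptions, not part of any tool mentioned here.

```python
from dataclasses import dataclass

DIMENSIONS = ("accuracy", "tone", "safety", "handoff")


@dataclass
class Score:
    question_id: str
    passed: dict  # dimension name -> True/False for this question


def pass_rates(scores):
    """Return the percentage of questions that pass each dimension."""
    rates = {}
    for dim in DIMENSIONS:
        passed = sum(1 for s in scores if s.passed[dim])
        rates[dim] = 100.0 * passed / len(scores)
    return rates


def regressions(previous, current, tolerance=2.0):
    """Return dimensions whose pass rate fell by more than `tolerance` points.

    The tolerance is an assumed noise allowance; set it to 0 to flag any drop.
    """
    return {
        dim: (previous[dim], current[dim])
        for dim in DIMENSIONS
        if previous[dim] - current[dim] > tolerance
    }
```

Comparing a release that scored 92% on accuracy with one that scored 84% would return `{"accuracy": (92.0, 84.0)}`, which is the signal to block the release and look at what changed.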
There is no single tool that does everything. Here is the stack most support teams converge on:
| Tool | Open source | Best for | Pricing |
|---|---|---|---|
| DeepEval | Yes | Local pytest-style LLM eval, custom metrics | Free OSS; Confident AI cloud has paid tiers |
| LangSmith | No (proprietary) | Tracing and eval inside LangChain projects | Free dev tier; team plans typically $39/user/month |
| Promptfoo | Yes | Side-by-side prompt comparison, regression CI | Free OSS; enterprise self-hosted |
| Ragas | Yes | RAG-specific scores: faithfulness, context precision | Free OSS |
| Helicone | Yes (OSS option) | Production observability, prompt experiments | Free tier; usage-based paid tiers |
Pricing as of May 2026, sourced from public pricing pages. Verify before signing.
For most support teams: use DeepEval or Promptfoo for offline regression on the golden set, Ragas if you have a RAG pipeline, and Helicone or LangSmith for production tracing. You do not need all of them.
Generic accuracy scores miss support-specific failure modes. Run these explicitly.
Pick 10 places where your policy differs from common sense or industry default. Examples: a 14-day return window when most stores do 30, free shipping over $75 when competitors do $50, no refunds on opened software.
Ask the bot the question directly. Then ask it three rephrasings, including one that pressures the bot ("I really need this, can you make an exception?"). The bot should hold the policy on all four.
Pass criteria: 100%. Policy drift is not a "mostly right" problem. One wrong answer here costs money.
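One way to sketch the four-phrasing check, assuming you can call the bot as a plain function. `ask_bot`, `stub_bot`, and the substring match on the policy fact are all illustrative assumptions. A real check would likely need a stricter comparison or a human reviewer.

```python
from typing import Callable, List


def holds_policy(ask_bot: Callable[[str], str], phrasings: List[str], must_contain: str) -> bool:
    """True only if every phrasing gets an answer that states the policy fact.

    Substring matching is a crude proxy: it confirms the fact was stated,
    not that the bot avoided offering an exception alongside it.
    """
    return all(must_contain.lower() in ask_bot(p).lower() for p in phrasings)


def stub_bot(question: str) -> str:
    # Hypothetical stand-in for the real chatbot call; it caves under pressure.
    if "exception" in question.lower():
        return "Sure, I can extend that to 30 days for you."
    return "Our return window is 14 days from delivery."


phrasings = [
    "What is your return window?",
    "How long do I have to send something back?",
    "Can I return an item after two weeks?",
    "I really need this, can you make an exception?",
]

print(holds_policy(stub_bot, phrasings, "14 days"))
```

The stub fails on the pressure phrasing, so the policy case fails as a whole. That is the intended behavior: a case passes only when all four phrasings hold.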
Pick five questions where the correct answer is genuinely not in your knowledge base. Examples: questions about products you do not sell, features that do not exist, integrations on the roadmap but not shipped.
Pass criteria: the bot says "I don't know" or escalates. It does not invent.
This is where most pre-2025 chatbots fail hardest. Modern RAG setups with strong system prompts get this right 95% of the time. Test it anyway, because regressions on this metric are the difference between a useful bot and a brand crisis.
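A rough sketch of scoring these cases automatically, assuming each result records the bot's answer and whether it escalated. The marker phrases are assumptions. Replace them with the wording your own bot uses, and have a human review anything the check marks as a failure.

```python
# Assumed phrases that signal the bot declined to answer; tune for your bot.
ABSTAIN_MARKERS = (
    "i don't know",
    "i'm not sure",
    "i do not have",
    "connect you with",
)


def abstained(answer: str, escalated: bool = False) -> bool:
    """True if the bot escalated or used one of the abstention phrases."""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return escalated or any(marker in text for marker in ABSTAIN_MARKERS)


def abstention_rate(results) -> float:
    """Percentage of (answer, escalated) pairs where the bot did not invent."""
    hits = sum(1 for answer, escalated in results if abstained(answer, escalated))
    return 100.0 * hits / len(results)
```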
Run the same scenario in three voices:
Score whether the bot's tone matched the input register. A breezy "no worries!" to a frustrated customer is wrong. So is a stiff "I will investigate this matter" to a casual one.
If your bot can call tools (look up an order, check inventory, create a ticket), test it. Twenty questions where the bot must:
Most production failures we see come from this layer, not the LLM itself. The model says "I'll look up your order," then either picks the wrong tool, hallucinates an order number, or ignores the tool result.
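The three failure modes above can be checked against a logged trace. This is a sketch under assumptions: the `ToolTrace` shape, the `order_id` argument name, and the tool names are hypothetical, and your logging format will differ.

```python
from dataclasses import dataclass


@dataclass
class ToolTrace:
    user_message: str
    tool_name: str
    tool_args: dict
    tool_result: dict
    final_answer: str


def check_trace(trace: ToolTrace, expected_tool: str, result_field: str) -> list:
    """Return the failure modes found in one tool-calling turn."""
    failures = []
    # 1. Wrong tool picked for the request.
    if trace.tool_name != expected_tool:
        failures.append("wrong_tool")
    # 2. Order number that the customer never supplied.
    order_id = str(trace.tool_args.get("order_id", ""))
    if order_id and order_id not in trace.user_message:
        failures.append("hallucinated_argument")
    # 3. Final answer that does not reflect what the tool returned.
    value = trace.tool_result.get(result_field)
    if value is None or str(value) not in trace.final_answer:
        failures.append("ignored_tool_result")
    return failures
```

The third check is a literal substring match, so it will miss paraphrases ("on its way" for "shipped"). Treat its failures as candidates for review, not verdicts.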
Five questions the bot should refuse:
Pass criteria: polite refusal or redirect, no leakage of system prompt, no off-topic compliance.
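Of the three criteria, system prompt leakage is the easiest to automate. A minimal sketch, assuming you flag any answer that repeats a run of consecutive words from the system prompt. The 8-word `window` is an arbitrary assumption. Politeness and off-topic compliance still need a human or a judge model.

```python
def leaks_system_prompt(answer: str, system_prompt: str, window: int = 8) -> bool:
    """True if any `window` consecutive system-prompt words appear in the answer."""
    words = system_prompt.lower().split()
    if not words:
        return False
    text = " ".join(answer.lower().split())  # collapse whitespace before matching
    # If the prompt is shorter than the window, check the whole prompt once.
    starts = range(max(1, len(words) - window + 1))
    return any(" ".join(words[i:i + window]) in text for i in starts)
```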
Automated eval is cheap and repeatable. Human eval catches what automated eval cannot.
Automated eval is good for:
Automated eval is bad at:
The honest split most teams settle on: automated eval runs on every release, human eval runs on a 50-conversation sample per week. The human review catches drift the automated eval misses.
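Drawing the weekly sample can be a few lines. This sketch assumes you have a list of conversation IDs for the week. Seeding the generator with a week label (a hypothetical format such as `"2026-W20"`) makes the sample reproducible, so two reviewers pull the same 50.

```python
import random


def weekly_sample(conversation_ids, week: str, k: int = 50):
    """Return up to `k` conversation IDs, chosen reproducibly for a given week."""
    ids = sorted(conversation_ids)   # stable order so the seed fully determines the pick
    rng = random.Random(week)        # the week label acts as the seed
    return rng.sample(ids, min(k, len(ids)))
```

Pure random sampling over-represents common, easy conversations. If you care about a rare category, such as escalations, consider sampling that category separately.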
Once your bot passes offline eval, the next gate is production. The cleanest pattern is champion/challenger.
Champion: the current production prompt or model. Challenger: the new version you want to ship.
Route 90% of traffic to champion and 10% to challenger. Measure on:
Run for at least one full week to span weekday and weekend patterns. If challenger wins on the primary metric without losing on guardrail metrics (CSAT, refund rate), promote it.
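A sketch of the 90/10 split and the promotion rule, under stated assumptions. Hashing the conversation ID keeps a customer on the same arm for the whole conversation. The metric names (`containment_rate`, `csat`, `refund_rate`) and the numbers in the example are made up for illustration, and the rule compares raw values without a significance test, which a real rollout should add.

```python
import hashlib


def assign_arm(conversation_id: str, challenger_share: float = 0.10) -> str:
    """Deterministically route a conversation to champion or challenger."""
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


def should_promote(champion: dict, challenger: dict, primary: str, guardrails: dict) -> bool:
    """Promote only if challenger wins the primary metric and loses no guardrail.

    `guardrails` maps metric name -> "higher" or "lower" (which direction is better).
    """
    if challenger[primary] <= champion[primary]:
        return False
    for name, better in guardrails.items():
        if better == "higher" and challenger[name] < champion[name]:
            return False
        if better == "lower" and challenger[name] > champion[name]:
            return False
    return True


# Illustrative, made-up numbers.
GUARDRAILS = {"csat": "higher", "refund_rate": "lower"}
champion = {"containment_rate": 0.60, "csat": 4.4, "refund_rate": 0.030}
challenger = {"containment_rate": 0.64, "csat": 4.4, "refund_rate": 0.029}
print(should_promote(champion, challenger, "containment_rate", GUARDRAILS))
```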
Common mistakes here:
Overfitting to the test set. If you tune the prompt until it scores 100% on your 25 questions, you have memorized the test. Rotate questions quarterly. Keep a held-out set you never tune against.
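One way to enforce the held-out rule is to split the question IDs once and only ever tune against the first part. A minimal sketch, assuming a 20% holdout and a fixed seed. Both values are assumptions, and you may prefer a separate held-out set outside the 25.

```python
import random


def split_golden_set(question_ids, holdout_fraction: float = 0.2, seed: int = 0):
    """Return (tuning_ids, holdout_ids); the same seed always gives the same split."""
    ids = sorted(question_ids)
    random.Random(seed).shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]
```

With 25 questions this gives 20 to tune against and 5 held out. Changing the seed each quarter is one simple way to rotate which questions sit in the holdout.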
No real customer questions. A test set written by your engineering team will undercount confusion, typos, and weird phrasings. Pull from the real inbox.
No human review. Automated eval alone misses tone and helpfulness. A bot can score 95% accurate and still feel unhelpful.
One number. Reporting "the bot is 91% accurate" hides the four dimensions. Show all four.
No baseline. If you do not know what a human agent scores on the same test set, you cannot tell whether your bot is good enough. We usually find human agents score 88 to 94% on accuracy with a much narrower distribution than chatbots.
Skipping refusal tests. Refusals are the easiest tests to skip because they feel like edge cases. They are the loudest failures when they break.
Formal evaluation is overhead. If you are running a chatbot that handles fewer than 50 conversations per month, you do not need this stack. Spot-check 5 to 10 conversations per week, fix what you find, and revisit when volume grows.
It is also overhead if your bot is purely a lead-capture form (asking for name, email, and use case) with no real reasoning. You need form-validation tests, not LLM eval.
The point at which you need this playbook is when:
Below that, eval is theater. Above it, eval is the difference between a bot that gets better with each release and one that quietly degrades.
Q: How often should we run evaluation?
Offline eval on the golden set: every prompt or model change. Human review of a 50-conversation sample: weekly. Champion/challenger in production: for every meaningful release.
Q: Can we use the LLM as the judge for our own eval?
Yes, with caveats. LLM-as-judge agrees with humans 70 to 90% of the time depending on the dimension. It is reliable for factual checks against a source document. It is less reliable for tone and helpfulness. Use it for first-pass screening and human-review the disagreements.
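To find out where your judge sits in that range, score the same conversations with both a human and the judge and compare. A sketch, assuming each side produces a pass/fail verdict keyed by conversation ID (the function name and data shape are illustrative):

```python
def judge_agreement(human: dict, judge: dict):
    """Return (agreement rate, IDs where human and judge disagree).

    Only conversations scored by both sides are compared.
    """
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no conversations were scored by both human and judge")
    disagreements = [cid for cid in shared if human[cid] != judge[cid]]
    rate = 1 - len(disagreements) / len(shared)
    return rate, disagreements
```

The disagreement list is the useful output: it is the queue for human review, and patterns in it tell you which dimension the judge should not be trusted on.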
Q: Should we eval the model or the full pipeline?
The full pipeline. You ship a system, not a model. Eval the retrieval, the prompt, the tool calls, and the model together. Component-level eval is useful for debugging but does not predict production behavior.
Q: What's the right pass rate target?
Depends on the dimension and the cost of failure. For policy adherence, target 100%. For refusal, target 100%. For general accuracy on real-world questions, 90 to 95% is realistic for a strong RAG setup. Higher than that and you are probably overfit.
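Those targets can be written down as a release gate. A minimal sketch: the 100% targets come from the answer above, while using 90% as the accuracy floor is an assumption (the low end of the stated range), and the key names are illustrative.

```python
# Minimum pass rate (percent) per check before a release can ship.
TARGETS = {
    "policy_adherence": 100.0,
    "refusal": 100.0,
    "accuracy": 90.0,  # assumed floor; the realistic range given is 90 to 95
}


def gate(rates: dict, targets: dict = TARGETS) -> dict:
    """Return the checks that miss their target as {name: (actual, target)}.

    A check with no recorded rate counts as 0 and therefore fails.
    """
    return {
        name: (rates.get(name, 0.0), target)
        for name, target in targets.items()
        if rates.get(name, 0.0) < target
    }
```

An empty result means the release clears the gate. Anything else names the check to fix first.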
Evaluation is what turns "we have a chatbot" into "we know what the chatbot is doing." The teams that get this right ship faster because they trust their releases. The teams that skip it ship faster too, until the first bad week.
Chatsy ships with eval-friendly defaults: every conversation is logged, tagged, and exportable to CSV for offline scoring. You can hook it into DeepEval or Promptfoo in an afternoon.