Voice agents are real and growing fast. Most workflows still belong on chat. Here's the honest map of what voice AI can and can't do today.
TL;DR:
- Voice AI hit production quality in 2025. By May 2026, sub-second latency and natural turn-taking are table stakes.
- The stack is well-defined: Deepgram or AssemblyAI for STT, GPT-5 or Claude for reasoning, ElevenLabs or Cartesia for TTS. Vapi, Retell, and Bland bundle the layers.
- What works: appointment booking, order status, after-hours triage, outbound reminders.
- What doesn't: complex troubleshooting, emotional escalation, regulated workflows without human oversight.
- Cost per call: 8 to 25 cents per minute, plus telephony. Deployments that replace manual calling often pay back within 60 to 90 days.
Voice AI for support was a demo in 2023 and a credible tool by late 2025. In 2026 it's hitting the inflection point where the question stops being "does it sound real" and starts being "does the workflow design hold up." The voices are good. The reasoning is good. The cost is reasonable. What still varies wildly is whether teams set up the use case correctly.
Here's the honest picture of what voice AI can do for support in May 2026, what it can't, and how to pick a stack that doesn't leave you tied to a vendor that's about to get acquired.
Voice agents are three layers in a trench coat:
- Speech-to-text (STT): Converts what the customer says into text the model can read. Leaders in 2026: Deepgram (Nova-3 generation), AssemblyAI, OpenAI Whisper-v4. Sub-300ms streaming latency. Word error rates under 5% on clean audio for English.
- LLM reasoning: The brain. GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Pro are the typical choices. Some platforms offer routing: cheap model for simple intents, expensive for complex turns.
- Text-to-speech (TTS): Generates the audio response. ElevenLabs Flash (sub-100ms first token), Cartesia Sonic, OpenAI's voice models, Play.ht. Voice quality is no longer the bottleneck.
Orchestration platforms bundle these layers and handle telephony, turn-taking, interruptions, and the WebRTC plumbing. The names that matter: Vapi, Retell AI, Bland AI, Poly AI, and Voiceflow.
Each one offers a working voice agent in under a day. The differences show up at scale and at the edges.
Pricing as of May 2026, sourced from public pricing pages. Verify before signing.
| Platform | Pricing model | Latency (first response) | Best for | Notable customers |
|---|---|---|---|---|
| Vapi | ~$0.05 to $0.13 per minute + LLM/TTS pass-through | 700ms to 1.2s | Custom inbound and outbound | Many YC startups; public via case studies |
| Retell AI | ~$0.07 to $0.15 per minute bundled | 600ms to 1s | Inbound support, high-quality conversation | Public case studies in healthcare, real estate |
| Bland AI | ~$0.09 per minute | 900ms to 1.5s | Outbound campaigns, scheduling | Public sales/lead-gen case studies |
| Poly AI | Enterprise (contact sales, typically $0.30+ per minute all-in) | Sub-1s | Large contact centers | Domino's, FedEx (publicly disclosed) |
| Voiceflow | Subscription + usage; voice add-on | 800ms to 1.3s | Teams without engineering | Trivago, Spotify (case studies) |
Pricing structure note: "per minute" rates here are platform fees. You also pay for the LLM (typically $0.005 to $0.02 per minute depending on context size), TTS (ElevenLabs Flash is roughly $0.05 per minute at typical speech rate), and telephony (Twilio inbound is around $0.013 per minute in the US). Total all-in: 8 to 25 cents per minute for most production deployments.
Three categories of workflow are working well in production by mid-2026:
Appointment booking and reminders are the cleanest use case: low-stakes, high-volume, and scripted enough that the model can't go far off.
Working examples:
ROI is measurable and fast. A practice with 200 daily appointments that pays a receptionist to make those calls is spending ~$8,000/month on that one task. A voice agent doing the same work costs $400 to $800/month. Industry deployments at dental and medical chains have published 60 to 90 day payback periods.
A voice agent picks up, identifies intent, and either resolves the question or routes to a human queue.
Working examples:
These workflows replace tier-zero call routing. The voice agent handles the 30 to 50 percent of calls that don't need a human, freeing the queue for harder cases.
Voice agents shine when the alternative is voicemail. A call comes in at 11pm. The agent identifies whether it's urgent (lockout, outage, emergency) and either pages an on-call human or schedules a callback for the morning.
The bar for the agent isn't perfection. The bar is "better than voicemail and a callback the next afternoon." That's a low bar, and the agent clears it.
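The urgent-or-not decision can be sketched in a few lines of Python. This is a simplified illustration, not how any particular platform does it: the keyword list comes from the examples above, the action names are hypothetical, and a production agent would more likely ask the LLM to classify urgency than match strings.

```python
from dataclasses import dataclass

# Urgent cues taken from the examples in the text; extend for your own business.
URGENT_KEYWORDS = ("lockout", "locked out", "outage", "emergency")


@dataclass
class TriageDecision:
    action: str  # hypothetical action names: "page_on_call" or "schedule_callback"
    reason: str


def triage_after_hours(transcript: str) -> TriageDecision:
    """Page a human for urgent calls; otherwise book a morning callback."""
    text = transcript.lower()
    for keyword in URGENT_KEYWORDS:
        if keyword in text:
            return TriageDecision("page_on_call", f"matched urgent keyword: {keyword}")
    return TriageDecision("schedule_callback", "no urgent keyword matched")


print(triage_after_hours("I'm locked out of the building"))
print(triage_after_hours("I have a question about my invoice"))
```

Defaulting to a callback keeps the on-call pager quiet, at the cost of missing urgent calls phrased in ways the list doesn't cover. If a missed emergency is worse than a false page, flip the default.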
The honest list, with reasons.
Voice is harder than chat for diagnostic work. The customer can't paste an error message. The agent can't show a screenshot. Both sides have to hold technical detail in working memory, which falls apart fast over voice.
Examples that fail: "my router has three blinking lights and the app won't connect" or "my dashboard shows error code 4291 when I try to export." A chat-based agent can guide through these step by step with screenshots and links. A voice agent can't.
Emotional escalation covers cases like a frustrated customer cancelling a subscription, a grieving customer dealing with an estate, or a complaint that's been escalating for weeks. Voice agents handle these worse than chat agents because intonation matters and the model's "voice" is still uniformly polite.
What good teams do: train the agent to recognize emotional escalation signals (raised voice, repeated phrases, certain keywords) and route to a human within the first 30 seconds.
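A rough Python sketch of those three signals follows. Everything specific here is an assumption for illustration: the keyword list, the 10 dB "raised voice" margin, and the idea that your platform gives you per-utterance loudness at all.

```python
from collections import Counter

# Hypothetical keyword list; tune it on your own call transcripts.
ESCALATION_KEYWORDS = ("cancel", "complaint", "supervisor", "unacceptable")


def escalation_signals(
    utterances: list[str],
    loudness_db: list[float],
    baseline_db: float,
    raised_margin_db: float = 10.0,  # assumed threshold, not a standard
) -> list[str]:
    """Return which escalation signals fired for the customer's turns so far."""
    signals = []
    normalized = [" ".join(u.lower().split()) for u in utterances]
    if any(keyword in u for u in normalized for keyword in ESCALATION_KEYWORDS):
        signals.append("keyword")
    if any(count >= 2 for count in Counter(normalized).values()):
        signals.append("repeated_phrase")
    if any(level > baseline_db + raised_margin_db for level in loudness_db):
        signals.append("raised_voice")
    return signals


turns = ["I want to cancel", "I want to cancel"]
print(escalation_signals(turns, loudness_db=[60.0, 75.0], baseline_db=60.0))
```

Running this check on every customer turn during the first 30 seconds, and routing to a human as soon as the list is non-empty, matches the behavior described above.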
Regulated workflows cover HIPAA-governed health data, financial advice, legal advice, and anything else where being wrong has compliance consequences. Voice agents can collect information and route to a licensed human. They cannot be the licensed human. Most regulated industries are using voice AI for intake and triage, with humans handling the actual advice.
STT quality on accented English is improving but uneven. Deepgram and AssemblyAI both publish accent-specific WERs that show meaningful gaps for some accents. Non-English: major languages (Spanish, French, German, Mandarin, Japanese, Portuguese) work well. Long-tail languages are still rough.
Test on real audio from your actual customer base before committing. The demo always sounds great. Your customers might sound different.
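One way to run that test: have a human transcribe a sample of your real calls, then score each vendor's transcript against it. Below is a minimal Python sketch of word error rate (WER), computed as word-level edit distance divided by the reference length. The sample sentences are made up, and real evaluations usually normalize punctuation and numbers first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word dropped by the STT
                curr[j - 1] + 1,     # extra word inserted by the STT
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


human = "please reschedule my appointment for friday"
vendor = "please schedule my appointment friday"
print(f"WER: {word_error_rate(human, vendor):.1%}")
```

Score each accent or language group in your customer base separately. An average across all calls can hide the gap you're looking for.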
The single biggest factor in whether a voice agent feels natural is end-to-end latency: the time between the customer finishing their sentence and the agent starting to respond.
The targets:
The components add up:
| Layer | Typical latency |
|---|---|
| STT (streaming, end of utterance detection) | 150ms to 400ms |
| LLM first token | 300ms to 800ms |
| TTS first audio chunk | 80ms to 200ms |
| Network round-trips | 100ms to 300ms |
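Summing the table's ranges gives the end-to-end window. A minimal Python sketch: the layer names are just labels for the table rows, and the 1,000ms budget is an assumption based on the "sub-second" bar in the TL;DR.

```python
# Per-layer latency ranges (low, high) in milliseconds, copied from the table above.
LAYER_LATENCY_MS = {
    "stt_end_of_utterance": (150, 400),
    "llm_first_token": (300, 800),
    "tts_first_audio": (80, 200),
    "network_round_trips": (100, 300),
}

# Assumption: "sub-second" is treated as a 1,000ms budget.
BUDGET_MS = 1000


def latency_window(layers):
    """Return (best_case_ms, worst_case_ms) by summing each layer's range."""
    best = sum(low for low, _ in layers.values())
    worst = sum(high for _, high in layers.values())
    return best, worst


best_ms, worst_ms = latency_window(LAYER_LATENCY_MS)
print(f"best case: {best_ms}ms, worst case: {worst_ms}ms")
print(f"worst case over budget by {max(0, worst_ms - BUDGET_MS)}ms")
```

The best case (630ms) fits under a second. The worst case (1,700ms) does not, and the LLM row has the widest range of the four.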
If your LLM is slow (a reasoning model such as OpenAI's o-series, or any model fed a long context), you blow the budget. Most production voice agents in 2026 use a fast model for the conversational layer (Claude 4.5 Haiku, GPT-5 Mini, Gemini 2.5 Flash) and reserve heavier models for specific complex turns.
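The fast-by-default, heavy-on-demand split can be as simple as a lookup on the detected intent. A hedged Python sketch: the model identifiers are placeholders rather than real API model names, and the set of "complex" intents is hypothetical.

```python
# Placeholder identifiers; substitute the model names your platform exposes.
FAST_MODEL = "fast-conversational-model"
HEAVY_MODEL = "heavy-reasoning-model"

# Hypothetical intents that justify paying the latency cost of the heavy model.
COMPLEX_INTENTS = {"billing_dispute", "multi_step_change"}


def pick_model(intent: str) -> str:
    """Route a turn to the heavy model only when its intent is marked complex."""
    return HEAVY_MODEL if intent in COMPLEX_INTENTS else FAST_MODEL


for intent in ("order_status", "billing_dispute"):
    print(intent, "->", pick_model(intent))
```

Unknown intents fall through to the fast model here, which protects latency. Whether that is the right default depends on how costly a weak answer is in your workflow.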
The thing that separated 2024 voice agents from 2026 ones isn't intelligence. It's the conversational mechanics.
A 2024 agent waited for the customer to fully stop speaking, then started its turn. If the customer paused mid-sentence, the agent talked over them; if the agent waited out a long silence to be safe, the customer wondered if the line had dropped.
A 2026 agent does voice activity detection in real time, predicts when the customer is about to finish, starts generating its response slightly before, and gracefully yields if the customer keeps talking. Interruption handling is bidirectional: the customer can cut the agent off mid-sentence and the agent stops, listens, and responds to the new input.
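The mechanics are easier to see as a small state machine. The Python sketch below is an illustration only: the state names, event names, and logged actions are invented here, and the real work (voice activity detection, end-of-turn prediction, audio playback) is assumed to happen elsewhere and call these handlers.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    THINKING = "thinking"   # reply is being generated, nothing played yet
    SPEAKING = "speaking"   # agent audio is playing


class TurnManager:
    """Tracks whose turn it is and records the action each event triggers."""

    def __init__(self):
        self.state = State.LISTENING
        self.actions = []

    def on_customer_speech_start(self):
        # Barge-in: the customer always wins the floor.
        if self.state is State.SPEAKING:
            self.actions.append("stop_playback")
        elif self.state is State.THINKING:
            self.actions.append("discard_pending_reply")
        self.state = State.LISTENING

    def on_end_of_turn_predicted(self):
        # Start generating slightly before the customer has fully finished.
        if self.state is State.LISTENING:
            self.actions.append("start_generating_reply")
            self.state = State.THINKING

    def on_first_audio_ready(self):
        if self.state is State.THINKING:
            self.actions.append("start_playback")
            self.state = State.SPEAKING

    def on_playback_finished(self):
        if self.state is State.SPEAKING:
            self.state = State.LISTENING


tm = TurnManager()
tm.on_end_of_turn_predicted()   # agent guesses the customer is done
tm.on_customer_speech_start()   # customer keeps talking: agent yields
tm.on_end_of_turn_predicted()
tm.on_first_audio_ready()
tm.on_customer_speech_start()   # customer interrupts mid-sentence
print(tm.actions)
```

The two branches in `on_customer_speech_start` are the bidirectional part: one yields before the agent has said anything, the other cuts off audio that is already playing.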
Vapi, Retell, and Poly all handle this reasonably well out of the box. If you're rolling your own stack with raw STT and TTS, you'll spend weeks getting turn-taking right.
A realistic mid-sized deployment: 500 calls per day, average call length 3 minutes. That's 1,500 minutes per day, or about 45,000 minutes per month.
| Cost line | Per minute | Monthly |
|---|---|---|
| Platform (Vapi or Retell) | $0.08 | $3,600 |
| LLM (GPT-5 or Claude 4.5 Sonnet) | $0.010 | $450 |
| TTS (ElevenLabs Flash) | $0.05 | $2,250 |
| Telephony (Twilio US inbound) | $0.013 | $585 |
| Total | $0.15 | $6,885 |
That's roughly $7K/month for 500 calls/day. Those 45,000 minutes are 750 hours of talk time, which is more than one person can cover, so the fair comparison is a small human team. At $20/hour, human agents handling those calls would cost $13K to $16K/month at typical occupancy rates. The economics work even on the high end of voice AI pricing, as long as the agent actually resolves or routes most calls.
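To rerun the table with your own volumes and rates, the arithmetic fits in a short Python script. The per-minute rates are the ones from the table; the 30-day month is an assumption that matches the 45,000-minute figure.

```python
CALLS_PER_DAY = 500
AVG_CALL_MINUTES = 3
DAYS_PER_MONTH = 30  # assumption: 30-day month

# Per-minute rates in USD, copied from the cost table.
PER_MINUTE_USD = {
    "platform": 0.08,
    "llm": 0.010,
    "tts": 0.05,
    "telephony": 0.013,
}


def monthly_minutes(calls_per_day, avg_call_minutes, days_per_month):
    return calls_per_day * avg_call_minutes * days_per_month


def monthly_costs(per_minute_usd, minutes):
    """Return each cost line's monthly total in dollars, rounded to cents."""
    return {line: round(rate * minutes, 2) for line, rate in per_minute_usd.items()}


minutes = monthly_minutes(CALLS_PER_DAY, AVG_CALL_MINUTES, DAYS_PER_MONTH)
costs = monthly_costs(PER_MINUTE_USD, minutes)
total = round(sum(costs.values()), 2)

for line, usd in costs.items():
    print(f"{line:<10} ${usd:>9,.2f}")
print(f"{'total':<10} ${total:>9,.2f}  ({minutes:,} minutes)")
```

Platform fees and TTS make up most of the bill. The LLM is the smallest line, so switching to a cheaper model saves less than renegotiating the platform rate.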
The breakeven changes if your voice agent resolves only 20 percent of calls and hands the rest to humans. Now you're paying for both. Run the math on your expected resolution rate before committing.
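Here is that math as a Python sketch, under two simplifying assumptions that are not from any vendor: you pay the voice agent on every call, and human cost scales linearly with the share of calls handed off. The $6,885 voice total and the $13K to $16K human range are the figures used above.

```python
VOICE_MONTHLY_USD = 6885  # monthly voice AI total from the cost table


def blended_monthly_cost(voice_usd, human_usd_all_calls, resolution_rate):
    """Voice cost on every call plus human cost on the share the agent hands off."""
    if not 0.0 <= resolution_rate <= 1.0:
        raise ValueError("resolution_rate must be between 0 and 1")
    return voice_usd + (1.0 - resolution_rate) * human_usd_all_calls


def breakeven_resolution_rate(voice_usd, human_usd_all_calls):
    """Resolution rate at which the blended cost equals an all-human team."""
    return voice_usd / human_usd_all_calls


for human_usd in (13_000, 16_000):
    at_20 = blended_monthly_cost(VOICE_MONTHLY_USD, human_usd, 0.20)
    breakeven = breakeven_resolution_rate(VOICE_MONTHLY_USD, human_usd)
    print(f"human ${human_usd:,}: 20% resolution costs ${at_20:,.0f}/month, "
          f"breakeven at {breakeven:.0%}")
```

Under these assumptions, a 20 percent resolution rate costs more than the all-human team it was meant to replace, and the agent needs to resolve roughly 43 to 53 percent of calls before it comes out ahead.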
Voice AI is the wrong tool when:
What is voice customer support? Voice customer support is anything where the customer interacts with your team through audio: phone calls, voice-driven IVR, or voice notes. "Voice AI customer support" specifically means using an AI agent to handle some or all of that conversation, typically replacing or supplementing the work of a human contact center agent.
Is there AI customer service that works on the phone? Yes. As of 2026, voice AI customer service is in production at thousands of companies. Domino's, FedEx, Wendy's, and many healthcare and home services brands have publicly disclosed deployments. The technology hit production quality in 2025 and is now a normal part of the contact center stack.
Will customers know they're talking to an AI? Most can tell within 30 to 60 seconds. The voice quality is excellent, but cadence, perfect pronunciation, and never sounding tired or annoyed are tells. Best practice in 2026 is to disclose: "Hi, I'm an AI assistant. I can help with order status, scheduling, and most common questions. If you need a person, just say 'agent.'" Customers handle disclosure fine. They don't handle deception well.
Is voice AI more expensive than chat AI? Per interaction, yes. At 8 to 25 cents per minute, a 3-minute voice call costs 24 to 75 cents. A 10-turn chat session costs 5 to 20 cents. But voice often resolves faster (one call vs five back-and-forth messages over a day), and the cost per resolved ticket is sometimes lower. Run the comparison on your own data, not on marketing material.
Voice AI in 2026 is not a science project anymore. The tools work. The economics work for high-volume, well-bounded workflows. The mistake to avoid is treating it as a general-purpose support replacement. It's a channel, not a strategy.
Start with one workflow where the call pattern is repetitive and the stakes are bounded: appointment reminders, order status, after-hours triage. Measure resolution rate. Layer in more workflows once the first one is stable. Don't try to replace your tier-one team in one go.
Need a support stack that handles chat first and routes voice intelligently? Try Chatsy free or see pricing.