Claude 4.5 has 200k context. Gemini 2.5 hits 2M. Some teams are killing RAG and stuffing everything in. Here is when that actually works for support.
TL;DR:
- Long-context models in 2026 are big enough to fit most support knowledge bases: Claude Sonnet 4.5 and Opus 4.5 hit 200k tokens, GPT-5 takes 128k input, Gemini 2.5 Pro takes 2M.
- Stuffing a full KB into the prompt is simpler than RAG but costs roughly $0.60 per query at 200k Claude tokens. At 10,000 queries per month that is $6,000.
- RAG stays cheaper, faster (300ms first-token vs 1 to 3 seconds), and easier to update incrementally.
- The right answer is hybrid: long context for recent conversation memory and the active ticket, RAG for the knowledge base.
- Long context wins when your KB is under 100 documents, queries are infrequent, or you are still prototyping.
In 2024 nobody seriously argued you could skip RAG. Context windows were 8k, 32k on a good day. In 2026 the question is real. Claude Opus 4.5 has 200k. Gemini 2.5 Pro has 2 million. GPT-5 has 128k. You can fit a 500-page knowledge base in the prompt and just ask the question.
Some teams have done exactly that and quietly killed their vector database. Some have gotten burned. Here is the honest tradeoff for customer support.
RAG (retrieval-augmented generation): Index your knowledge base into a vector database (Pinecone, pgvector, Weaviate). At query time, find the 5 to 20 most relevant chunks, paste them into the prompt, generate the answer. The model only ever sees a few thousand tokens of context per request.
Long context: Skip retrieval. Paste the entire knowledge base into the prompt every time. The model reads everything and figures out which parts matter. The prompt is 100k to 2M tokens depending on the model and KB size.
Both end up at the same place: a generated answer grounded in your content. The cost, latency, and accuracy profile is wildly different.
| Model | Input context | Output context | Input price per 1M tokens |
|---|---|---|---|
| Claude Opus 4.5 | 200k | 32k | $15 |
| Claude Sonnet 4.5 | 200k | 32k | $3 |
| GPT-5 | 128k | 16k | $5 |
| GPT-5 mini | 128k | 16k | $0.50 |
| Gemini 2.5 Pro | 2M | 8k | $1.25 (under 200k), $2.50 (over) |
| Gemini 2.5 Flash | 1M | 8k | $0.30 |
Prices verified May 2026. They change. Check anthropic.com/pricing, openai.com/api/pricing, and ai.google.dev/pricing before doing budget math.
For context: 200k tokens is roughly 150,000 words, or about a 400-page novel. A typical SaaS knowledge base of 500 help articles averaging 600 words each runs about 300,000 words, which is 400k tokens. Too big for Claude or GPT-5. Comfortable for Gemini 2.5 Pro.
Here is the honest version that gets you fired if you skip it.
Suppose you have a 100,000 token knowledge base (around 75,000 words, or 125 average help articles). Every customer query goes through the model.
Long context approach with Claude Sonnet 4.5:
At 10,000 queries per month: $3,100/month. At 50,000 queries per month: $15,500/month.
RAG approach with Claude Sonnet 4.5:
At 10,000 queries per month: $190/month (plus vector DB cost, typically $20 to $200/month for pgvector or Pinecone Starter). At 50,000 queries per month: $950/month.
The RAG approach costs 16x less. At scale, the gap is what makes or breaks the unit economics of an AI support product.
Prompt caching closes some of the gap. Anthropic and Google both support caching the static portion of the prompt. With Claude prompt caching, that 100k KB drops to $0.30 per 1M tokens on cache hit (a 10x discount). Cost per query drops to roughly $0.034. Still nearly 2x the RAG cost, but a real improvement. Caching only helps if the cached content stays unchanged for at least 5 minutes (Anthropic) or until evicted (Google).
Time-to-first-token is the user-visible metric for chat.
RAG: Vector search (50 to 200ms) plus first-token from the LLM with a small prompt (200 to 400ms). Total: roughly 300ms to 600ms before the user sees the first character.
Long context with 100k tokens: No retrieval, but first-token-time with a large prompt is dramatically slower. In our internal testing with Claude Sonnet 4.5, first-token-time was 1.2 to 2.1 seconds for a 100k token prompt and 2 to 4 seconds for a 200k prompt. Gemini 2.5 Pro at 1M tokens was 4 to 8 seconds. Verify with your own benchmarks: latency varies by region, traffic, and tier.
For an inline chat widget on a marketing site, the difference between 400ms and 2 seconds is the difference between "this feels instant" and "this feels broken." Users do not wait politely.
For an internal support tool where the agent expects a 3-second think time anyway, the latency hit matters less.
The intuitive argument for long context is "more context is better." The empirical research says it depends.
The arXiv paper "RAG or Long-Context LLMs? A Comprehensive Study and Hybrid Approach" (arXiv:2407.16833, Li et al., 2024) tested both approaches across nine benchmarks. The summary:
A more recent study from Meilisearch (their March 2026 comparison post) tested long-context retrieval specifically and found the "lost in the middle" problem persists even in 200k context windows. Information in the middle of a long prompt is retrieved with 10 to 20% lower accuracy than information at the start or end. For a support KB where the relevant answer could be anywhere in 500 articles, that is a real accuracy tax.
The honest read: long context is competitive on accuracy if your KB fits and you can absorb the cost. RAG is cheaper, faster, and competitive enough for most support workloads.
| Dimension | RAG | Long context |
|---|---|---|
| Cost per query (100k KB, Sonnet 4.5) | $0.019 | $0.31 (or $0.034 with prompt caching) |
| First-token latency | 300 to 600ms | 1.2 to 8 seconds depending on size |
| Knowledge base size ceiling | Effectively unlimited | 200k tokens (Claude/GPT) or 2M (Gemini) |
| Update freshness | Add or update a doc, reindex one chunk | Replace the full prompt |
| Setup complexity | Vector DB + chunking + retrieval logic | Paste KB into prompt |
| Multi-tenancy | Easy (filter by customer in vector query) | Hard (need separate prompts per customer) |
| Best for | High volume, large KB, cost-sensitive | Small KB, low volume, prototyping |
| Failure mode | Retrieves wrong chunk, model answers wrong | Misses relevant fact in long prompt |
There are real cases where stuffing the prompt is the right call.
Small knowledge bases under 100 documents. If your entire help center is 80 articles averaging 500 words each (40,000 total words, around 55k tokens), the full KB comfortably fits in any modern model. The cost per query at Sonnet 4.5 with prompt caching is roughly $0.02 to $0.04. The complexity of running a vector DB is hard to justify.
Conversation memory. If you want the agent to remember a 50-turn conversation across days or sessions, long context is the natural fit. RAG over conversation history is awkward: you would need to chunk and embed the user's own past messages, which is technically possible but rarely worth the complexity.
Prototyping or proof-of-concept. Before you commit to RAG infrastructure, validate the AI agent works at all. Paste your KB into a Claude prompt, see if the answers are useful, then decide whether to invest in retrieval.
Long structured documents. A 50-page policy PDF, a complex API spec, or a legal contract that needs to be reasoned about as a whole. RAG splits these into chunks and often loses cross-references. Long context keeps the document intact.
Internal copilots with low query volume. An employee-facing tool that handles 100 queries per day across all employees. At 100 queries per day, even $0.30 per query is $900 per month. Tolerable if the alternative is a week of engineering time on RAG infrastructure.
The other side.
Large knowledge bases above 200 documents. Once your KB exceeds 200k tokens (the Claude and GPT ceiling), long-context is not an option for those models. Gemini's 2M window remains, but latency is brutal and the lost-in-the-middle problem gets worse.
High volume. Above 10,000 queries per month, the cost gap (16x at our example) becomes the dominant factor. RAG is the only economically sane choice.
Frequent KB updates. If your KB changes daily (typical for ecommerce product info, SaaS feature releases, policy updates), RAG lets you reindex one chunk at a time. Long context invalidates the entire cached prompt every update.
Multi-tenant SaaS. If you serve 500 customer accounts each with their own KB, long context means 500 separate prompts to maintain. RAG with metadata filtering ("retrieve chunks where customer_id = X") is dramatically cleaner.
Cost-sensitive workflows. Anything where you charge per resolution and the unit economics matter. Long context turns a 2-cent margin per query into a 30-cent margin per query in the wrong direction.
The actual production answer for most support workloads in 2026 is not pure RAG or pure long context. It is a hybrid.
Conversation memory: long context. The current session's chat history (last 20 to 50 turns) lives in the prompt. This is what gives the agent continuity within a session and the ability to reference earlier messages.
Knowledge base: RAG. The help articles, product docs, and policy content live in a vector database. Retrieve the top 5 to 10 chunks per query, paste them in the prompt.
Account state: tool call. Account-specific data (subscription tier, recent orders, open tickets) is fetched on demand via a tool call, not stuffed in the prompt or indexed.
Long structured documents: long context fallback. When the agent needs to reason about a full policy document or a long contract, it can request the entire document as a tool call result. This is rare but worth supporting.
Chatsy's own architecture follows this pattern. Conversation memory and recent context sit in the prompt. The knowledge base is indexed and retrieved per query. Tool calls fetch account state. We do not stuff the full KB in every request because the math does not work.
Three caveats worth naming.
Prices keep dropping. GPT-5 mini at $0.50 per 1M input tokens makes the long-context math look very different. If you can tolerate the accuracy of a smaller model, long context with GPT-5 mini at 128k is roughly $0.07 per query. That puts the 10k query/month bill at $700, not $3,100. RAG still wins, but the gap is narrower.
Caching gets better. Anthropic and Google are both iterating fast on prompt caching. If a 1-hour cache TTL becomes standard at a 95% discount, long-context economics shift again.
Your KB might be tiny. If you have 30 help articles, none of this matters. Just paste them in a Claude prompt and ship. RAG infrastructure is overkill for a 30-doc knowledge base.
A short list of when you should stick with RAG:
If three or more apply, do not get talked into the long-context-only approach by an LLM provider's keynote slide.
No. RAG is cheaper, faster, and scales better than long context for most production workloads. It is more correctly described as evolving: prompt caching, hybrid approaches, and better embeddings have changed the optimal configuration, but the core idea (retrieve, then generate) is still the right default for support at any meaningful scale.
Technically yes. At $1.25 per 1M input tokens for prompts under 200k (and $2.50 above), a 1M token KB costs about $2.50 per query. At 10,000 queries per month that is $25,000. The accuracy also suffers from the lost-in-the-middle problem. Most teams that try this approach end up adding RAG within six months for cost or latency reasons.
Closer, but not equivalent. Anthropic's prompt caching offers a 10x discount on cached input tokens for cached reads. A 100k KB drops from $0.30 per query to $0.03 per query. RAG is still around $0.02 per query at the same model. The remaining gap is small enough that you might pick long context for simplicity if your team is short on engineering capacity.
That is the hybrid pattern we recommend and what most production systems actually run in 2026. Conversation context (recent turns) lives in the prompt because retrieval over short, session-scoped messages is awkward. The knowledge base (large, slow-changing, multi-tenant) lives behind RAG. This is the default architecture for Chatsy.
If you are building from scratch, do not over-engineer. Start with long context on a small KB and a single model. If you outgrow it (cost, latency, KB size), add RAG.
If you are buying, ask the vendor what they actually do. A vendor that claims "we use the full 2M Gemini context for every query" is either lying or burning investor money. A vendor that says "we use RAG with hybrid search, with long context for conversation memory" is probably running a sensible architecture.
Chatsy runs the hybrid pattern by default. RAG over your knowledge base, conversation memory in the prompt, tool calls for account state. You pick the model (GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, and 25+ others) and we handle the architecture. Free plan covers 40 messages per month. Paid plans run $35 to $475 monthly. See pricing for the full breakdown.
Semantic AI search vs keyword search for docs and support. Real costs, latency, and the four scenarios where each one wins.