Chatsy
Glossary

Context Window

A context window is the maximum number of tokens that a large language model can process in a single request, encompassing all input (system prompt, conversation history, retrieved context) and output (the generated response). It represents the model's effective working memory.

How it works

Every LLM has a fixed context window size:

  • **GPT-5**: 128K-1M tokens
  • **Claude 4.5**: 200K tokens
  • **Gemini 3**: 1M+ tokens
  • **Llama models**: 8K-128K tokens depending on version

The context window must contain everything the model needs to generate a response: system prompt, RAG-retrieved passages, conversation history, and space for the output. When the total exceeds the context window, content must be truncated or summarized.
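The arithmetic above can be sketched as a simple budget check. This is an illustrative example, not Chatsy's implementation; the 32K window and all token counts are assumed values, and a real system would count tokens with the model's tokenizer.

```python
# Hypothetical sketch: checking whether a request fits a model's context window.
# All token counts are illustrative; real systems use the model's tokenizer.

CONTEXT_WINDOW = 32_000  # assumed model limit, in tokens

def fits_in_window(system_tokens, rag_tokens, history_tokens, max_output_tokens,
                   window=CONTEXT_WINDOW):
    """Return True if all inputs plus reserved output space fit the window."""
    total = system_tokens + rag_tokens + history_tokens + max_output_tokens
    return total <= window

# Example: 500-token system prompt, 4,000 tokens of RAG context,
# 2,500 tokens of history, 1,000 tokens reserved for the answer.
print(fits_in_window(500, 4_000, 2_500, 1_000))  # True: 8,000 <= 32,000
```

Note that output space must be reserved up front: if the inputs alone fill the window, the model has no room left to answer.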

For customer support chatbots, context window management involves balancing:

  • Enough RAG context for accurate answers (more context = more accurate)
  • Enough conversation history for continuity (more history = better multi-turn)
  • Space for a complete response (too little space = truncated answers)
  • Token costs (more tokens in context = higher per-conversation cost)

Why it matters

Context window size determines how much information your chatbot can consider when generating a response. A larger context window allows more RAG passages, longer conversation history, and more detailed system instructions — generally producing better answers. However, larger contexts also cost more and can increase latency.

How Chatsy uses context window

Chatsy dynamically manages the context window based on conversation length and complexity. Short conversations get more space for RAG context (better retrieval accuracy). Long multi-turn conversations use a sliding window that summarizes older messages while preserving recent context. This ensures optimal use of available tokens regardless of conversation length.

Real-world examples

RAG context allocation for accuracy

For a factual question like "What is your refund policy?", the system allocates 60% of the context window to RAG passages (retrieving 5-8 relevant article sections) and 10% to conversation history. This maximizes the chance of including the correct answer in the context.
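A fractional split like the 60% RAG / 10% history allocation above can be sketched as follows. The function name, window size, and 500-token system prompt are assumptions for illustration, not Chatsy's actual code.

```python
# Hypothetical sketch: allocating a context window by fraction, mirroring the
# 60% RAG / 10% history split described above. Numbers are illustrative.

def allocate(window, rag_frac=0.60, history_frac=0.10, system_tokens=500):
    """Split a token window into RAG, history, and response budgets."""
    rag = int(window * rag_frac)
    history = int(window * history_frac)
    response = window - system_tokens - rag - history  # whatever remains
    return {"system": system_tokens, "rag": rag,
            "history": history, "response": response}

budget = allocate(32_000)
print(budget)  # {'system': 500, 'rag': 19200, 'history': 3200, 'response': 9100}
```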

Long conversation management

A customer is on message 20 of a troubleshooting conversation. The full history would consume 4,000 tokens. The system summarizes messages 1-12 into a 200-token summary and keeps messages 13-20 in full, preserving the recent context while staying within limits.
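The summarize-and-keep pattern above can be sketched like this. `summarize` here is a stand-in for an LLM summarization call, and the 8-message cutoff is an assumed parameter; both are illustrative, not Chatsy's implementation.

```python
# Hypothetical sketch of the sliding-window approach: summarize older messages,
# keep recent ones verbatim. summarize() stands in for an LLM summarization call.

def summarize(messages):
    # Placeholder: a real system would call an LLM to compress these messages
    # into a short summary (e.g. ~200 tokens).
    return f"[Summary of {len(messages)} earlier messages]"

def compress_history(messages, keep_recent=8):
    """Replace all but the most recent messages with a single summary entry."""
    if len(messages) <= keep_recent:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent

history = [f"message {i}" for i in range(1, 21)]  # 20-message conversation
window = compress_history(history)
print(len(window))  # 9: one summary entry plus messages 13-20
print(window[0])    # [Summary of 12 earlier messages]
```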

Context window overflow handling

A knowledge base article is 3,000 tokens long, but only 800 tokens of context space are available. The system retrieves only the most relevant section of the article (the paragraph matching the query) rather than the full article, fitting within the available window.
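One way to implement this selective retrieval is a greedy pack of the highest-scoring sections into the available budget. The section names, relevance scores, and token counts below are invented for illustration; a real system would get them from an embedding model and a tokenizer.

```python
# Hypothetical sketch: packing the most relevant article sections into an
# 800-token budget. Scores and token counts are illustrative stand-ins for
# embedding-similarity scores and tokenizer counts.

def select_sections(sections, budget=800):
    """Greedily pack the highest-scoring sections into the token budget."""
    chosen, used = [], 0
    for section in sorted(sections, key=lambda s: s["score"], reverse=True):
        if used + section["tokens"] <= budget:
            chosen.append(section["id"])
            used += section["tokens"]
    return chosen, used

article = [
    {"id": "intro",    "tokens": 900, "score": 0.40},
    {"id": "policy",   "tokens": 350, "score": 0.92},  # best match for the query
    {"id": "examples", "tokens": 700, "score": 0.55},
    {"id": "appendix", "tokens": 400, "score": 0.30},
]
print(select_sections(article))  # (['policy', 'appendix'], 750)
```

The 900-token intro and 700-token examples are skipped because they would blow the budget; the paragraph that actually answers the query fits with room to spare.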

Key takeaways

  • The context window is the total token capacity for input and output in a single LLM request

  • Modern models range from 8K to 1M+ tokens, with larger windows enabling more context at higher cost

  • Context must be split between system prompt, RAG passages, conversation history, and response space

  • Smart context management (summarization, selective retrieval) is more effective than simply using the largest model

  • For customer support, 32K-128K token windows are sufficient for virtually all conversation scenarios

Frequently asked questions

Does a larger context window always mean better answers?

Not necessarily. While more context allows the model to consider more information, including irrelevant passages can actually decrease answer quality (the "lost in the middle" problem). Selective, high-quality context often outperforms large volumes of mediocre context.

What happens when a conversation exceeds the context window?

The system must truncate or summarize older content. Well-designed chatbots use a sliding window approach: summarizing older messages, keeping recent ones in full, and always preserving the system prompt and fresh RAG context. Poorly designed systems simply cut off content, losing important context.

How much context window does a support chatbot need?

For most customer support use cases, 32K tokens is sufficient. This comfortably holds a system prompt (500 tokens), RAG context (2,000-4,000 tokens), 10-15 conversation messages (2,000-3,000 tokens), and space for response generation. Very few support conversations need the 128K+ windows available in modern models.
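The arithmetic behind that claim, using the upper ends of the illustrative ranges above:

```python
# Hypothetical worked example of the 32K support-chatbot budget described above.

window = 32_000
used = 500 + 4_000 + 3_000   # system prompt + RAG context + history
remaining = window - used    # space left for response generation
print(remaining)  # 24500
```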

Does using more of the context window increase cost?

Yes. LLM APIs charge per input token, so filling the context window with more RAG passages or longer conversation history directly increases the cost per interaction. This is why intelligent context management — selecting only the most relevant passages and summarizing old messages — is important for cost optimization.
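The cost impact can be made concrete with a small sketch. The $3-per-million-input-tokens rate is an assumed example price, not a quote for any specific model or provider.

```python
# Hypothetical sketch: per-conversation input cost. The rate below is an
# assumed example price, not any provider's actual pricing.

PRICE_PER_MILLION_INPUT = 3.00  # assumed USD per 1M input tokens

def input_cost(tokens_per_turn, turns):
    """Total input cost for a conversation re-sending context each turn."""
    total_tokens = tokens_per_turn * turns
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT

# A 10-turn conversation sending 6,000 context tokens each turn...
print(round(input_cost(6_000, 10), 4))  # 0.18
# ...versus a trimmed context of 2,500 tokens per turn:
print(round(input_cost(2_500, 10), 4))  # 0.075
```

Because the full context is re-sent on every turn, trimming even a few thousand tokens per turn compounds across the conversation.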

Related Resources

See context window in action

Try Chatsy free and experience how these concepts come together in an AI-powered support platform.
