Context Window
A context window is the maximum number of tokens that a large language model can process in a single request, encompassing all input (system prompt, conversation history, retrieved context) and output (the generated response). It represents the model's effective working memory.
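The idea of "everything must fit in one budget" can be sketched in a few lines. This is a minimal estimate using the common rule of thumb of roughly 4 characters per token; a real system would use the model's actual tokenizer (e.g. tiktoken), and all names here are illustrative:

```python
# Rough estimate of how many context-window tokens a request will use.
# Uses the ~4 characters-per-token heuristic; production code should
# count with the model's real tokenizer instead.

def estimate_tokens(text: str) -> int:
    """Approximate token count via the ~4 chars/token rule of thumb."""
    return max(1, len(text) // 4)

def request_token_usage(system_prompt: str, history: list[str],
                        retrieved: list[str], output_reserve: int) -> int:
    """Total tokens a request needs: all inputs plus space reserved
    for the generated response."""
    input_tokens = (estimate_tokens(system_prompt)
                    + sum(estimate_tokens(m) for m in history)
                    + sum(estimate_tokens(p) for p in retrieved))
    return input_tokens + output_reserve

usage = request_token_usage(
    system_prompt="You are a helpful support agent.",
    history=["Hi, my order is late.", "Sorry to hear that! Order number?"],
    retrieved=["Orders ship within 3-5 business days..."],
    output_reserve=512,
)
print(usage)
```

If `usage` exceeds the model's context window, the request will fail or be truncated, which is why the rest of this article is about managing that budget.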
How it works
Every LLM has a fixed context window size:
- **GPT-5**: 128K-1M tokens
- **Claude 4.5**: 200K tokens
- **Gemini 3**: 1M+ tokens
- **Llama models**: 8K-128K tokens depending on version
The context window must contain everything the model needs to generate a response: system prompt, RAG-retrieved passages, conversation history, and space for the output. When the total exceeds the context window, content must be truncated or summarized.
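One common truncation strategy implied above can be sketched as follows: when the combined prompt would exceed the window, drop the oldest conversation messages first while keeping the system prompt and RAG passages intact. The limits and token counts here are illustrative assumptions:

```python
# Minimal sketch: fit conversation history into whatever budget remains
# after the fixed parts of the prompt and the output reserve.

CONTEXT_WINDOW = 8192   # assumed model limit, in tokens
OUTPUT_RESERVE = 1024   # tokens held back for the response

def fit_history(system_tokens: int, rag_tokens: int,
                history: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return the longest suffix of `history` (message, token_count)
    that fits alongside the fixed parts of the prompt."""
    budget = CONTEXT_WINDOW - OUTPUT_RESERVE - system_tokens - rag_tokens
    kept: list[tuple[str, int]] = []
    used = 0
    for message, tokens in reversed(history):   # walk newest-first
        if used + tokens > budget:
            break
        kept.append((message, tokens))
        used += tokens
    kept.reverse()                              # restore chronological order
    return kept
```

With a 500-token system prompt and 3,000 tokens of RAG context, only 3,668 tokens remain for history here, so a five-message history of 1,000-token messages gets cut down to the three most recent.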
For customer support chatbots, context window management involves balancing:
- Enough RAG context for accurate answers (more relevant context = more accurate)
- Enough conversation history for continuity (more history = better multi-turn)
- Space for a complete response (too little space = truncated answers)
- Token costs (more tokens in context = higher per-conversation cost)
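One way to make these trade-offs explicit is a fixed per-request token budget. The specific numbers below are illustrative assumptions, not tuned values:

```python
# A fixed token budget per request makes the balancing act explicit
# and easy to audit. All allocations here are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class TokenBudget:
    window: int = 32_000        # assumed model context window
    system_prompt: int = 500
    rag_context: int = 4_000
    history: int = 3_000
    response: int = 1_000       # reserved for the model's answer

    def slack(self) -> int:
        """Unallocated headroom left in the window."""
        return self.window - (self.system_prompt + self.rag_context
                              + self.history + self.response)

budget = TokenBudget()
assert budget.slack() >= 0, "allocation exceeds the context window"
print(budget.slack())
```

Keeping the budget in one place means raising the RAG allocation (for accuracy) visibly shrinks what is left for history or response, rather than silently overflowing the window.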
Frequently asked questions
Does a larger context window always mean better answers?
Not necessarily. While more context allows the model to consider more information, including irrelevant passages can actually decrease answer quality (the "lost in the middle" problem). Selective, high-quality context often outperforms large volumes of mediocre context.
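"Selective, high-quality context" can be as simple as keeping only the top-scoring retrieved passages above a relevance threshold. The scores, threshold, and passages below are illustrative:

```python
# Sketch of selective context: keep only the top-k retrieved passages
# by relevance score instead of stuffing every passage into the window.

def select_passages(scored: list[tuple[str, float]],
                    k: int = 3, min_score: float = 0.5) -> list[str]:
    """Top-k passages above a relevance threshold, best first."""
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    return [text for text, score in ranked[:k] if score >= min_score]

passages = [
    ("Refund policy: 30 days.", 0.92),
    ("Company history page.", 0.21),     # irrelevant -- filtered out
    ("Refund form walkthrough.", 0.78),
    ("Shipping rates table.", 0.55),
]
print(select_passages(passages))
```

Filtering out the low-scoring passage both avoids the "lost in the middle" effect and saves input tokens.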
What happens when a conversation exceeds the context window?
The system must truncate or summarize older content. Well-designed chatbots use a sliding window approach: summarizing older messages, keeping recent ones in full, and always preserving the system prompt and fresh RAG context. Poorly designed systems simply cut off content, losing important context.
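The sliding-window approach described above can be sketched like this. The `summarize` function is a stand-in for a real summarization step (in practice, a cheap LLM call), and `keep_recent` is an illustrative setting:

```python
# Sliding-window sketch: older messages are collapsed into a summary,
# recent ones are kept verbatim, and the system prompt always survives.

def summarize(messages: list[str]) -> str:
    """Placeholder summarizer -- a real system would call a model here."""
    return f"[Summary of {len(messages)} earlier messages]"

def sliding_window(system_prompt: str, history: list[str],
                   keep_recent: int = 4) -> list[str]:
    """Prompt pieces in order: system prompt, summary of older turns,
    then the most recent messages in full."""
    older, recent = history[:-keep_recent], history[-keep_recent:]
    parts = [system_prompt]
    if older:
        parts.append(summarize(older))
    parts.extend(recent)
    return parts

window = sliding_window("You are a support agent.",
                        [f"message {i}" for i in range(1, 11)])
print(window)
```

A ten-message history collapses to six prompt pieces: the system prompt, one summary line covering the six oldest messages, and the four newest messages in full.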
How much context window does a support chatbot need?
For most customer support use cases, 32K tokens is sufficient. This comfortably holds a system prompt (500 tokens), RAG context (2,000-4,000 tokens), 10-15 conversation messages (2,000-3,000 tokens), and space for response generation. Very few support conversations need the 128K+ windows available in modern models.
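Checking the arithmetic in the answer above: even taking the high end of each component, most of a 32K window remains free for the response.

```python
# Sanity check on the 32K budget, using the high end of each range
# quoted in the answer above.

WINDOW = 32_000
components = {
    "system_prompt": 500,
    "rag_context": 4_000,      # high end of 2,000-4,000
    "conversation": 3_000,     # high end for 10-15 messages
}
used = sum(components.values())
response_space = WINDOW - used
print(response_space)  # tokens left for generating the response
```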
Does using more of the context window increase cost?
Yes. LLM APIs charge per input token, so filling the context window with more RAG passages or longer conversation history directly increases the cost per interaction. This is why intelligent context management — selecting only the most relevant passages and summarizing old messages — is important for cost optimization.
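The cost effect is easy to quantify with a rough per-token cost model. The prices below are illustrative assumptions, not any provider's actual rates:

```python
# Rough cost model: input tokens dominate when the context window is
# kept full. Per-token prices are illustrative assumptions only.

INPUT_PRICE = 3.00 / 1_000_000    # assumed $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # assumed $ per output token

def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the assumed per-token rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

full = interaction_cost(30_000, 500)   # window stuffed with context
lean = interaction_cost(6_000, 500)    # selective context + summaries
print(f"full: ${full:.4f}, lean: ${lean:.4f}")
```

At these assumed rates the stuffed-context request costs several times more than the lean one for the same answer length, which is the economic case for selective context and history summarization.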