A context window is the maximum number of tokens that a large language model can process in a single request, encompassing all input (system prompt, conversation history, retrieved context) and output (the generated response). It represents the model's effective working memory.
Every LLM has a fixed context window size, set by the model provider; the exact limit varies from model to model, but a single request can never exceed it.
The context window must contain everything the model needs to generate a response: system prompt, RAG-retrieved passages, conversation history, and space for the output. When the total exceeds the context window, content must be truncated or summarized.
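A minimal sketch of that budget check, using a rough 4-characters-per-token estimate (a real system would use the model's own tokenizer); the window size and reserved output space below are illustrative assumptions:

```python
# Rough sketch of a context-budget check before sending a request.
# Token counts are approximated as len(text) // 4; all constants are illustrative.

CONTEXT_WINDOW = 32_000      # assumed model limit
RESERVED_FOR_OUTPUT = 1_000  # space kept free for the generated response

def approx_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, rag_passages: list[str], history: list[str]) -> bool:
    """Return True if everything plus the reserved output space fits in the window."""
    used = (
        approx_tokens(system_prompt)
        + sum(approx_tokens(p) for p in rag_passages)
        + sum(approx_tokens(m) for m in history)
    )
    return used + RESERVED_FOR_OUTPUT <= CONTEXT_WINDOW
```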
For customer support chatbots, context window management involves balancing the following (a simple budgeting sketch follows the list):

- Enough RAG context for accurate answers (more relevant context generally means more accurate answers)
- Enough conversation history for continuity (more history means better multi-turn behavior)
- Space for a complete response (too little space means truncated answers)
- Token costs (more tokens in context means a higher per-conversation cost)
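One common way to manage that balance is to fix a token budget per component up front. The fractions below are illustrative assumptions, not a recommendation:

```python
# Illustrative static budget for a 32K-token window; all fractions are assumptions.
CONTEXT_WINDOW = 32_000

BUDGET = {
    "system_prompt": int(CONTEXT_WINDOW * 0.05),  # instructions and persona
    "rag_context":   int(CONTEXT_WINDOW * 0.45),  # retrieved knowledge base passages
    "history":       int(CONTEXT_WINDOW * 0.30),  # prior conversation turns
    "output":        int(CONTEXT_WINDOW * 0.20),  # reserved for the response
}

# Sanity check: the components must never add up to more than the window.
assert sum(BUDGET.values()) <= CONTEXT_WINDOW
```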
In practice, context window management should be evaluated by what it changes in the support workflow. Ask whether it improves answer accuracy, reduces repeated agent work, clarifies handoff decisions, or makes reporting easier. If the answer is only "it sounds modern," the concept is not yet operational.
A concrete example is RAG context allocation for accuracy: for a factual question like "What is your refund policy?", the system allocates 60% of the context window to RAG passages (retrieving 5-8 relevant article sections) and 10% to conversation history. This maximizes the chance of including the correct answer in the context.
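A sketch of how such an allocation might be switched by query type; the "factual" row mirrors the percentages above, while the other values and names are illustrative assumptions:

```python
# Hypothetical per-query-type allocation of the context window.
# The "factual" fractions echo the example above (60% RAG, 10% history);
# everything else is an assumption for illustration.
ALLOCATIONS = {
    "factual":    {"rag": 0.60, "history": 0.10, "output": 0.25, "system": 0.05},
    "multi_turn": {"rag": 0.30, "history": 0.40, "output": 0.25, "system": 0.05},
}

def token_budget(query_type: str, window: int = 32_000) -> dict[str, int]:
    """Convert the fractional allocation for a query type into token counts."""
    shares = ALLOCATIONS[query_type]
    return {part: int(window * share) for part, share in shares.items()}

# e.g. token_budget("factual") -> {"rag": 19200, "history": 3200, "output": 8000, "system": 1600}
```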
The simplest takeaway: the context window is the total token capacity for input and output in a single LLM request.
Not necessarily. While more context allows the model to consider more information, including irrelevant passages can actually decrease answer quality (the "lost in the middle" problem). Selective, high-quality context often outperforms large volumes of mediocre context.
The system must truncate or summarize older content. Well-designed chatbots use a sliding window approach: summarizing older messages, keeping recent ones in full, and always preserving the system prompt and fresh RAG context. Poorly designed systems simply cut off content, losing important context.
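A minimal sketch of the sliding-window idea: recent messages are kept verbatim until the history budget is spent, everything older is collapsed into a single summary entry (a trivial placeholder here; a real system would have the LLM write the digest), and the system prompt is managed separately so it is never dropped.

```python
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough 4-characters-per-token estimate

def sliding_window_history(messages: list[str], budget: int) -> list[str]:
    """Keep the newest messages in full until the budget is spent,
    then replace everything older with one summary entry."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):          # walk newest-first
        cost = approx_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    older = messages[: len(messages) - len(kept)]
    if older:
        # Placeholder summary; in practice this would be an LLM-generated digest.
        kept.append(f"[summary of {len(older)} earlier messages]")
    return list(reversed(kept))             # restore chronological order
```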
A knowledge base article is 3,000 tokens long, but only 800 tokens of context space are available. The system retrieves only the most relevant section of the article (the paragraph matching the query) rather than the full article, fitting within the available window.
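A sketch of that selection step, scoring paragraphs by naive keyword overlap with the query (a production system would use embedding similarity) and keeping the best-scoring ones that still fit the 800-token budget:

```python
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def select_sections(article: str, query: str, budget: int = 800) -> str:
    """Pick the article paragraphs most relevant to the query that fit the budget.
    Relevance here is naive word overlap; embeddings would be used in practice."""
    query_words = set(query.lower().split())
    paragraphs = [p for p in article.split("\n\n") if p.strip()]
    scored = sorted(
        paragraphs,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    chosen, used = [], 0
    for para in scored:
        cost = approx_tokens(para)
        if used + cost > budget:
            continue
        chosen.append(para)
        used += cost
    return "\n\n".join(chosen)
```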
For most customer support use cases, 32K tokens is sufficient. This comfortably holds a system prompt (500 tokens), RAG context (2,000-4,000 tokens), 10-15 conversation messages (2,000-3,000 tokens), and space for response generation. Very few support conversations need the 128K+ windows available in modern models.
Yes. LLM APIs charge per input token, so filling the context window with more RAG passages or longer conversation history directly increases the cost per interaction. This is why intelligent context management (selecting only the most relevant passages and summarizing old messages) is important for cost optimization.
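A back-of-the-envelope cost sketch; the per-token price below is a placeholder, not any real provider's rate:

```python
# Hypothetical pricing: $3.00 per million input tokens (placeholder, not a real rate).
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000

def cost_per_turn(system_tokens: int, rag_tokens: int, history_tokens: int) -> float:
    """Input cost of one chatbot turn; output tokens are billed separately."""
    return (system_tokens + rag_tokens + history_tokens) * PRICE_PER_INPUT_TOKEN

lean = cost_per_turn(500, 1_500, 1_000)        # ~3,000 input tokens
stuffed = cost_per_turn(500, 12_000, 7_500)    # ~20,000 input tokens
print(f"lean: ${lean:.4f}/turn   stuffed: ${stuffed:.4f}/turn")
```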