Chatsy

Advanced RAG Optimization: Chunking, Re-ranking & Hybrid Retrieval

Go beyond basic RAG with production-grade chunking strategies, cross-encoder re-ranking, query transformations, and hybrid retrieval pipelines --- with code examples and evaluation metrics.

Asad Ali
Founder & CEO
March 30, 2026
20 min read

Basic RAG is straightforward: chunk your documents, embed them, retrieve the top-K most similar chunks, and feed them to an LLM. It works well on small knowledge bases with clear, well-structured content. But at scale --- thousands of documents, ambiguous queries, mixed content types --- basic RAG falls apart. Answers become irrelevant, hallucinations increase, and users lose trust.

This guide covers the techniques that bridge the gap between a RAG demo and a production system: advanced chunking, re-ranking, query transformation, hybrid retrieval, and rigorous evaluation.

TL;DR:

  • Basic RAG fails at scale because fixed-size chunking breaks semantic boundaries, single-vector retrieval misses lexical matches, and top-K ranking without re-scoring returns noisy results.
  • Use semantic or document-aware chunking to preserve meaning, hybrid retrieval (BM25 + dense vectors) to cover both keyword and semantic matches, and cross-encoder re-ranking to push the most relevant results to the top.
  • Query transformation techniques like HyDE and multi-query expansion dramatically improve recall on ambiguous or under-specified queries. Measure everything with MRR, NDCG, and Recall@K.

Why Basic RAG Fails at Scale

Consider a knowledge base with 10,000 documents covering product documentation, API references, troubleshooting guides, and policy pages. A customer asks: "Why is my webhook not firing?"

Basic RAG with fixed-size chunking and pure vector search might return:

  1. A chunk about webhook configuration (relevant)
  2. A chunk about webhook pricing tiers (irrelevant --- "webhook" keyword match inflated the embedding similarity)
  3. A chunk from the changelog mentioning webhooks (irrelevant --- recency, not relevance)
  4. A chunk about API rate limits that mentions "firing" in a different context (irrelevant)

Only one of four retrieved chunks is useful. The LLM now has to generate an answer from 75% noise. This is the needle-in-a-haystack problem at scale, and it manifests through three failure modes:

Failure Mode | Root Cause | Impact
Semantic boundary breaks | Fixed-size chunks split sentences and paragraphs mid-thought | Retrieved chunks lack coherent context
Vocabulary mismatch | Pure vector search misses exact terms (error codes, product names) | Relevant documents with exact matches rank lower than semantic near-misses
Ranking noise | Top-K by cosine similarity alone does not account for query-document relevance depth | Superficially similar but unhelpful chunks displace truly relevant ones

The rest of this guide addresses each failure mode with specific techniques and code.

Chunking Strategies Compared

Chunking is the most underestimated component of a RAG pipeline. The way you split documents determines the upper bound of your retrieval quality --- no amount of re-ranking can fix a chunk that lost its meaning at the boundary.

Fixed-Size Chunking

Split text every N tokens with an overlap window.

typescript
function fixedSizeChunk(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push(words.slice(start, end).join(" "));
    start += chunkSize - overlap;
  }
  return chunks;
}

// Usage: 500-word chunks with 100-word overlap
const chunks = fixedSizeChunk(documentText, 500, 100);

Pros: Simple, predictable chunk sizes, works with any content. Cons: Splits mid-sentence, mid-paragraph, even mid-code-block. The overlap helps but does not solve the fundamental problem of breaking semantic units.

Best for: Quick prototypes and uniformly structured content.

Sentence-Based Chunking

Group sentences up to a token budget, respecting sentence boundaries.

typescript
function sentenceChunk(text: string, maxTokens: number): string[] {
  // Split on sentence boundaries
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentTokens = 0;

  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);
    if (currentTokens + sentenceTokens > maxTokens && currentChunk.length > 0) {
      chunks.push(currentChunk.join(" "));
      currentChunk = [];
      currentTokens = 0;
    }
    currentChunk.push(sentence.trim());
    currentTokens += sentenceTokens;
  }

  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(" "));
  }
  return chunks;
}

Pros: Never splits mid-sentence. More coherent chunks than fixed-size. Cons: Sentence detection is fragile (abbreviations, decimal numbers, URLs). Does not respect higher-level structure like sections or paragraphs.

Semantic Chunking

Use embedding similarity between consecutive sentences to detect topic shifts. When similarity drops below a threshold, insert a chunk boundary.

python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.75, max_tokens: int = 800) -> list[str]:
    """Split text at semantic boundaries using embedding similarity."""
    # split_sentences and estimate_tokens: helper functions assumed available
    # (a sentence splitter and an approximate token counter)
    sentences = split_sentences(text)

    # Embed each sentence
    response = client.embeddings.create(
        input=sentences,
        model="text-embedding-3-small"
    )
    embeddings = [e.embedding for e in response.data]

    # Compute cosine similarity between consecutive sentences
    chunks = []
    current_chunk = [sentences[0]]
    current_tokens = estimate_tokens(sentences[0])

    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        sent_tokens = estimate_tokens(sentences[i])

        # Break if similarity drops below threshold OR token budget exceeded
        if sim < threshold or current_tokens + sent_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
            current_tokens = sent_tokens
        else:
            current_chunk.append(sentences[i])
            current_tokens += sent_tokens

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks

Pros: Chunks align with topic boundaries. Adjacent sentences about the same topic stay together. Cons: Requires embedding every sentence (expensive for large corpora). Threshold tuning is dataset-specific. Can produce highly variable chunk sizes.

Document-Aware Chunking

Parse the document structure (headings, sections, code blocks, lists) and chunk at structural boundaries. This is the most effective strategy for structured content like documentation, help articles, and technical guides.

typescript
interface DocumentSection {
  heading: string;
  level: number; // h1=1, h2=2, etc.
  content: string;
  codeBlocks: string[];
}

function documentAwareChunk(
  markdown: string,
  maxTokens: number = 800
): Array<{ content: string; metadata: { heading: string; level: number } }> {
  const sections = parseMarkdownSections(markdown);
  const chunks: Array<{ content: string; metadata: { heading: string; level: number } }> = [];

  for (const section of sections) {
    const sectionText = `## ${section.heading}\n\n${section.content}`;
    const tokens = estimateTokens(sectionText);

    if (tokens <= maxTokens) {
      // Section fits in one chunk --- keep it whole
      chunks.push({
        content: sectionText,
        metadata: { heading: section.heading, level: section.level },
      });
    } else {
      // Section too large --- split by paragraphs, keeping heading as prefix
      const paragraphs = section.content.split(/\n\n+/);
      let currentContent = `## ${section.heading}\n\n`;
      let currentTokens = estimateTokens(currentContent);

      for (const para of paragraphs) {
        const paraTokens = estimateTokens(para);
        if (
          currentTokens + paraTokens > maxTokens &&
          currentTokens > estimateTokens(`## ${section.heading}\n\n`)
        ) {
          chunks.push({
            content: currentContent.trim(),
            metadata: { heading: section.heading, level: section.level },
          });
          currentContent = `## ${section.heading} (continued)\n\n`;
          currentTokens = estimateTokens(currentContent);
        }
        currentContent += para + "\n\n";
        currentTokens += paraTokens;
      }

      if (currentTokens > estimateTokens(`## ${section.heading}\n\n`)) {
        chunks.push({
          content: currentContent.trim(),
          metadata: { heading: section.heading, level: section.level },
        });
      }
    }
  }

  return chunks;
}

Pros: Preserves document structure. Headings provide natural metadata for filtering. Code blocks stay intact. Cons: Requires a parser for each content format (Markdown, HTML, PDF). Does not work well for unstructured content like emails or chat logs.

Parent-Child Chunking

Store both large parent chunks and smaller child chunks. Retrieve on child chunks (more precise), but pass parent chunks to the LLM (more context).

typescript
interface ParentChildChunk {
  parentId: string;
  parentContent: string; // ~1500 tokens
  children: Array<{
    childId: string;
    content: string; // ~300 tokens
    embedding: number[];
  }>;
}

async function parentChildRetrieval(
  query: string,
  topK: number = 5,
  expandToParent: boolean = true
): Promise<string[]> {
  const queryEmbedding = await embed(query);

  // Search against child chunks for precision
  const childResults = await vectorDb.search({
    embedding: queryEmbedding,
    table: "child_chunks",
    limit: topK,
  });

  if (!expandToParent) {
    return childResults.map((r) => r.content);
  }

  // Expand to parent chunks, deduplicating
  const parentIds = [...new Set(childResults.map((r) => r.parentId))];
  const parents = await db.query(
    `SELECT content FROM parent_chunks WHERE id = ANY($1)`,
    [parentIds]
  );

  return parents.map((p) => p.content);
}

Pros: Best of both worlds --- precise retrieval with rich context. The LLM gets surrounding paragraphs that help it reason. Cons: Storage overhead (you store both granularities). Requires a join at query time.

Chunking Strategy Decision Matrix

Strategy | Best For | Chunk Quality | Implementation Cost | Embedding Cost
Fixed-size | Prototypes, uniform text | Low | Minimal | Low
Sentence-based | Articles, prose | Medium | Low | Low
Semantic | Mixed-topic documents | High | Medium | High (per-sentence)
Document-aware | Structured docs (Markdown, HTML) | High | Medium | Low
Parent-child | Complex docs needing context | Highest | High | Medium

For most production systems, document-aware chunking with a parent-child expansion strategy offers the best trade-off.

Re-Ranking: Pushing the Best Results to the Top

Vector similarity gives you a rough ordering. Re-ranking refines it. A re-ranker takes the query and each candidate chunk as a pair and scores them for relevance --- typically using a cross-encoder model that sees both texts simultaneously.

Why Re-Ranking Matters

Bi-encoder search (the standard embedding approach) encodes query and document independently. It cannot capture fine-grained interactions between query terms and document terms. A cross-encoder processes both together, enabling attention across the query-document pair.

Bi-encoder (embedding search):
  Query  → Encoder → query_vec  ─┐
                                  ├→ cosine_sim → score
  Doc    → Encoder → doc_vec   ─┘

Cross-encoder (re-ranking):
  [Query + Doc] → Encoder → relevance_score

The cross-encoder is more accurate but also more expensive --- you cannot run it over millions of documents. The standard pattern is a two-stage pipeline:

  1. Stage 1 (Retrieval): Fast bi-encoder search returns top-50 to top-100 candidates.
  2. Stage 2 (Re-ranking): Cross-encoder scores and re-orders the candidates, returning top-5 to top-10.

Implementation with Cohere Rerank

typescript
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function retrieveAndRerank(
  query: string,
  topK: number = 5
): Promise<Array<{ content: string; relevanceScore: number }>> {
  // Stage 1: Retrieve top-50 candidates via vector search
  const candidates = await vectorSearch(query, 50);

  // Stage 2: Re-rank with Cohere
  const reranked = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: candidates.map((c) => ({ text: c.content })),
    topN: topK,
  });

  return reranked.results.map((r) => ({
    content: candidates[r.index].content,
    relevanceScore: r.relevanceScore,
  }));
}

ColBERT: Late Interaction Re-Ranking

ColBERT uses a "late interaction" approach: it encodes query and document separately (like a bi-encoder) but compares them at the token level rather than the vector level. Each query token attends to the most similar document token via MaxSim.

python
# ColBERT-style scoring (simplified)
import torch

def colbert_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> float:
    """
    query_embeddings: (num_query_tokens, dim)
    doc_embeddings: (num_doc_tokens, dim)
    """
    # For each query token, find max similarity with any doc token
    similarity_matrix = query_embeddings @ doc_embeddings.T  # (Q, D)
    max_sim_per_query_token = similarity_matrix.max(dim=1).values  # (Q,)
    return max_sim_per_query_token.sum().item()

ColBERT is faster than a full cross-encoder because document embeddings can be precomputed and indexed. It is a strong middle ground when you need re-ranking quality without the latency of running a cross-encoder on every candidate.

When to Add Re-Ranking

Signal | Action
Retrieval recall is high but precision is low | Add re-ranking --- you are finding relevant docs but not surfacing them first
Retrieval recall is low | Fix chunking and retrieval first --- re-ranking cannot surface docs you did not retrieve
Latency budget is tight (<500ms total) | Use ColBERT or skip re-ranking; cross-encoders add 100--300ms
Knowledge base is small (<500 docs) | Re-ranking may not be necessary; top-K from vector search is likely sufficient

Query Transformation

Sometimes the problem is not retrieval but the query itself. Users ask vague, under-specified, or poorly phrased questions. Query transformation techniques rewrite the query before retrieval to improve recall.

HyDE (Hypothetical Document Embedding)

Instead of embedding the raw query, ask the LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. The hypothesis is closer in embedding space to actual relevant documents than the raw question.

typescript
async function hydeRetrieval(query: string, topK: number = 5): Promise<string[]> {
  // Step 1: Generate a hypothetical answer
  const hypothetical = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Given a question, write a short paragraph that would be a good answer to it. Do not say "I don't know." Write as if you are a knowledgeable support article, even if you have to guess.`,
      },
      { role: "user", content: query },
    ],
  });

  // Step 2: Embed the hypothetical answer (not the original query)
  const hydeEmbedding = await embed(hypothetical.content);

  // Step 3: Search with the hypothetical embedding
  // (this overload of vectorSearch accepts a precomputed embedding)
  return vectorSearch(hydeEmbedding, topK);
}

When HyDE helps: Vague queries like "it's not working" where the raw embedding is too generic. The hypothetical answer adds specificity. When HyDE hurts: Specific, well-formed queries where the hypothesis might hallucinate and drift from the actual intent.

Multi-Query Expansion

Generate multiple reformulations of the query, retrieve with each, and merge the results. This casts a wider net.

typescript
async function multiQueryRetrieval(
  query: string,
  topK: number = 5
): Promise<string[]> {
  // Generate 3 reformulations
  const expansions = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Generate 3 alternative phrasings of the following question. Return a JSON object with a "queries" array of 3 strings. Each rephrasing should approach the question from a different angle.`,
      },
      { role: "user", content: query },
    ],
    response_format: { type: "json_object" },
  });

  const queries = [query, ...JSON.parse(expansions.content).queries];

  // Retrieve with each query
  const allResults = await Promise.all(
    queries.map((q) => vectorSearch(q, topK * 2))
  );

  // Merge with Reciprocal Rank Fusion
  return reciprocalRankFusion(allResults, topK);
}

Step-Back Prompting

For specific queries, generate a more general "step-back" question that retrieves broader context. Then retrieve with both the original and step-back query.

typescript
// Original: "Why does my webhook return 403 when using OAuth?"
// Step-back: "How does webhook authentication work?"
async function stepBackRetrieval(query: string, topK: number = 5): Promise<string[]> {
  const stepBack = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Given a specific question, generate a more general question that would retrieve useful background context. Return only the question.`,
      },
      { role: "user", content: query },
    ],
  });

  const [specificResults, broadResults] = await Promise.all([
    vectorSearch(query, topK),
    vectorSearch(stepBack.content, topK),
  ]);

  // Combine: prioritize specific results, supplement with broad context
  return deduplicateAndMerge(specificResults, broadResults, topK);
}

Hybrid Retrieval Pipeline

The highest-performing retrieval systems combine multiple search methods and merge results. Here is a production pipeline combining BM25 keyword search, dense vector search, and metadata filtering:

typescript
interface RetrievalResult {
  chunkId: string;
  content: string;
  score: number;
  source: "bm25" | "vector" | "metadata";
}

async function hybridRetrieval(
  query: string,
  filters?: { category?: string; updatedAfter?: string },
  topK: number = 10
): Promise<RetrievalResult[]> {
  // Run all three retrieval methods in parallel
  const [bm25Results, vectorResults, metadataResults] = await Promise.all([
    // BM25 keyword search (handles exact matches, error codes, product names)
    bm25Search(query, topK * 3),
    // Dense vector search (handles semantic similarity)
    vectorSearch(query, topK * 3),
    // Metadata-filtered search (narrows by category, date, etc.)
    filters
      ? metadataFilteredSearch(query, filters, topK * 2)
      : Promise.resolve([]),
  ]);

  // Merge with Reciprocal Rank Fusion
  return reciprocalRankFusion(
    [bm25Results, vectorResults, metadataResults],
    topK
  );
}

function reciprocalRankFusion(
  resultSets: RetrievalResult[][],
  topK: number,
  k: number = 60 // RRF constant
): RetrievalResult[] {
  const scores = new Map<string, { score: number; result: RetrievalResult }>();

  for (const results of resultSets) {
    for (let rank = 0; rank < results.length; rank++) {
      const result = results[rank];
      const rrfScore = 1 / (k + rank + 1);
      const existing = scores.get(result.chunkId);
      if (existing) {
        existing.score += rrfScore;
      } else {
        scores.set(result.chunkId, { score: rrfScore, result });
      }
    }
  }

  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => ({ ...s.result, score: s.score }));
}

Full Pipeline: Retrieve, Fuse, Re-Rank, Generate

Putting it all together:

typescript
async function advancedRAGPipeline(
  query: string,
  conversationHistory: Message[]
): Promise<string> {
  // 1. Query transformation: expand into multiple queries
  const expandedQueries = await expandQuery(query);

  // 2. Hybrid retrieval for each expanded query
  const allResults = await Promise.all(
    expandedQueries.map((q) => hybridRetrieval(q, undefined, 20))
  );

  // 3. Fuse results across all queries
  const fused = reciprocalRankFusion(allResults, 20);

  // 4. Re-rank with cross-encoder
  const reranked = await cohereRerank(query, fused, 5);

  // 5. Assemble context and generate
  const context = reranked.map((r) => r.content).join("\n\n---\n\n");
  const response = await llm.chat({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer the user's question using ONLY the provided context. If the context does not contain enough information, say so. Do not make up information.

Context:
${context}`,
      },
      ...conversationHistory,
      { role: "user", content: query },
    ],
  });

  return response.content;
}

Evaluation Metrics

You cannot optimize what you do not measure. Here are the metrics that matter for RAG retrieval quality:

MRR (Mean Reciprocal Rank)

The average of 1/rank for the first relevant result across queries. MRR = 1.0 means the correct answer is always first.

python
def mean_reciprocal_rank(queries: list[dict]) -> float:
    """
    queries: [{ "query": str, "retrieved": [str], "relevant": set[str] }]
    """
    reciprocal_ranks = []
    for q in queries:
        for rank, doc_id in enumerate(q["retrieved"], 1):
            if doc_id in q["relevant"]:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

NDCG (Normalized Discounted Cumulative Gain)

Accounts for the position of all relevant results, not just the first. Higher-ranked relevant results contribute more to the score.

python
import numpy as np

def ndcg_at_k(retrieved: list[str], relevant: dict[str, int], k: int) -> float:
    """
    relevant: { doc_id: relevance_grade } where grade is 0, 1, 2, 3
    """
    dcg = 0.0
    for i, doc_id in enumerate(retrieved[:k]):
        rel = relevant.get(doc_id, 0)
        dcg += (2**rel - 1) / np.log2(i + 2)  # i+2 because log2(1) = 0

    # Ideal DCG: sort by relevance grade descending
    ideal_rels = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum((2**rel - 1) / np.log2(i + 2) for i, rel in enumerate(ideal_rels))

    return dcg / idcg if idcg > 0 else 0.0

Recall@K

What fraction of all relevant documents appear in the top-K results. Critical for ensuring you do not miss important context.

python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    retrieved_set = set(retrieved[:k])
    return len(retrieved_set & relevant) / len(relevant) if relevant else 0.0

Practical Benchmarks

Metric | Baseline (Basic RAG) | +Semantic Chunking | +Hybrid Retrieval | +Re-ranking | +Query Expansion
MRR | 0.52 | 0.61 | 0.71 | 0.82 | 0.85
NDCG@5 | 0.45 | 0.54 | 0.65 | 0.76 | 0.79
Recall@10 | 0.60 | 0.68 | 0.79 | 0.82 | 0.88

Each optimization layer compounds. The full pipeline typically achieves 60--80% improvement over basic RAG in retrieval quality.

Production Checklist

Caching

  • Embedding cache: Hash the input text and cache embeddings. Same query should not be embedded twice.
  • Result cache: Cache retrieval results for frequent queries (TTL of 5--15 minutes depending on update frequency).
  • LLM response cache: For identical query + context combinations, return cached responses.
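The embedding cache can be as simple as a dictionary keyed by a hash of the input text. This sketch assumes a hypothetical `embed_fn` callable that wraps your embedding API; in production you would back it with Redis and a TTL rather than an in-process dict:

```python
import hashlib


class EmbeddingCache:
    """In-memory embedding cache keyed by a SHA-256 hash of the input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # callable: str -> list[float]
        self.cache: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        embedding = self.embed_fn(text)
        self.cache[key] = embedding
        return embedding
```

Hashing the text (rather than using the raw string as a key) keeps keys fixed-length and safe to store in an external cache.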

Batching

  • Batch embedding requests: most APIs support up to 2,048 inputs per call. Group chunks during indexing.
  • Batch re-ranking: send all candidates in a single API call rather than one at a time.
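The batching pattern during indexing looks like this; `embed_batch_fn` is a hypothetical stand-in for an embeddings call that accepts a list of inputs, and the 2,048 default reflects the per-call ceiling mentioned above (check your provider's actual limit):

```python
def batch_embed(texts: list[str], embed_batch_fn, batch_size: int = 2048) -> list[list[float]]:
    """Embed a list of texts in fixed-size batches instead of one call per text."""
    embeddings: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings.extend(embed_batch_fn(batch))
    return embeddings
```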

Latency Budgets

StageBudgetTypical
Query expansion150ms80--120ms
Hybrid retrieval (parallel)100ms30--80ms
RRF fusion5ms1--3ms
Re-ranking200ms100--250ms
LLM generation2000ms800--2500ms
Total2500ms1000--3000ms

Retrieval should be the fast part. If retrieval exceeds 300ms, profile your vector index configuration and database connection pooling.
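To know which stage is blowing the budget, instrument each one. A minimal sketch, not tied to any particular tracing library:

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Records wall-clock duration per pipeline stage, in milliseconds."""

    def __init__(self):
        self.timings: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name] = (time.perf_counter() - start) * 1000


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)  # stand-in for the hybrid retrieval call
```

Emitting `timer.timings` per request gives you the raw samples for the per-stage percentiles discussed below.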

Monitoring

Track these in production:

  • Retrieval latency P50/P95/P99 per stage
  • Empty result rate: queries that return zero relevant chunks after re-ranking
  • Re-ranker score distribution: if scores cluster near zero, your retrieval is returning poor candidates
  • Cache hit rate: target >30% for embedding cache, >10% for result cache
  • Token usage: track tokens consumed by query expansion, context assembly, and generation separately
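P50/P95/P99 can be computed from raw latency samples with no external dependencies; a sketch using the nearest-rank method:

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; p in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]


# Example latency samples in milliseconds
latencies = [95.0, 98.0, 102.0, 105.0, 110.0, 115.0, 120.0, 130.0, 300.0, 450.0]
p50, p95, p99 = (percentile(latencies, p) for p in (50, 95, 99))
```

Note how a couple of slow outliers leave P50 untouched while dominating P95/P99, which is exactly why all three are worth tracking.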

Key Takeaways

  1. Chunking is the foundation. No amount of re-ranking fixes a chunk that lost its meaning at a boundary. Use document-aware or semantic chunking for production systems.
  2. Hybrid retrieval is non-negotiable at scale. BM25 handles exact matches (error codes, names); vectors handle semantics. Combine with Reciprocal Rank Fusion.
  3. Re-ranking is the highest-ROI optimization. A cross-encoder re-ranker on top-50 candidates typically improves MRR by 15--20 points.
  4. Transform the query, not just the retrieval. HyDE, multi-query expansion, and step-back prompting improve recall on ambiguous queries.
  5. Measure relentlessly. MRR, NDCG, and Recall@K on a labeled eval set are the only reliable signals. Vibes-based evaluation does not scale.

Frequently Asked Questions

What is the difference between basic RAG and advanced RAG?

Basic RAG uses fixed-size chunking, single-vector retrieval, and top-K ranking by cosine similarity. Advanced RAG adds semantic or document-aware chunking, hybrid retrieval (BM25 + dense vectors), cross-encoder re-ranking, and query transformation techniques. The difference in practice is significant: advanced RAG typically achieves 60--80% higher retrieval quality (MRR, NDCG) on real-world knowledge bases compared to basic RAG.

Which chunking strategy should I use?

For structured content like documentation or help articles, use document-aware chunking that respects headings, sections, and code blocks. For unstructured content like emails or chat logs, use semantic chunking based on embedding similarity between consecutive sentences. If you need both precise retrieval and rich context for the LLM, add a parent-child layer where you retrieve on small child chunks but expand to larger parent chunks.

When should I add a re-ranker to my RAG pipeline?

Add a re-ranker when your retrieval recall is high but precision is low --- meaning relevant documents are in your candidate set but not ranked at the top. If your vector search returns 50 candidates and the correct answer is usually somewhere in the set but not in the top 5, a cross-encoder re-ranker will dramatically improve results. Skip re-ranking if your knowledge base is small (<500 documents) or your latency budget is very tight.

How do I evaluate my RAG pipeline?

Build a labeled evaluation set of at least 100 query-document pairs where humans have marked which documents are relevant to each query. Compute MRR (how quickly you find the first relevant result), NDCG@K (quality of the full ranking), and Recall@K (what fraction of relevant documents you retrieve). Run these metrics after each change to your pipeline --- chunking, retrieval, or re-ranking --- to measure the impact.

What is Reciprocal Rank Fusion and why is it used in hybrid retrieval?

Reciprocal Rank Fusion (RRF) merges ranked lists from different retrieval methods (e.g., BM25 and vector search) into a single ranking. For each document, RRF computes a score of 1/(k + rank) from each list and sums them. Documents that rank highly in multiple lists get boosted. RRF is preferred over simple score normalization because scores from different retrieval systems are not comparable (BM25 scores and cosine similarities have different scales and distributions). RRF only uses rank positions, making it robust across any combination of retrieval methods.
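As a concrete illustration of the formula (hypothetical ranks, k = 60): a document ranked 2nd by BM25 and 5th by vector search beats a document ranked 1st in only one list, because contributions from multiple lists sum:

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """RRF score for a document given its 1-indexed rank in each result list."""
    return sum(1.0 / (k + r) for r in ranks)


# Document A: rank 2 in BM25, rank 5 in vector search
score_a = rrf_score([2, 5])  # 1/62 + 1/65
# Document B: rank 1 in BM25, absent from vector results
score_b = rrf_score([1])     # 1/61
```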

