Basic RAG is straightforward: chunk your documents, embed them, retrieve the top-K most similar chunks, and feed them to an LLM. It works well on small knowledge bases with clear, well-structured content. But at scale --- thousands of documents, ambiguous queries, mixed content types --- basic RAG falls apart. Answers become irrelevant, hallucinations increase, and users lose trust.
This guide covers the techniques that bridge the gap between a RAG demo and a production system: advanced chunking, re-ranking, query transformation, hybrid retrieval, and rigorous evaluation.
TL;DR:
- Basic RAG fails at scale because fixed-size chunking breaks semantic boundaries, single-vector retrieval misses lexical matches, and top-K ranking without re-scoring returns noisy results.
- Use semantic or document-aware chunking to preserve meaning, hybrid retrieval (BM25 + dense vectors) to cover both keyword and semantic matches, and cross-encoder re-ranking to push the most relevant results to the top.
- Query transformation techniques like HyDE and multi-query expansion dramatically improve recall on ambiguous or under-specified queries. Measure everything with MRR, NDCG, and Recall@K.
Our methodology
This guide synthesizes operational specifics from three categories of sources:
- Production code patterns from open-source repos (e.g., LangChain, LlamaIndex, pgvector documentation, and HuggingFace examples)
- Academic research on retrieval and generation published on arXiv and in conference proceedings
- Practitioner discussions in r/MachineLearning, r/LocalLLaMA, and r/LangChain, where engineers report the constraints they actually hit in production RAG pipelines
We avoided pure marketing claims and prioritized examples that ship in real codebases. Where we cite latency or accuracy numbers, the methodology, dataset, or test conditions are noted alongside. Last reviewed: April 2026.
Why Basic RAG Fails at Scale
Consider a knowledge base with 10,000 documents covering product documentation, API references, troubleshooting guides, and policy pages. A customer asks: "Why is my webhook not firing?"
Basic RAG with fixed-size chunking and pure vector search might return:
- A chunk about webhook configuration (relevant)
- A chunk about webhook pricing tiers (irrelevant --- "webhook" keyword match inflated the embedding similarity)
- A chunk from the changelog mentioning webhooks (irrelevant --- recency, not relevance)
- A chunk about API rate limits that mentions "firing" in a different context (irrelevant)
Only one of four retrieved chunks is useful. The LLM now has to generate an answer from 75% noise. This is the needle-in-a-haystack problem at scale, and it manifests through three failure modes:
| Failure Mode | Root Cause | Impact |
|---|---|---|
| Semantic boundary breaks | Fixed-size chunks split sentences and paragraphs mid-thought | Retrieved chunks lack coherent context |
| Vocabulary mismatch | Pure vector search misses exact terms (error codes, product names) | Relevant documents with exact matches rank lower than semantic near-misses |
| Ranking noise | Top-K by cosine similarity alone does not account for query-document relevance depth | Superficially similar but unhelpful chunks displace truly relevant ones |
The rest of this guide addresses each failure mode with specific techniques and code.
Chunking Strategies Compared
Chunking is the most underestimated component of a RAG pipeline. The way you split documents determines the upper bound of your retrieval quality --- no amount of re-ranking can fix a chunk that lost its meaning at the boundary.
Fixed-Size Chunking
Split text every N tokens with an overlap window.
```typescript
function fixedSizeChunk(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push(words.slice(start, end).join(" "));
    start += chunkSize - overlap;
  }
  return chunks;
}

// Usage: 500-word chunks with 100-word overlap
const chunks = fixedSizeChunk(documentText, 500, 100);
```
Pros: Simple, predictable chunk sizes, works with any content.
Cons: Splits mid-sentence, mid-paragraph, even mid-code-block. The overlap helps but does not solve the fundamental problem of breaking semantic units.
Best for: Quick prototypes and uniformly structured content.
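The examples throughout this guide call an `estimateTokens` / `estimate_tokens` helper without defining it. A minimal sketch, assuming the common rule of thumb of roughly 4 characters per token for English text (swap in a real tokenizer such as `tiktoken` for production counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: ~4 characters per token for English text."""
    return max(1, len(text) // 4)
```

This is deliberately cheap: chunkers call it once per sentence or paragraph, so an exact tokenizer is often overkill at indexing time.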
Sentence-Based Chunking
Group sentences up to a token budget, respecting sentence boundaries.
```typescript
function sentenceChunk(text: string, maxTokens: number): string[] {
  // Split on sentence boundaries
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentTokens = 0;
  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);
    if (currentTokens + sentenceTokens > maxTokens && currentChunk.length > 0) {
      chunks.push(currentChunk.join(" "));
      currentChunk = [];
      currentTokens = 0;
    }
    currentChunk.push(sentence.trim());
    currentTokens += sentenceTokens;
  }
  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(" "));
  }
  return chunks;
}
```
Pros: Never splits mid-sentence. More coherent chunks than fixed-size.
Cons: Sentence detection is fragile (abbreviations, decimal numbers, URLs). Does not respect higher-level structure like sections or paragraphs.
Semantic Chunking
Use embedding similarity between consecutive sentences to detect topic shifts. When similarity drops below a threshold, insert a chunk boundary.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.75, max_tokens: int = 800) -> list[str]:
    """Split text at semantic boundaries using embedding similarity."""
    # Assumes split_sentences() and estimate_tokens() helpers are defined
    sentences = split_sentences(text)
    # Embed each sentence
    response = client.embeddings.create(
        input=sentences,
        model="text-embedding-3-small"
    )
    embeddings = [e.embedding for e in response.data]
    # Compute cosine similarity between consecutive sentences
    chunks = []
    current_chunk = [sentences[0]]
    current_tokens = estimate_tokens(sentences[0])
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        sent_tokens = estimate_tokens(sentences[i])
        # Break if similarity drops below threshold OR token budget exceeded
        if sim < threshold or current_tokens + sent_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
            current_tokens = sent_tokens
        else:
            current_chunk.append(sentences[i])
            current_tokens += sent_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
Pros: Chunks align with topic boundaries. Adjacent sentences about the same topic stay together.
Cons: Requires embedding every sentence (expensive for large corpora). Threshold tuning is dataset-specific. Can produce highly variable chunk sizes.
Document-Aware Chunking
Parse the document structure (headings, sections, code blocks, lists) and chunk at structural boundaries. This is the most effective strategy for structured content like documentation, help articles, and technical guides.
```typescript
interface DocumentSection {
  heading: string;
  level: number; // h1=1, h2=2, etc.
  content: string;
  codeBlocks: string[];
}

function documentAwareChunk(
  markdown: string,
  maxTokens: number = 800
): Array<{ content: string; metadata: { heading: string; level: number } }> {
  const sections = parseMarkdownSections(markdown);
  const chunks: Array<{ content: string; metadata: { heading: string; level: number } }> = [];
  for (const section of sections) {
    const sectionText = `## ${section.heading}\n\n${section.content}`;
    const tokens = estimateTokens(sectionText);
    if (tokens <= maxTokens) {
      // Section fits in one chunk --- keep it whole
      chunks.push({
        content: sectionText,
        metadata: { heading: section.heading, level: section.level },
      });
    } else {
      // Section too large --- split by paragraphs, keeping heading as prefix
      const paragraphs = section.content.split(/\n\n+/);
      let currentContent = `## ${section.heading}\n\n`;
      let currentTokens = estimateTokens(currentContent);
      for (const para of paragraphs) {
        const paraTokens = estimateTokens(para);
        if (currentTokens + paraTokens > maxTokens && currentTokens > estimateTokens(`## ${section.heading}\n\n`)) {
          chunks.push({
            content: currentContent.trim(),
            metadata: { heading: section.heading, level: section.level },
          });
          currentContent = `## ${section.heading} (continued)\n\n`;
          currentTokens = estimateTokens(currentContent);
        }
        currentContent += para + "\n\n";
        currentTokens += paraTokens;
      }
      if (currentTokens > estimateTokens(`## ${section.heading}\n\n`)) {
        chunks.push({
          content: currentContent.trim(),
          metadata: { heading: section.heading, level: section.level },
        });
      }
    }
  }
  return chunks;
}
```
Pros: Preserves document structure. Headings provide natural metadata for filtering. Code blocks stay intact.
Cons: Requires a parser for each content format (Markdown, HTML, PDF). Does not work well for unstructured content like emails or chat logs.
Parent-Child Chunking
Store both large parent chunks and smaller child chunks. Retrieve on child chunks (more precise), but pass parent chunks to the LLM (more context).
```typescript
interface ParentChildChunk {
  parentId: string;
  parentContent: string; // ~1500 tokens
  children: Array<{
    childId: string;
    content: string; // ~300 tokens
    embedding: number[];
  }>;
}

async function parentChildRetrieval(
  query: string,
  topK: number = 5,
  expandToParent: boolean = true
): Promise<string[]> {
  const queryEmbedding = await embed(query);
  // Search against child chunks for precision
  const childResults = await vectorDb.search({
    embedding: queryEmbedding,
    table: "child_chunks",
    limit: topK,
  });
  if (!expandToParent) {
    return childResults.map((r) => r.content);
  }
  // Expand to parent chunks, deduplicating
  const parentIds = [...new Set(childResults.map((r) => r.parentId))];
  const parents = await db.query(
    `SELECT content FROM parent_chunks WHERE id = ANY($1)`,
    [parentIds]
  );
  return parents.map((p) => p.content);
}
```
Pros: Best of both worlds --- precise retrieval with rich context. The LLM gets surrounding paragraphs that help it reason.
Cons: Storage overhead (you store both granularities). Requires a join at query time.
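The retrieval code above assumes the two granularities already exist in the store. A minimal indexing-side sketch, using word counts as a stand-in for tokens and illustrative `uuid4` IDs (a real pipeline would split on document structure rather than fixed word windows):

```python
import uuid

def build_parent_child_chunks(text: str, parent_size: int = 1500,
                              child_size: int = 300) -> list[dict]:
    """Split text into large parent chunks, then split each parent into
    small child chunks that carry their parent's id for query-time joins."""
    words = text.split()
    records = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = str(uuid.uuid4())
        children = []
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "child_id": str(uuid.uuid4()),
                "parent_id": parent_id,
                "content": " ".join(parent_words[c_start:c_start + child_size]),
            })
        records.append({
            "parent_id": parent_id,
            "parent_content": " ".join(parent_words),
            "children": children,
        })
    return records
```

Only the child chunks get embedded; the parent rows are stored as plain text and fetched by `parent_id` at query time.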
Chunking Strategy Decision Matrix
| Strategy | Best For | Chunk Quality | Implementation Cost | Embedding Cost |
|---|---|---|---|---|
| Fixed-size | Prototypes, uniform text | Low | Minimal | Low |
| Sentence-based | Articles, prose | Medium | Low | Low |
| Semantic | Mixed-topic documents | High | Medium | High (per-sentence) |
| Document-aware | Structured docs (Markdown, HTML) | High | Medium | Low |
| Parent-child | Complex docs needing context | Highest | High | Medium |
For most production systems, document-aware chunking with a parent-child expansion strategy offers the best trade-off.
Re-Ranking: Pushing the Best Results to the Top
Vector similarity gives you a rough ordering. Re-ranking refines it. A re-ranker takes the query and each candidate chunk as a pair and scores them for relevance --- typically using a cross-encoder model that sees both texts simultaneously.
Why Re-Ranking Matters
Bi-encoder search (the standard embedding approach) encodes query and document independently. It cannot capture fine-grained interactions between query terms and document terms. A cross-encoder processes both together, enabling attention across the query-document pair.
Bi-encoder (embedding search):

```
Query → Encoder → query_vec ─┐
                             ├→ cosine_sim → score
Doc   → Encoder → doc_vec ───┘
```

Cross-encoder (re-ranking):

```
[Query + Doc] → Encoder → relevance_score
```
The cross-encoder is more accurate but also more expensive --- you cannot run it over millions of documents. The standard pattern is a two-stage pipeline:
- Stage 1 (Retrieval): Fast bi-encoder search returns top-50 to top-100 candidates.
- Stage 2 (Re-ranking): Cross-encoder scores and re-orders the candidates, returning top-5 to top-10.
Implementation with Cohere Rerank
```typescript
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function retrieveAndRerank(
  query: string,
  topK: number = 5
): Promise<Array<{ content: string; relevanceScore: number }>> {
  // Stage 1: Retrieve top-50 candidates via vector search
  const candidates = await vectorSearch(query, 50);
  // Stage 2: Re-rank with Cohere
  const reranked = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: candidates.map((c) => ({ text: c.content })),
    topN: topK,
  });
  return reranked.results.map((r) => ({
    content: candidates[r.index].content,
    relevanceScore: r.relevanceScore,
  }));
}
```
ColBERT: Late Interaction Re-Ranking
ColBERT uses a "late interaction" approach: it encodes query and document separately (like a bi-encoder) but compares them at the token level rather than the vector level. Each query token attends to the most similar document token via MaxSim.
```python
# ColBERT-style scoring (simplified)
import torch

def colbert_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> float:
    """
    query_embeddings: (num_query_tokens, dim)
    doc_embeddings:   (num_doc_tokens, dim)
    """
    # For each query token, find max similarity with any doc token
    similarity_matrix = query_embeddings @ doc_embeddings.T  # (Q, D)
    max_sim_per_query_token = similarity_matrix.max(dim=1).values  # (Q,)
    return max_sim_per_query_token.sum().item()
```
ColBERT is faster than a full cross-encoder because document embeddings can be precomputed and indexed. It is a strong middle ground when you need re-ranking quality without the latency of running a cross-encoder on every candidate.
When to Add Re-Ranking
| Signal | Action |
|---|---|
| Retrieval recall is high but precision is low | Add re-ranking --- you are finding relevant docs but not surfacing them first |
| Retrieval recall is low | Fix chunking and retrieval first --- re-ranking cannot surface docs you did not retrieve |
| Latency budget is tight (<500ms total) | Use ColBERT or skip re-ranking; cross-encoders add 100--300ms |
| Knowledge base is small (<500 docs) | Re-ranking may not be necessary; top-K from vector search is likely sufficient |
Query Transformation
Sometimes the problem is not retrieval but the query itself. Users ask vague, under-specified, or poorly phrased questions. Query transformation techniques rewrite the query before retrieval to improve recall.
HyDE (Hypothetical Document Embedding)
Instead of embedding the raw query, ask the LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. The hypothesis is closer in embedding space to actual relevant documents than the raw question.
```typescript
async function hydeRetrieval(query: string, topK: number = 5): Promise<string[]> {
  // Step 1: Generate a hypothetical answer
  const hypothetical = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Given a question, write a short paragraph that would be a good
answer to it. Do not say "I don't know." Write as if you are a knowledgeable
support article, even if you have to guess.`,
      },
      { role: "user", content: query },
    ],
  });
  // Step 2: Embed the hypothetical answer (not the original query)
  const hydeEmbedding = await embed(hypothetical.content);
  // Step 3: Search with the hypothetical embedding
  return vectorSearch(hydeEmbedding, topK);
}
```
When HyDE helps: Vague queries like "it's not working" where the raw embedding is too generic. The hypothetical answer adds specificity.
When HyDE hurts: Specific, well-formed queries where the hypothesis might hallucinate and drift from the actual intent.
Multi-Query Expansion
Generate multiple reformulations of the query, retrieve with each, and merge the results. This casts a wider net.
```typescript
async function multiQueryRetrieval(
  query: string,
  topK: number = 5
): Promise<string[]> {
  // Generate 3 reformulations
  const expansions = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Generate 3 alternative phrasings of the following question.
Return a JSON object of the form {"queries": ["...", "...", "..."]}.
Each rephrasing should approach the question from a different angle.`,
      },
      { role: "user", content: query },
    ],
    response_format: { type: "json_object" },
  });
  const queries = [query, ...JSON.parse(expansions.content).queries];
  // Retrieve with each query
  const allResults = await Promise.all(
    queries.map((q) => vectorSearch(q, topK * 2))
  );
  // Merge with Reciprocal Rank Fusion
  return reciprocalRankFusion(allResults, topK);
}
```
Step-Back Prompting
For specific queries, generate a more general "step-back" question that retrieves broader context. Then retrieve with both the original and step-back query.
```typescript
// Original:  "Why does my webhook return 403 when using OAuth?"
// Step-back: "How does webhook authentication work?"
async function stepBackRetrieval(query: string, topK: number = 5): Promise<string[]> {
  const stepBack = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Given a specific question, generate a more general question that
would retrieve useful background context. Return only the question.`,
      },
      { role: "user", content: query },
    ],
  });
  const [specificResults, broadResults] = await Promise.all([
    vectorSearch(query, topK),
    vectorSearch(stepBack.content, topK),
  ]);
  // Combine: prioritize specific results, supplement with broad context
  return deduplicateAndMerge(specificResults, broadResults, topK);
}
```
Hybrid Retrieval Pipeline
The highest-performing retrieval systems combine multiple search methods and merge results. Here is a production pipeline combining BM25 keyword search, dense vector search, and metadata filtering:
```typescript
interface RetrievalResult {
  chunkId: string;
  content: string;
  score: number;
  source: "bm25" | "vector" | "metadata";
}

async function hybridRetrieval(
  query: string,
  filters?: { category?: string; updatedAfter?: string },
  topK: number = 10
): Promise<RetrievalResult[]> {
  // Run all three retrieval methods in parallel
  const [bm25Results, vectorResults, metadataResults] = await Promise.all([
    // BM25 keyword search (handles exact matches, error codes, product names)
    bm25Search(query, topK * 3),
    // Dense vector search (handles semantic similarity)
    vectorSearch(query, topK * 3),
    // Metadata-filtered search (narrows by category, date, etc.)
    filters ? metadataFilteredSearch(query, filters, topK * 2) : Promise.resolve([]),
  ]);
  // Merge with Reciprocal Rank Fusion
  return reciprocalRankFusion(
    [bm25Results, vectorResults, metadataResults],
    topK
  );
}

function reciprocalRankFusion(
  resultSets: RetrievalResult[][],
  topK: number,
  k: number = 60 // RRF constant
): RetrievalResult[] {
  const scores = new Map<string, { score: number; result: RetrievalResult }>();
  for (const results of resultSets) {
    for (let rank = 0; rank < results.length; rank++) {
      const result = results[rank];
      const rrfScore = 1 / (k + rank + 1);
      const existing = scores.get(result.chunkId);
      if (existing) {
        existing.score += rrfScore;
      } else {
        scores.set(result.chunkId, { score: rrfScore, result });
      }
    }
  }
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => ({ ...s.result, score: s.score }));
}
```
Full Pipeline: Retrieve, Fuse, Re-Rank, Generate
Putting it all together:
```typescript
async function advancedRAGPipeline(
  query: string,
  conversationHistory: Message[]
): Promise<string> {
  // 1. Query transformation: expand into multiple queries
  const expandedQueries = await expandQuery(query);
  // 2. Hybrid retrieval for each expanded query
  const allResults = await Promise.all(
    expandedQueries.map((q) => hybridRetrieval(q, undefined, 20))
  );
  // 3. Fuse results across all queries
  const fused = reciprocalRankFusion(allResults, 20);
  // 4. Re-rank with cross-encoder
  const reranked = await cohereRerank(query, fused, 5);
  // 5. Assemble context and generate
  const context = reranked.map((r) => r.content).join("\n\n---\n\n");
  const response = await llm.chat({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer the user's question using ONLY the provided context.
If the context does not contain enough information, say so.
Do not make up information.

Context:
${context}`,
      },
      ...conversationHistory,
      { role: "user", content: query },
    ],
  });
  return response.content;
}
```
Evaluation Metrics
You cannot optimize what you do not measure. Here are the metrics that matter for RAG retrieval quality:
MRR (Mean Reciprocal Rank)
The average of 1/rank for the first relevant result across queries. MRR = 1.0 means the correct answer is always first.
```python
def mean_reciprocal_rank(queries: list[dict]) -> float:
    """
    queries: [{ "query": str, "retrieved": [str], "relevant": set[str] }]
    """
    reciprocal_ranks = []
    for q in queries:
        for rank, doc_id in enumerate(q["retrieved"], 1):
            if doc_id in q["relevant"]:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
NDCG (Normalized Discounted Cumulative Gain)
Accounts for the position of all relevant results, not just the first. Higher-ranked relevant results contribute more to the score.
```python
import numpy as np

def ndcg_at_k(retrieved: list[str], relevant: dict[str, int], k: int) -> float:
    """
    relevant: { doc_id: relevance_grade } where grade is 0, 1, 2, 3
    """
    dcg = 0.0
    for i, doc_id in enumerate(retrieved[:k]):
        rel = relevant.get(doc_id, 0)
        dcg += (2**rel - 1) / np.log2(i + 2)  # i+2 because log2(1) = 0
    # Ideal DCG: sort by relevance grade descending
    ideal_rels = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum((2**rel - 1) / np.log2(i + 2) for i, rel in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0
```
Recall@K
What fraction of all relevant documents appear in the top-K results. Critical for ensuring you do not miss important context.
```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    retrieved_set = set(retrieved[:k])
    return len(retrieved_set & relevant) / len(relevant) if relevant else 0.0
```
Practical Benchmarks
| Metric | Baseline (Basic RAG) | +Semantic Chunking | +Hybrid Retrieval | +Re-ranking | +Query Expansion |
|---|---|---|---|---|---|
| MRR | 0.52 | 0.61 | 0.71 | 0.82 | 0.85 |
| NDCG@5 | 0.45 | 0.54 | 0.65 | 0.76 | 0.79 |
| Recall@10 | 0.60 | 0.68 | 0.79 | 0.82 | 0.88 |
Each optimization layer compounds. The full pipeline typically achieves 60--80% improvement over basic RAG in retrieval quality.
Production Checklist
Caching
- Embedding cache: Hash the input text and cache embeddings. Same query should not be embedded twice.
- Result cache: Cache retrieval results for frequent queries (TTL of 5--15 minutes depending on update frequency).
- LLM response cache: For identical query + context combinations, return cached responses.
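A minimal sketch of the embedding cache, keyed by a SHA-256 hash of the normalized input text. The in-process dict and the injected `embed_fn` are illustrative; production systems typically back this with Redis and wrap the real embeddings API call:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a hash of the normalized input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to your embeddings API
        self.store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        # Normalize before hashing so trivial variants share an entry
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        embedding = self.embed_fn(text)
        self.store[key] = embedding
        return embedding
```

Normalizing (strip, lowercase) before hashing means "Hello" and "hello " hit the same entry; whether that is safe depends on whether your embedding model is case-sensitive for your domain.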
Batching
- Batch embedding requests: most APIs support up to 2,048 inputs per call. Group chunks during indexing.
- Batch re-ranking: send all candidates in a single API call rather than one at a time.
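The batching rule above can be sketched as a simple grouper. The 2,048 figure matches the per-call input limit of some embedding APIs at the time of writing; check your provider's documented limit:

```python
def batch_inputs(items: list[str], max_batch_size: int = 2048) -> list[list[str]]:
    """Group inputs into batches no larger than the API's per-call limit."""
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]
```

During indexing, iterate over the batches and issue one embeddings call per batch instead of one per chunk.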
Latency Budgets
| Stage | Budget | Typical |
|---|---|---|
| Query expansion | 150ms | 80--120ms |
| Hybrid retrieval (parallel) | 100ms | 30--80ms |
| RRF fusion | 5ms | 1--3ms |
| Re-ranking | 200ms | 100--250ms |
| LLM generation | 2000ms | 800--2500ms |
| Total | 2500ms | 1000--3000ms |
Retrieval should be the fast part. If retrieval exceeds 300ms, profile your vector index configuration and database connection pooling.
Monitoring
Track these in production:
- Retrieval latency P50/P95/P99 per stage
- Empty result rate: queries that return zero relevant chunks after re-ranking
- Re-ranker score distribution: if scores cluster near zero, your retrieval is returning poor candidates
- Cache hit rate: target >30% for embedding cache, >10% for result cache
- Token usage: track tokens consumed by query expansion, context assembly, and generation separately
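A minimal sketch of per-stage latency percentiles from raw samples, using the nearest-rank method. A production system would use streaming histograms or a metrics backend (e.g. Prometheus) rather than sorting raw samples:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_summary(samples: list[float]) -> dict[str, float]:
    """P50/P95/P99 summary for one pipeline stage."""
    return {name: percentile(samples, q)
            for name, q in [("p50", 50), ("p95", 95), ("p99", 99)]}
```

Track one summary per stage (expansion, retrieval, fusion, re-ranking, generation) so a regression in any one stage is visible against its budget from the table above.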
Key Takeaways
- Chunking is the foundation. No amount of re-ranking fixes a chunk that lost its meaning at a boundary. Use document-aware or semantic chunking for production systems.
- Hybrid retrieval is non-negotiable at scale. BM25 handles exact matches (error codes, names); vectors handle semantics. Combine with Reciprocal Rank Fusion.
- Re-ranking is the highest-ROI optimization. A cross-encoder re-ranker on top-50 candidates typically improves MRR by 15--20 points.
- Transform the query, not just the retrieval. HyDE, multi-query expansion, and step-back prompting improve recall on ambiguous queries.
- Measure relentlessly. MRR, NDCG, and Recall@K on a labeled eval set are the only reliable signals. Vibes-based evaluation does not scale.
When advanced RAG is the wrong investment
- Knowledge bases under a few hundred well-structured pages where naive top-K retrieval already produces grounded answers
- Use cases dominated by structured queries (order status, account balance) where SQL or API calls beat any retrieval pipeline
- Teams without an evaluation set, since chunking and re-ranking changes need measurable wins to justify the added complexity
- Latency-sensitive voice or chat flows where a re-ranker plus query rewrite blows the turn budget
- Static FAQ-style corpora where periodic full reindexing with simple chunks is cheaper than a sophisticated pipeline
- Early stage products that have not yet shipped a basic RAG baseline to real users
Frequently Asked Questions
What is the difference between basic RAG and advanced RAG?
Basic RAG uses fixed-size chunking, single-vector retrieval, and top-K ranking by cosine similarity. Advanced RAG adds semantic or document-aware chunking, hybrid retrieval (BM25 + dense vectors), cross-encoder re-ranking, and query transformation techniques. The difference in practice is significant: advanced RAG typically achieves 60--80% higher retrieval quality (MRR, NDCG) on real-world knowledge bases compared to basic RAG.
Which chunking strategy should I use?
For structured content like documentation or help articles, use document-aware chunking that respects headings, sections, and code blocks. For unstructured content like emails or chat logs, use semantic chunking based on embedding similarity between consecutive sentences. If you need both precise retrieval and rich context for the LLM, add a parent-child layer where you retrieve on small child chunks but expand to larger parent chunks.
When should I add a re-ranker to my RAG pipeline?
Add a re-ranker when your retrieval recall is high but precision is low --- meaning relevant documents are in your candidate set but not ranked at the top. If your vector search returns 50 candidates and the correct answer is usually somewhere in the set but not in the top 5, a cross-encoder re-ranker will dramatically improve results. Skip re-ranking if your knowledge base is small (<500 documents) or your latency budget is very tight.
How do I evaluate my RAG pipeline?
Build a labeled evaluation set of at least 100 query-document pairs where humans have marked which documents are relevant to each query. Compute MRR (how quickly you find the first relevant result), NDCG@K (quality of the full ranking), and Recall@K (what fraction of relevant documents you retrieve). Run these metrics after each change to your pipeline --- chunking, retrieval, or re-ranking --- to measure the impact.
What is Reciprocal Rank Fusion and why is it used in hybrid retrieval?
Reciprocal Rank Fusion (RRF) merges ranked lists from different retrieval methods (e.g., BM25 and vector search) into a single ranking. For each document, RRF computes a score of 1/(k + rank) from each list and sums them. Documents that rank highly in multiple lists get boosted. RRF is preferred over simple score normalization because scores from different retrieval systems are not comparable (BM25 scores and cosine similarities have different scales and distributions). RRF only uses rank positions, making it robust across any combination of retrieval methods.
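A worked example of the formula, with the conventional k = 60 and two hypothetical ranked lists:

```python
def rrf(rank_lists: list[list[str]], k: int = 60) -> list[str]:
    """Sum 1/(k + rank) across lists; rank is 1-based."""
    scores: dict[str, float] = {}
    for ranking in rank_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2"]
dense = ["d2", "d3"]
# "d2" scores 1/62 + 1/61 from appearing in both lists, so it outranks
# "d1" (1/61) despite never being ranked first
rrf([bm25, dense])  # → ["d2", "d1", "d3"]
```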