Basic RAG is straightforward: chunk your documents, embed them, retrieve the top-K most similar chunks, and feed them to an LLM. It works well on small knowledge bases with clear, well-structured content. But at scale --- thousands of documents, ambiguous queries, mixed content types --- basic RAG falls apart. Answers become irrelevant, hallucinations increase, and users lose trust.
This guide covers the techniques that bridge the gap between a RAG demo and a production system: advanced chunking, re-ranking, query transformation, hybrid retrieval, and rigorous evaluation.
TL;DR:
- Basic RAG fails at scale because fixed-size chunking breaks semantic boundaries, single-vector retrieval misses lexical matches, and top-K ranking without re-scoring returns noisy results.
- Use semantic or document-aware chunking to preserve meaning, hybrid retrieval (BM25 + dense vectors) to cover both keyword and semantic matches, and cross-encoder re-ranking to push the most relevant results to the top.
- Query transformation techniques like HyDE and multi-query expansion dramatically improve recall on ambiguous or under-specified queries. Measure everything with MRR, NDCG, and Recall@K.
Our methodology
This guide synthesizes operational specifics from three categories of sources:
- Production code patterns from open-source repos (e.g., LangChain, LlamaIndex, pgvector documentation, and HuggingFace examples)
- Academic research on retrieval and generation published on arXiv and in conference proceedings
- Practitioner discussions in r/MachineLearning, r/LocalLLaMA, and r/LangChain, where engineers report the constraints they actually hit in production RAG pipelines
We avoided pure marketing claims and prioritized examples that ship in real codebases. Where we cite latency or accuracy numbers, the methodology, dataset, or test conditions are noted alongside. Last reviewed: April 2026.
Why Basic RAG Fails at Scale
Consider a knowledge base with 10,000 documents covering product documentation, API references, troubleshooting guides, and policy pages. A customer asks: "Why is my webhook not firing?"
Basic RAG with fixed-size chunking and pure vector search might return:
- A chunk about webhook configuration (relevant)
- A chunk about webhook pricing tiers (irrelevant --- "webhook" keyword match inflated the embedding similarity)
- A chunk from the changelog mentioning webhooks (irrelevant --- recency, not relevance)
- A chunk about API rate limits that mentions "firing" in a different context (irrelevant)
Only one of four retrieved chunks is useful. The LLM now has to generate an answer from 75% noise. This is the needle-in-a-haystack problem at scale, and it manifests through three failure modes:
| Failure Mode | Root Cause | Impact |
|---|---|---|
| Semantic boundary breaks | Fixed-size chunks split sentences and paragraphs mid-thought | Retrieved chunks lack coherent context |
| Vocabulary mismatch | Pure vector search misses exact terms (error codes, product names) | Relevant documents with exact matches rank lower than semantic near-misses |
| Ranking noise | Top-K by cosine similarity alone does not account for query-document relevance depth | Superficially similar but unhelpful chunks displace truly relevant ones |
The rest of this guide addresses each failure mode with specific techniques and code.
Chunking Strategies Compared
Chunking is the most underestimated component of a RAG pipeline. The way you split documents determines the upper bound of your retrieval quality --- no amount of re-ranking can fix a chunk that lost its meaning at the boundary.
Fixed-Size Chunking
Split text every N tokens with an overlap window.
```typescript
function fixedSizeChunk(text: string, chunkSize: number, overlap: number): string[] {
  const words = text.split(/\s+/);
  const chunks: string[] = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push(words.slice(start, end).join(" "));
    start += chunkSize - overlap;
  }
  return chunks;
}

// Usage: 500-word chunks with 100-word overlap
const chunks = fixedSizeChunk(documentText, 500, 100);
```
Pros: Simple, predictable chunk sizes, works with any content.
Cons: Splits mid-sentence, mid-paragraph, even mid-code-block. The overlap helps but does not solve the fundamental problem of breaking semantic units.
Best for: Quick prototypes and uniformly structured content.
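The examples throughout this guide call an `estimateTokens` / `estimate_tokens` helper without defining it. A minimal sketch, assuming the common rule of thumb of roughly 4 characters per token for English text (swap in a real tokenizer such as `tiktoken` for production counts):

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: ~4 characters per token for English text."""
    return max(1, len(text) // 4)
```

This is deliberately cheap: chunkers call it once per sentence or paragraph, so an exact tokenizer is often overkill at indexing time.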
Sentence-Based Chunking
Group sentences up to a token budget, respecting sentence boundaries.
```typescript
function sentenceChunk(text: string, maxTokens: number): string[] {
  // Split on sentence boundaries
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const chunks: string[] = [];
  let currentChunk: string[] = [];
  let currentTokens = 0;
  for (const sentence of sentences) {
    const sentenceTokens = estimateTokens(sentence);
    if (currentTokens + sentenceTokens > maxTokens && currentChunk.length > 0) {
      chunks.push(currentChunk.join(" "));
      currentChunk = [];
      currentTokens = 0;
    }
    currentChunk.push(sentence.trim());
    currentTokens += sentenceTokens;
  }
  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(" "));
  }
  return chunks;
}
```
Pros: Never splits mid-sentence. More coherent chunks than fixed-size.
Cons: Sentence detection is fragile (abbreviations, decimal numbers, URLs). Does not respect higher-level structure like sections or paragraphs.
Semantic Chunking
Use embedding similarity between consecutive sentences to detect topic shifts. When similarity drops below a threshold, insert a chunk boundary.
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(text: str, threshold: float = 0.75, max_tokens: int = 800) -> list[str]:
    """Split text at semantic boundaries using embedding similarity."""
    # Assumes split_sentences() and estimate_tokens() helpers are defined
    sentences = split_sentences(text)
    # Embed each sentence
    response = client.embeddings.create(
        input=sentences,
        model="text-embedding-3-small"
    )
    embeddings = [e.embedding for e in response.data]
    # Compute cosine similarity between consecutive sentences
    chunks = []
    current_chunk = [sentences[0]]
    current_tokens = estimate_tokens(sentences[0])
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        sent_tokens = estimate_tokens(sentences[i])
        # Break if similarity drops below threshold OR token budget exceeded
        if sim < threshold or current_tokens + sent_tokens > max_tokens:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
            current_tokens = sent_tokens
        else:
            current_chunk.append(sentences[i])
            current_tokens += sent_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
Pros: Chunks align with topic boundaries. Adjacent sentences about the same topic stay together.
Cons: Requires embedding every sentence (expensive for large corpora). Threshold tuning is dataset-specific. Can produce highly variable chunk sizes.
Document-Aware Chunking
Parse the document structure (headings, sections, code blocks, lists) and chunk at structural boundaries. This is the most effective strategy for structured content like documentation, help articles, and technical guides.
```typescript
interface DocumentSection {
  heading: string;
  level: number; // h1=1, h2=2, etc.
  content: string;
  codeBlocks: string[];
}

function documentAwareChunk(
  markdown: string,
  maxTokens: number = 800
): Array<{ content: string; metadata: { heading: string; level: number } }> {
  const sections = parseMarkdownSections(markdown);
  const chunks: Array<{ content: string; metadata: { heading: string; level: number } }> = [];
  for (const section of sections) {
    const sectionText = `## ${section.heading}\n\n${section.content}`;
    const tokens = estimateTokens(sectionText);
    if (tokens <= maxTokens) {
      // Section fits in one chunk --- keep it whole
      chunks.push({
        content: sectionText,
        metadata: { heading: section.heading, level: section.level },
      });
    } else {
      // Section too large --- split by paragraphs, keeping heading as prefix
      const paragraphs = section.content.split(/\n\n+/);
      let currentContent = `## ${section.heading}\n\n`;
      let currentTokens = estimateTokens(currentContent);
      for (const para of paragraphs) {
        const paraTokens = estimateTokens(para);
        if (currentTokens + paraTokens > maxTokens && currentTokens > estimateTokens(`## ${section.heading}\n\n`)) {
          chunks.push({
            content: currentContent.trim(),
            metadata: { heading: section.heading, level: section.level },
          });
          currentContent = `## ${section.heading} (continued)\n\n`;
          currentTokens = estimateTokens(currentContent);
        }
        currentContent += para + "\n\n";
        currentTokens += paraTokens;
      }
      if (currentTokens > estimateTokens(`## ${section.heading}\n\n`)) {
        chunks.push({
          content: currentContent.trim(),
          metadata: { heading: section.heading, level: section.level },
        });
      }
    }
  }
  return chunks;
}
```
Pros: Preserves document structure. Headings provide natural metadata for filtering. Code blocks stay intact.
Cons: Requires a parser for each content format (Markdown, HTML, PDF). Does not work well for unstructured content like emails or chat logs.
Parent-Child Chunking
Store both large parent chunks and smaller child chunks. Retrieve on child chunks (more precise), but pass parent chunks to the LLM (more context).
```typescript
interface ParentChildChunk {
  parentId: string;
  parentContent: string; // ~1500 tokens
  children: Array<{
    childId: string;
    content: string; // ~300 tokens
    embedding: number[];
  }>;
}

async function parentChildRetrieval(
  query: string,
  topK: number = 5,
  expandToParent: boolean = true
): Promise<string[]> {
  const queryEmbedding = await embed(query);
  // Search against child chunks for precision
  const childResults = await vectorDb.search({
    embedding: queryEmbedding,
    table: "child_chunks",
    limit: topK,
  });
  if (!expandToParent) {
    return childResults.map((r) => r.content);
  }
  // Expand to parent chunks, deduplicating
  const parentIds = [...new Set(childResults.map((r) => r.parentId))];
  const parents = await db.query(
    `SELECT content FROM parent_chunks WHERE id = ANY($1)`,
    [parentIds]
  );
  return parents.map((p) => p.content);
}
```
Pros: Best of both worlds --- precise retrieval with rich context. The LLM gets surrounding paragraphs that help it reason.
Cons: Storage overhead (you store both granularities). Requires a join at query time.
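The retrieval code above assumes the two granularities already exist in the store. A minimal indexing-side sketch, using word counts as a stand-in for tokens and illustrative `uuid4` IDs (a real pipeline would split on document structure rather than fixed word windows):

```python
import uuid

def build_parent_child_chunks(text: str, parent_size: int = 1500,
                              child_size: int = 300) -> list[dict]:
    """Split text into large parent chunks, then split each parent into
    small child chunks that carry their parent's id for query-time joins."""
    words = text.split()
    records = []
    for p_start in range(0, len(words), parent_size):
        parent_words = words[p_start:p_start + parent_size]
        parent_id = str(uuid.uuid4())
        children = []
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "child_id": str(uuid.uuid4()),
                "parent_id": parent_id,
                "content": " ".join(parent_words[c_start:c_start + child_size]),
            })
        records.append({
            "parent_id": parent_id,
            "parent_content": " ".join(parent_words),
            "children": children,
        })
    return records
```

Only the child chunks get embedded; the parent rows are stored as plain text and fetched by `parent_id` at query time.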
Chunking Strategy Decision Matrix
| Strategy | Best For | Chunk Quality | Implementation Cost | Embedding Cost |
|---|---|---|---|---|
| Fixed-size | Prototypes, uniform text | Low | Minimal | Low |
| Sentence-based | Articles, prose | Medium | Low | Low |
| Semantic | Mixed-topic documents | High | Medium | High (per-sentence) |
| Document-aware | Structured docs (Markdown, HTML) | High | Medium | Low |
| Parent-child | Complex docs needing context | Highest | High | Medium |
For most production systems, document-aware chunking with a parent-child expansion strategy offers the best trade-off.
Re-Ranking: Pushing the Best Results to the Top
Vector similarity gives you a rough ordering. Re-ranking refines it. A re-ranker takes the query and each candidate chunk as a pair and scores them for relevance --- typically using a cross-encoder model that sees both texts simultaneously.
Why Re-Ranking Matters
Bi-encoder search (the standard embedding approach) encodes query and document independently. It cannot capture fine-grained interactions between query terms and document terms. A cross-encoder processes both together, enabling attention across the query-document pair.
Bi-encoder (embedding search):

```
Query → Encoder → query_vec ─┐
                             ├→ cosine_sim → score
Doc   → Encoder → doc_vec ───┘
```

Cross-encoder (re-ranking):

```
[Query + Doc] → Encoder → relevance_score
```
The cross-encoder is more accurate but also more expensive --- you cannot run it over millions of documents. The standard pattern is a two-stage pipeline:
- Stage 1 (Retrieval): Fast bi-encoder search returns top-50 to top-100 candidates.
- Stage 2 (Re-ranking): Cross-encoder scores and re-orders the candidates, returning top-5 to top-10.
Implementation with Cohere Rerank
```typescript
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

async function retrieveAndRerank(
  query: string,
  topK: number = 5
): Promise<Array<{ content: string; relevanceScore: number }>> {
  // Stage 1: Retrieve top-50 candidates via vector search
  const candidates = await vectorSearch(query, 50);
  // Stage 2: Re-rank with Cohere
  const reranked = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: candidates.map((c) => ({ text: c.content })),
    topN: topK,
  });
  return reranked.results.map((r) => ({
    content: candidates[r.index].content,
    relevanceScore: r.relevanceScore,
  }));
}
```
ColBERT: Late Interaction Re-Ranking
ColBERT uses a "late interaction" approach: it encodes query and document separately (like a bi-encoder) but compares them at the token level rather than the vector level. Each query token attends to the most similar document token via MaxSim.
```python
# ColBERT-style scoring (simplified)
import torch

def colbert_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> float:
    """
    query_embeddings: (num_query_tokens, dim)
    doc_embeddings:   (num_doc_tokens, dim)
    """
    # For each query token, find max similarity with any doc token
    similarity_matrix = query_embeddings @ doc_embeddings.T  # (Q, D)
    max_sim_per_query_token = similarity_matrix.max(dim=1).values  # (Q,)
    return max_sim_per_query_token.sum().item()
```
ColBERT is faster than a full cross-encoder because document embeddings can be precomputed and indexed. It is a strong middle ground when you need re-ranking quality without the latency of running a cross-encoder on every candidate.
When to Add Re-Ranking
| Signal | Action |
|---|---|
| Retrieval recall is high but precision is low | Add re-ranking --- you are finding relevant docs but not surfacing them first |
| Retrieval recall is low | Fix chunking and retrieval first --- re-ranking cannot surface docs you did not retrieve |
| Latency budget is tight (<500ms total) | Use ColBERT or skip re-ranking; cross-encoders add 100--300ms |
| Knowledge base is small (<500 docs) | Re-ranking may not be necessary; top-K from vector search is likely sufficient |
Query Transformation
Sometimes the problem is not retrieval but the query itself. Users ask vague, under-specified, or poorly phrased questions. Query transformation techniques rewrite the query before retrieval to improve recall.
HyDE (Hypothetical Document Embedding)
Instead of embedding the raw query, ask the LLM to generate a hypothetical answer, then embed that answer and use it for retrieval. The hypothesis is closer in embedding space to actual relevant documents than the raw question.
```typescript
async function hydeRetrieval(query: string, topK: number = 5): Promise<string[]> {
  // Step 1: Generate a hypothetical answer
  const hypothetical = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Given a question, write a short paragraph that would be a good
answer to it. Do not say "I don't know." Write as if you are a knowledgeable
support article, even if you have to guess.`,
      },
      { role: "user", content: query },
    ],
  });
  // Step 2: Embed the hypothetical answer (not the original query)
  const hydeEmbedding = await embed(hypothetical.content);
  // Step 3: Search with the hypothetical embedding
  return vectorSearch(hydeEmbedding, topK);
}
```
When HyDE helps: Vague queries like "it's not working" where the raw embedding is too generic. The hypothetical answer adds specificity.
When HyDE hurts: Specific, well-formed queries where the hypothesis might hallucinate and drift from the actual intent.
Multi-Query Expansion
Generate multiple reformulations of the query, retrieve with each, and merge the results. This casts a wider net.
```typescript
async function multiQueryRetrieval(
  query: string,
  topK: number = 5
): Promise<string[]> {
  // Generate 3 reformulations
  const expansions = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Generate 3 alternative phrasings of the following question.
Return a JSON object of the form {"queries": ["...", "...", "..."]}.
Each rephrasing should approach the question from a different angle.`,
      },
      { role: "user", content: query },
    ],
    response_format: { type: "json_object" },
  });
  const queries = [query, ...JSON.parse(expansions.content).queries];
  // Retrieve with each query
  const allResults = await Promise.all(
    queries.map((q) => vectorSearch(q, topK * 2))
  );
  // Merge with Reciprocal Rank Fusion
  return reciprocalRankFusion(allResults, topK);
}
```
Step-Back Prompting
For specific queries, generate a more general "step-back" question that retrieves broader context. Then retrieve with both the original and step-back query.
```typescript
// Original:  "Why does my webhook return 403 when using OAuth?"
// Step-back: "How does webhook authentication work?"
async function stepBackRetrieval(query: string, topK: number = 5): Promise<string[]> {
  const stepBack = await llm.chat({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: `Given a specific question, generate a more general question that
would retrieve useful background context. Return only the question.`,
      },
      { role: "user", content: query },
    ],
  });
  const [specificResults, broadResults] = await Promise.all([
    vectorSearch(query, topK),
    vectorSearch(stepBack.content, topK),
  ]);
  // Combine: prioritize specific results, supplement with broad context
  return deduplicateAndMerge(specificResults, broadResults, topK);
}
```
Hybrid Retrieval Pipeline
The highest-performing retrieval systems combine multiple search methods and merge results. Here is a production pipeline combining BM25 keyword search, dense vector search, and metadata filtering:
```typescript
interface RetrievalResult {
  chunkId: string;
  content: string;
  score: number;
  source: "bm25" | "vector" | "metadata";
}

async function hybridRetrieval(
  query: string,
  filters?: { category?: string; updatedAfter?: string },
  topK: number = 10
): Promise<RetrievalResult[]> {
  // Run all three retrieval methods in parallel
  const [bm25Results, vectorResults, metadataResults] = await Promise.all([
    // BM25 keyword search (handles exact matches, error codes, product names)
    bm25Search(query, topK * 3),
    // Dense vector search (handles semantic similarity)
    vectorSearch(query, topK * 3),
    // Metadata-filtered search (narrows by category, date, etc.)
    filters ? metadataFilteredSearch(query, filters, topK * 2) : Promise.resolve([]),
  ]);
  // Merge with Reciprocal Rank Fusion
  return reciprocalRankFusion(
    [bm25Results, vectorResults, metadataResults],
    topK
  );
}

function reciprocalRankFusion(
  resultSets: RetrievalResult[][],
  topK: number,
  k: number = 60 // RRF constant
): RetrievalResult[] {
  const scores = new Map<string, { score: number; result: RetrievalResult }>();
  for (const results of resultSets) {
    for (let rank = 0; rank < results.length; rank++) {
      const result = results[rank];
      const rrfScore = 1 / (k + rank + 1);
      const existing = scores.get(result.chunkId);
      if (existing) {
        existing.score += rrfScore;
      } else {
        scores.set(result.chunkId, { score: rrfScore, result });
      }
    }
  }
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((s) => ({ ...s.result, score: s.score }));
}
```
Full Pipeline: Retrieve, Fuse, Re-Rank, Generate
Putting it all together:
```typescript
async function advancedRAGPipeline(
  query: string,
  conversationHistory: Message[]
): Promise<string> {
  // 1. Query transformation: expand into multiple queries
  const expandedQueries = await expandQuery(query);
  // 2. Hybrid retrieval for each expanded query
  const allResults = await Promise.all(
    expandedQueries.map((q) => hybridRetrieval(q, undefined, 20))
  );
  // 3. Fuse results across all queries
  const fused = reciprocalRankFusion(allResults, 20);
  // 4. Re-rank with cross-encoder
  const reranked = await cohereRerank(query, fused, 5);
  // 5. Assemble context and generate
  const context = reranked.map((r) => r.content).join("\n\n---\n\n");
  const response = await llm.chat({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Answer the user's question using ONLY the provided context.
If the context does not contain enough information, say so.
Do not make up information.

Context:
${context}`,
      },
      ...conversationHistory,
      { role: "user", content: query },
    ],
  });
  return response.content;
}
```
Evaluation Metrics
You cannot optimize what you do not measure. Here are the metrics that matter for RAG retrieval quality:
MRR (Mean Reciprocal Rank)
The average of 1/rank for the first relevant result across queries. MRR = 1.0 means the correct answer is always first.
```python
def mean_reciprocal_rank(queries: list[dict]) -> float:
    """
    queries: [{ "query": str, "retrieved": [str], "relevant": set[str] }]
    """
    reciprocal_ranks = []
    for q in queries:
        for rank, doc_id in enumerate(q["retrieved"], 1):
            if doc_id in q["relevant"]:
                reciprocal_ranks.append(1.0 / rank)
                break
        else:
            reciprocal_ranks.append(0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
NDCG (Normalized Discounted Cumulative Gain)
Accounts for the position of all relevant results, not just the first. Higher-ranked relevant results contribute more to the score.
```python
import numpy as np

def ndcg_at_k(retrieved: list[str], relevant: dict[str, int], k: int) -> float:
    """
    relevant: { doc_id: relevance_grade } where grade is 0, 1, 2, 3
    """
    dcg = 0.0
    for i, doc_id in enumerate(retrieved[:k]):
        rel = relevant.get(doc_id, 0)
        dcg += (2**rel - 1) / np.log2(i + 2)  # i+2 because log2(1) = 0
    # Ideal DCG: sort by relevance grade descending
    ideal_rels = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum((2**rel - 1) / np.log2(i + 2) for i, rel in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0
```
Recall@K
What fraction of all relevant documents appear in the top-K results. Critical for ensuring you do not miss important context.
```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    retrieved_set = set(retrieved[:k])
    return len(retrieved_set & relevant) / len(relevant) if relevant else 0.0
```
Practical Benchmarks
| Metric | Baseline (Basic RAG) | +Semantic Chunking | +Hybrid Retrieval | +Re-ranking | +Query Expansion |
|---|---|---|---|---|---|
| MRR | 0.52 | 0.61 | 0.71 | 0.82 | 0.85 |
| NDCG@5 | 0.45 | 0.54 | 0.65 | 0.76 | 0.79 |
| Recall@10 | 0.60 | 0.68 | 0.79 | 0.82 | 0.88 |
Each optimization layer compounds. The full pipeline typically achieves 60--80% improvement over basic RAG in retrieval quality.
Production Checklist
Caching
- Embedding cache: Hash the input text and cache embeddings. Same query should not be embedded twice.
- Result cache: Cache retrieval results for frequent queries (TTL of 5--15 minutes depending on update frequency).
- LLM response cache: For identical query + context combinations, return cached responses.
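A minimal sketch of the embedding cache, keyed by a SHA-256 hash of the normalized input text. The in-process dict and the injected `embed_fn` are illustrative; production systems typically back this with Redis and wrap the real embeddings API call:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings keyed by a hash of the normalized input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # e.g. a call to your embeddings API
        self.store: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> list[float]:
        # Normalize before hashing so trivial variants share an entry
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
            return self.store[key]
        self.misses += 1
        embedding = self.embed_fn(text)
        self.store[key] = embedding
        return embedding
```

Normalizing (strip, lowercase) before hashing means "Hello" and "hello " hit the same entry; whether that is safe depends on whether your embedding model is case-sensitive for your domain.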
Batching
- Batch embedding requests: most APIs support up to 2,048 inputs per call. Group chunks during indexing.
- Batch re-ranking: send all candidates in a single API call rather than one at a time.
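The batching rule above can be sketched as a simple grouper. The 2,048 figure matches the per-call input limit of some embedding APIs at the time of writing; check your provider's documented limit:

```python
def batch_inputs(items: list[str], max_batch_size: int = 2048) -> list[list[str]]:
    """Group inputs into batches no larger than the API's per-call limit."""
    return [items[i:i + max_batch_size]
            for i in range(0, len(items), max_batch_size)]
```

During indexing, iterate over the batches and issue one embeddings call per batch instead of one per chunk.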
Latency Budgets
| Stage | Budget | Typical |
|---|---|---|
| Query expansion | 150ms | 80--120ms |
| Hybrid retrieval (parallel) | 100ms | 30--80ms |
| RRF fusion | 5ms | 1--3ms |
| Re-ranking | 200ms | 100--250ms |
| LLM generation | 2000ms | 800--2500ms |
| Total | 2500ms | 1000--3000ms |
Retrieval should be the fast part. If retrieval exceeds 300ms, profile your vector index configuration and database connection pooling.
Monitoring
Track these in production:
- Retrieval latency P50/P95/P99 per stage
- Empty result rate: queries that return zero relevant chunks after re-ranking
- Re-ranker score distribution: if scores cluster near zero, your retrieval is returning poor candidates
- Cache hit rate: target >30% for embedding cache, >10% for result cache
- Token usage: track tokens consumed by query expansion, context assembly, and generation separately
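A minimal sketch of per-stage latency percentiles from raw samples, using the nearest-rank method. A production system would use streaming histograms or a metrics backend (e.g. Prometheus) rather than sorting raw samples:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_summary(samples: list[float]) -> dict[str, float]:
    """P50/P95/P99 summary for one pipeline stage."""
    return {name: percentile(samples, q)
            for name, q in [("p50", 50), ("p95", 95), ("p99", 99)]}
```

Track one summary per stage (expansion, retrieval, fusion, re-ranking, generation) so a regression in any one stage is visible against its budget from the table above.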
Key Takeaways
- Chunking is the foundation. No amount of re-ranking fixes a chunk that lost its meaning at a boundary. Use document-aware or semantic chunking for production systems.
- Hybrid retrieval is non-negotiable at scale. BM25 handles exact matches (error codes, names); vectors handle semantics. Combine with Reciprocal Rank Fusion.
- Re-ranking is the highest-ROI optimization. A cross-encoder re-ranker on top-50 candidates typically improves MRR by 15--20 points.
- Transform the query, not just the retrieval. HyDE, multi-query expansion, and step-back prompting improve recall on ambiguous queries.
- Measure relentlessly. MRR, NDCG, and Recall@K on a labeled eval set are the only reliable signals. Vibes-based evaluation does not scale.
When advanced RAG is the wrong investment
- Knowledge bases under a few hundred well-structured pages where naive top-K retrieval already produces grounded answers
- Use cases dominated by structured queries (order status, account balance) where SQL or API calls beat any retrieval pipeline
- Teams without an evaluation set, since chunking and re-ranking changes need measurable wins to justify the added complexity
- Latency-sensitive voice or chat flows where a re-ranker plus query rewrite blows the turn budget
- Static FAQ-style corpora where periodic full reindexing with simple chunks is cheaper than a sophisticated pipeline
- Early stage products that have not yet shipped a basic RAG baseline to real users
Frequently Asked Questions
What is the difference between basic RAG and advanced RAG?
Basic RAG uses fixed-size chunking, single-vector retrieval, and top-K ranking by cosine similarity. Advanced RAG adds semantic or document-aware chunking, hybrid retrieval (BM25 + dense vectors), cross-encoder re-ranking, and query transformation techniques. The difference in practice is significant: advanced RAG typically achieves 60--80% higher retrieval quality (MRR, NDCG) on real-world knowledge bases compared to basic RAG.
Which chunking strategy should I use?
For structured content like documentation or help articles, use document-aware chunking that respects headings, sections, and code blocks. For unstructured content like emails or chat logs, use semantic chunking based on embedding similarity between consecutive sentences. If you need both precise retrieval and rich context for the LLM, add a parent-child layer where you retrieve on small child chunks but expand to larger parent chunks.
When should I add a re-ranker to my RAG pipeline?
Add a re-ranker when your retrieval recall is high but precision is low --- meaning relevant documents are in your candidate set but not ranked at the top. If your vector search returns 50 candidates and the correct answer is usually somewhere in the set but not in the top 5, a cross-encoder re-ranker will dramatically improve results. Skip re-ranking if your knowledge base is small (<500 documents) or your latency budget is very tight.
How do I evaluate my RAG pipeline?
Build a labeled evaluation set of at least 100 query-document pairs where humans have marked which documents are relevant to each query. Compute MRR (how quickly you find the first relevant result), NDCG@K (quality of the full ranking), and Recall@K (what fraction of relevant documents you retrieve). Run these metrics after each change to your pipeline --- chunking, retrieval, or re-ranking --- to measure the impact.
What is Reciprocal Rank Fusion and why is it used in hybrid retrieval?
Reciprocal Rank Fusion (RRF) merges ranked lists from different retrieval methods (e.g., BM25 and vector search) into a single ranking. For each document, RRF computes a score of 1/(k + rank) from each list and sums them. Documents that rank highly in multiple lists get boosted. RRF is preferred over simple score normalization because scores from different retrieval systems are not comparable (BM25 scores and cosine similarities have different scales and distributions). RRF only uses rank positions, making it robust across any combination of retrieval methods.
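A worked example of the formula, with the conventional k = 60 and two hypothetical ranked lists:

```python
def rrf(rank_lists: list[list[str]], k: int = 60) -> list[str]:
    """Sum 1/(k + rank) across lists; rank is 1-based."""
    scores: dict[str, float] = {}
    for ranking in rank_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2"]
dense = ["d2", "d3"]
# "d2" scores 1/62 + 1/61 from appearing in both lists, so it outranks
# "d1" (1/61) despite never being ranked first
rrf([bm25, dense])  # → ["d2", "d1", "d3"]
```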