Hybrid Search Explained: Best of Both Worlds
Why combining semantic search with keyword matching gives the most accurate results for AI agents. A deep dive into implementation and optimization.

If you've built a RAG (Retrieval-Augmented Generation) system, you've faced the fundamental search dilemma: semantic search or keyword search? After extensive testing and production experience at Chatsy, we've learned that the answer is definitively both. Hybrid search combines the strengths of each approach while minimizing their weaknesses.
TL;DR:
- Hybrid search combines semantic (vector) search with keyword (BM25) matching to overcome the weaknesses of each approach used alone.
- Reciprocal Rank Fusion (RRF) merges results from both methods, and an optional AI reranking step further boosts relevance.
- In benchmarks, hybrid search improved precision by 17% and recall by 14% over semantic-only search, with minimal latency overhead.
- Dynamic weighting lets you shift the balance toward keywords for exact-match queries or toward semantics for conceptual questions.
In this technical deep dive, we'll explore why pure approaches fall short, how hybrid search works, and how to implement it effectively in your own AI applications.
The Problem with Pure Approaches
Before understanding why hybrid search works so well, let's examine why neither pure semantic search nor pure keyword search is sufficient on its own.
Semantic Search: Powerful but Imprecise
Semantic search uses embeddings to find conceptually similar content. When a user types a query, the system converts it to a vector representation, then finds documents with similar vector representations. This approach excels at understanding meaning and intent. For a deeper look at how embeddings work under the hood, see our vector search explainer.
Where semantic search shines:
- Finding paraphrases: "cancel subscription" matches "terminate membership" because they mean the same thing
- Understanding intent: "I'm unhappy with the service" correctly surfaces content about complaints and refunds
- Cross-lingual matching: Queries in one language can match documents in another
- Handling misspellings: Minor typos often still produce correct embeddings
Where semantic search fails:
- Exact matches: Searching for "error code E-1234" might return general error handling content rather than documentation for that specific code. The model sees "error" and finds similar concepts, missing the crucial identifier.
- Proper nouns: "John Smith account issue" might match any content about account issues with people, not specifically John Smith's account.
- Technical identifiers: "OAuth2", "JWT", "RSA-256" might get conflated with general authentication concepts rather than matching the specific technologies.
- Rare terms: Highly specialized vocabulary may not have strong embedding representations, leading to poor matches.
These failures aren't bugs---they're inherent to how embedding models work. They're trained to capture semantic similarity, not exact matching.
Keyword Search: Precise but Literal
Keyword search (TF-IDF, BM25) takes the opposite approach. It looks for exact term matches, weighing results by term frequency and document frequency.
Where keyword search excels:
- Exact term matching: "E-1234" finds exactly that term
- Rare word importance: Uncommon terms get high weight, which is useful for technical queries
- Speed and simplicity: No embedding generation required
- Predictability: Results are deterministic and explainable
Where keyword search fails:
- Synonyms: "cancel" won't match "terminate" unless both appear in the same document
- Typos: "refnd" won't match "refund"
- Linguistic variations: "running" might not match "run" without stemming
- Context blindness: "apple" matches both fruit and technology company content equally
These limitations make pure keyword search frustrating for users who don't know the exact terminology your documentation uses.
How BM25 Scoring Works
Since BM25 is the keyword search component in most hybrid search implementations, it is worth understanding how it ranks documents. BM25 (Best Matching 25) is a probabilistic ranking function that improves on basic TF-IDF by adding two key refinements: term frequency saturation and document length normalization.
The BM25 Formula
For a query Q containing terms q1, q2, ..., qn, the BM25 score for a document D is:
BM25(D, Q) = SUM over each term qi of:
IDF(qi) * (tf(qi, D) * (k1 + 1)) / (tf(qi, D) + k1 * (1 - b + b * |D| / avgdl))
Where:
- tf(qi, D) = how many times term qi appears in document D
- |D| = length of document D (in words)
- avgdl = average document length across the corpus
- k1 = term frequency saturation parameter (typically 1.2)
- b = document length normalization parameter (typically 0.75)
- IDF(qi) = inverse document frequency, measuring how rare the term is across all documents
What Each Component Does
IDF (Inverse Document Frequency) gives more weight to rare terms. If only 5 out of 10,000 documents contain "pgvector," that term is highly discriminative and gets a high IDF score. Common words like "the" appear in nearly every document and get a near-zero IDF.
IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)
Where N is the total number of documents and n(qi) is the number of documents containing the term.
Term frequency saturation (controlled by k1) means the first occurrence of a term matters most. A document mentioning "refund" 10 times is not 10x more relevant than one mentioning it once --- the score curve flattens. With the default k1=1.2 and an average-length document, going from 1 to 2 occurrences increases the term's contribution by about 38%, but going from 9 to 10 adds barely 1%.
Document length normalization (controlled by b) penalizes long documents that match simply because they contain more words. With b=0.75, a document twice the average length needs roughly 1.5x the term frequency to score the same as an average-length document.
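The saturation effect is easy to verify numerically. The sketch below isolates the term-frequency component for a document of exactly average length, where the length-normalization factor is 1 and drops out:

```python
def tf_component(tf: float, k1: float = 1.2) -> float:
    """BM25 term-frequency component at average document length (|D| = avgdl)."""
    return tf * (k1 + 1) / (tf + k1)

# The curve flattens quickly: the first occurrence contributes the most.
gain_1_to_2 = tf_component(2) / tf_component(1) - 1   # ~0.375, i.e. about 38%
gain_9_to_10 = tf_component(10) / tf_component(9) - 1 # ~0.012, i.e. about 1%
```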
A Worked Example
Suppose we have a corpus of 1,000 documents with an average length of 200 words. The user searches for "cancel subscription".
Document A: 180 words, contains "cancel" 2 times and "subscription" 3 times. Document B: 400 words, contains "cancel" 4 times and "subscription" 5 times.
For the term "cancel" (appears in 50 documents):
IDF("cancel") = ln((1000 - 50 + 0.5) / (50 + 0.5) + 1) = ln(19.82) = 2.99
-- Document A (180 words):
tf_component_A = (2 * 2.2) / (2 + 1.2 * (1 - 0.75 + 0.75 * 180/200))
= 4.4 / (2 + 1.2 * 0.925)
= 4.4 / 3.11
= 1.41
-- Document B (400 words):
tf_component_B = (4 * 2.2) / (4 + 1.2 * (1 - 0.75 + 0.75 * 400/200))
= 8.8 / (4 + 1.2 * 1.75)
= 8.8 / 6.1
= 1.44
Despite Document B containing "cancel" twice as often, its BM25 contribution for that term is nearly the same as Document A because of term saturation and the penalty for being a longer document. This behavior is precisely what makes BM25 effective --- it prevents long documents from dominating results simply by mentioning terms more frequently.
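The worked example above can be reproduced in a few lines. This sketch computes the IDF and per-document term-frequency components for "cancel" using the corpus statistics from the example (1,000 documents, 50 containing the term, average length 200 words):

```python
import math

k1, b = 1.2, 0.75
N, df, avgdl = 1000, 50, 200  # corpus size, docs containing "cancel", avg doc length

# IDF for "cancel": rare-ish term, so a meaningful weight
idf = math.log((N - df + 0.5) / (df + 0.5) + 1)  # ~2.99

def tf_component(tf: int, doc_len: int) -> float:
    """Term-frequency component with document length normalization."""
    norm = 1 - b + b * doc_len / avgdl
    return tf * (k1 + 1) / (tf + k1 * norm)

tf_a = tf_component(tf=2, doc_len=180)  # Document A: ~1.41
tf_b = tf_component(tf=4, doc_len=400)  # Document B: ~1.44 despite 2x the occurrences
```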
The Hybrid Approach: Getting the Best of Both Worlds
Hybrid search runs both semantic and keyword searches in parallel, then combines the results intelligently. This approach captures the conceptual understanding of semantic search while maintaining the precision of keyword matching.
Here's the flow:
```
User Query: "How to cancel my Pro subscription?"
                   |
       +-----------+-----------+
       v                       v
 Semantic Search        Keyword Search
       |                       |
 Finds content          Finds content
 about canceling        with exact terms
 and subscriptions      "Pro", "cancel"
       |                       |
       +-----------+-----------+
                   v
            Combine Results
      (Reciprocal Rank Fusion)
                   |
                   v
            Re-rank with AI
                   |
                   v
             Final Results
```
The key insight is that documents appearing in both result sets are likely highly relevant. A document that's semantically similar to the query AND contains the exact keywords is almost certainly what the user wants.
How Reciprocal Rank Fusion Works
Reciprocal Rank Fusion (RRF) is the algorithm that makes hybrid search practical. Introduced by Cormack, Clarke, and Buettcher in 2009, RRF solves a fundamental problem: how do you combine ranked lists from different scoring systems that use incompatible scales?
Why Not Just Normalize Scores?
The naive approach is to normalize scores from each system to a 0-1 range and combine them. This fails for several reasons:
- Score distributions differ: BM25 scores might range from 0 to 25, while cosine similarity ranges from -1 to 1. Min-max normalization distorts the relative gaps between results.
- Score meaning differs: A BM25 score of 15 vs. 14 might indicate a trivial difference, while cosine similarity of 0.95 vs. 0.85 could be significant.
- Outliers skew normalization: One very high-scoring document in the keyword results can compress all other scores toward zero.
RRF sidesteps all of these problems by ignoring raw scores entirely and working only with rank positions.
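The outlier problem is easy to demonstrate. Here is a quick sketch with made-up BM25 scores in which a single high-scoring document squashes every other result toward zero after min-max normalization:

```python
def min_max(scores: list[float]) -> list[float]:
    """Naive min-max normalization to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

# One outlier compresses the three middle scores to nearly indistinguishable values
bm25 = [42.0, 15.0, 14.5, 14.0, 13.5]
normalized = min_max(bm25)  # everything after the outlier lands below 0.06
```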
The RRF Formula
For each document d, the RRF score is:
RRF_score(d) = SUM over each ranked list r of:
1 / (k + rank_r(d))
Where rank_r(d) is the position of document d in ranked list r (1-indexed), and k is a constant (typically 60) that controls how quickly the score decays with rank.
Step-by-Step Worked Example
Let's trace through RRF with real data. A user queries "cancel Pro plan" and we get results from two systems.
Semantic search results (ranked by cosine similarity):
| Rank | Document | Cosine Similarity |
|---|---|---|
| 1 | Doc C: "Terminating your membership" | 0.94 |
| 2 | Doc A: "How to cancel subscription" | 0.91 |
| 3 | Doc F: "Ending your plan early" | 0.87 |
| 4 | Doc D: "Pro plan features and pricing" | 0.82 |
| 5 | Doc B: "Refund policy" | 0.78 |
Keyword search results (ranked by BM25 score):
| Rank | Document | BM25 Score |
|---|---|---|
| 1 | Doc A: "How to cancel subscription" | 18.5 |
| 2 | Doc E: "Pro plan cancellation steps" | 16.2 |
| 3 | Doc D: "Pro plan features and pricing" | 14.8 |
| 4 | Doc B: "Refund policy" | 11.3 |
| 5 | Doc G: "Plan comparison table" | 9.7 |
Step 1: Compute RRF scores (k=60)
For each document, sum 1 / (60 + rank) across both lists. If a document does not appear in a list, it contributes 0.
Doc A: 1/(60+2) + 1/(60+1) = 0.01613 + 0.01639 = 0.03252
Doc B: 1/(60+5) + 1/(60+4) = 0.01538 + 0.01563 = 0.03101
Doc C: 1/(60+1) + 0 = 0.01639 + 0 = 0.01639
Doc D: 1/(60+4) + 1/(60+3) = 0.01563 + 0.01587 = 0.03150
Doc E: 0 + 1/(60+2) = 0 + 0.01613 = 0.01613
Doc F: 1/(60+3) + 0 = 0.01587 + 0 = 0.01587
Doc G: 0 + 1/(60+5) = 0 + 0.01538 = 0.01538
Step 2: Sort by RRF score
| Final Rank | Document | RRF Score | Appeared In |
|---|---|---|---|
| 1 | Doc A: "How to cancel subscription" | 0.03252 | Both |
| 2 | Doc D: "Pro plan features and pricing" | 0.03150 | Both |
| 3 | Doc B: "Refund policy" | 0.03101 | Both |
| 4 | Doc C: "Terminating your membership" | 0.01639 | Semantic only |
| 5 | Doc E: "Pro plan cancellation steps" | 0.01613 | Keyword only |
| 6 | Doc F: "Ending your plan early" | 0.01587 | Semantic only |
| 7 | Doc G: "Plan comparison table" | 0.01538 | Keyword only |
Notice what happened: documents that appeared in both lists (A, D, B) rose to the top, while documents from only one list (C, E, F, G) ranked lower. Doc A was the top keyword result and second-best semantic result, making it the clear winner. Doc C, despite being the top semantic result, dropped to rank 4 because it had no keyword match for "Pro" or "cancel."
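The whole worked example reduces to a few lines of code. This sketch fuses the two ranked lists of document IDs from the tables above and reproduces the final ordering and scores:

```python
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of doc IDs (best first) via Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for docs in ranked_lists:
        for rank, doc in enumerate(docs, start=1):  # ranks are 1-indexed
            scores[doc] = scores.get(doc, 0.0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

semantic = ["C", "A", "F", "D", "B"]  # top 5 by cosine similarity
keyword  = ["A", "E", "D", "B", "G"]  # top 5 by BM25

fused = rrf_fuse([semantic, keyword])
# Documents in both lists (A, D, B) rise to the top; A scores 1/61 + 1/62 ~ 0.03252
```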
Why k=60?
The constant k controls the curve shape. With a smaller k (say 1), the rank-1 result gets an outsized score (1/2 = 0.5) and lower ranks contribute almost nothing. With k=60, the scores decay more gradually: rank 1 gets 0.01639, rank 10 gets 0.01429 --- still meaningful. This gentler curve means more results contribute to the final ranking, which is important when combining lists that may not agree on top results.
The original paper found k=60 performed well across diverse datasets. In practice, values between 20 and 100 work similarly. Tune it on your evaluation set if you want marginal gains.
Implementation Guide
Let's walk through a complete implementation of hybrid search suitable for production use.
Step 1: Dual Search Execution
Run both searches in parallel for optimal performance:
```typescript
async function hybridSearch(query: string, chatbotId: string) {
  // Generate embedding for semantic search
  const queryEmbedding = await embed(query);

  // Execute both searches in parallel
  const [semanticResults, keywordResults] = await Promise.all([
    // Semantic search with pgvector
    prisma.$queryRaw`
      SELECT id, content,
             embedding <=> ${queryEmbedding}::vector AS semantic_distance
      FROM chunks
      WHERE chatbot_id = ${chatbotId}
      ORDER BY semantic_distance
      LIMIT 20
    `,
    // Keyword search with PostgreSQL full-text search
    prisma.$queryRaw`
      SELECT id, content,
             ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', ${query})) AS keyword_score
      FROM chunks
      WHERE chatbot_id = ${chatbotId}
        AND to_tsvector('english', content) @@ plainto_tsquery('english', ${query})
      ORDER BY keyword_score DESC
      LIMIT 20
    `
  ]);

  return { semanticResults, keywordResults };
}
```
Note that semantic search returns a distance (lower is better) while keyword search returns a score (higher is better). RRF only needs each list ordered best-first, so no score normalization is required before fusion.
Step 2: Reciprocal Rank Fusion (RRF)
RRF is an elegant algorithm for combining ranked lists from different sources. It doesn't require score normalization because it works purely with rankings:
```typescript
function reciprocalRankFusion(
  resultSets: { id: string; score: number }[][],
  k: number = 60
): { id: string; score: number }[] {
  const scores = new Map<string, number>();

  // Process each result set
  resultSets.forEach(resultSet => {
    resultSet.forEach((result, rank) => {
      // RRF formula: 1 / (k + rank + 1); rank is 0-indexed here,
      // so this equals 1 / (k + position) with 1-indexed positions
      const rrfScore = 1 / (k + rank + 1);
      scores.set(result.id, (scores.get(result.id) || 0) + rrfScore);
    });
  });

  // Convert to sorted array, highest fused score first
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```
The k parameter (typically 60) controls how quickly scores decay with rank. Higher k values give more weight to lower-ranked results.
Step 3: AI Re-ranking (Optional but Powerful)
For the highest accuracy, re-rank the top results with a cross-encoder model or dedicated reranking API:
```typescript
async function rerank(
  query: string,
  documents: { id: string; content: string }[]
): Promise<{ id: string; score: number }[]> {
  // Use Cohere's reranking API
  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents: documents.map(d => d.content),
    topN: 10
  });

  return response.results.map(r => ({
    id: documents[r.index].id,
    score: r.relevance_score
  }));
}
```
Re-ranking adds latency (typically 100-200ms) but significantly improves relevance, especially for ambiguous queries.
Simplified Python Implementation
For teams prototyping or working in Python, here is a self-contained implementation:
```python
from openai import OpenAI
import psycopg2

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def hybrid_search(query: str, chatbot_id: str, conn, k_rrf: int = 60):
    """Run hybrid search with RRF fusion."""
    query_embedding = embed(query)
    cur = conn.cursor()

    # Semantic search (pgvector cosine distance)
    cur.execute("""
        SELECT id, content, embedding <=> %s::vector AS distance
        FROM chunks
        WHERE chatbot_id = %s
        ORDER BY distance ASC
        LIMIT 20
    """, (query_embedding, chatbot_id))
    semantic_results = cur.fetchall()

    # Keyword search (PostgreSQL full-text search; ts_rank plays the BM25 role)
    cur.execute("""
        SELECT id, content,
               ts_rank(to_tsvector('english', content),
                       plainto_tsquery('english', %s)) AS score
        FROM chunks
        WHERE chatbot_id = %s
          AND to_tsvector('english', content) @@ plainto_tsquery('english', %s)
        ORDER BY score DESC
        LIMIT 20
    """, (query, chatbot_id, query))
    keyword_results = cur.fetchall()

    # Reciprocal Rank Fusion over both ranked lists
    rrf_scores = {}
    for rank, (doc_id, content, _) in enumerate(semantic_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k_rrf + rank + 1)
    for rank, (doc_id, content, _) in enumerate(keyword_results):
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k_rrf + rank + 1)

    # Sort by fused score and return top results
    ranked = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return ranked[:10]
```
Complete Pipeline (TypeScript)
Putting it all together:
```typescript
async function search(query: string, chatbotId: string, topK: number = 5) {
  // 1. Get results from both search methods
  const { semanticResults, keywordResults } = await hybridSearch(query, chatbotId);

  // 2. Prepare ranked lists for fusion (RRF uses only rank order,
  //    so the score values themselves are never compared across lists)
  const normalizedSemantic = semanticResults.map(r => ({
    id: r.id,
    score: 1 - r.semantic_distance // Convert distance to similarity
  }));
  const normalizedKeyword = keywordResults.map(r => ({
    id: r.id,
    score: r.keyword_score
  }));

  // 3. Apply RRF
  const fusedResults = reciprocalRankFusion([normalizedSemantic, normalizedKeyword]);

  // 4. Fetch full documents for top candidates
  const candidateIds = fusedResults.slice(0, 20).map(r => r.id);
  const documents = await prisma.chunk.findMany({
    where: { id: { in: candidateIds } },
    select: { id: true, content: true }
  });

  // 5. Re-rank with AI
  const rerankedResults = await rerank(query, documents);

  // 6. Return top K
  return rerankedResults.slice(0, topK);
}
```
When to Use Hybrid Search vs Pure Vector Search
Not every application needs hybrid search. Here is a decision framework to help you choose the right approach.
Use Pure Vector Search When
- Your queries are conversational and conceptual. Users ask questions like "how does authentication work?" or "explain the refund process." These queries rarely contain specific identifiers.
- Your corpus is homogeneous. If all documents cover similar topics and are of similar length, semantic search alone discriminates well.
- Latency budget is extremely tight. Eliminating the keyword search branch saves 5-15ms, which matters in sub-50ms SLA environments.
- You have no exact-match requirements. If your domain never involves error codes, product SKUs, account numbers, or technical identifiers, keyword search adds little value.
Use Hybrid Search When
- Users mix natural language with specific identifiers. "How do I fix error JWT-401 in the Pro plan?" contains both a conceptual question and exact terms. Vector search alone would likely miss the specificity of "JWT-401" and "Pro."
- Your corpus contains technical documentation. API references, error code databases, and configuration guides all contain terms that must match exactly.
- You need high recall. Hybrid search surfaces documents that either approach alone would miss. If a relevant document does not embed well (rare terms, specialized jargon), keyword search can still find it.
- You serve diverse query types. A customer support chatbot receives everything from "I'm confused about billing" to "invoice #INV-2024-0847." Hybrid search handles both without query classification.
- Accuracy matters more than minimal latency. The 10ms overhead of parallel keyword search is negligible compared to the accuracy improvement.
Decision Matrix
| Factor | Pure Vector | Hybrid | Hybrid + Rerank |
|---|---|---|---|
| Queries are mostly conversational | Good fit | Overkill | Overkill |
| Queries contain IDs/codes/names | Poor fit | Good fit | Best fit |
| Mixed query types | Adequate | Good fit | Best fit |
| Latency budget < 50ms | Good fit | Marginal | Poor fit |
| Latency budget < 300ms | Good fit | Good fit | Good fit |
| Corpus < 10K documents | Good fit | Good fit | Best fit |
| Corpus > 1M documents | Needs tuning | Good fit | Good fit (top-K rerank) |
For most production RAG systems, hybrid search with optional reranking is the default recommendation. The accuracy gains far outweigh the minimal additional complexity and latency. If you are evaluating the broader RAG vs fine-tuning question, see our comparison of RAG and fine-tuning for chatbots.
Performance Results and Benchmarks
We benchmarked hybrid search against pure approaches on 1,000 real customer queries across diverse domains:
| Method | Precision@5 | Recall@10 | MRR | Latency (P95) |
|---|---|---|---|---|
| Semantic Only | 72% | 68% | 0.65 | 85ms |
| Keyword Only | 58% | 71% | 0.52 | 25ms |
| Hybrid (RRF) | 84% | 82% | 0.78 | 95ms |
| Hybrid + Rerank | 91% | 89% | 0.86 | 280ms |
We evaluated with Precision@K, Recall@K, and MRR, the same metrics used by standard retrieval benchmarks such as the MTEB leaderboard.
Key observations:
- Precision@5 rose from 72% to 84% --- a 17% relative improvement over semantic-only
- Recall@10 rose from 68% to 82% --- hybrid finds more relevant documents
- MRR (Mean Reciprocal Rank) improved from 0.65 to 0.78 --- relevant documents appear higher
- Latency stayed low --- parallel execution adds only ~10ms at P95 over semantic-only
What the Research Says
The advantages of hybrid retrieval are well-documented in academic and industry research:
Ma et al. (2024) evaluated hybrid search across multiple benchmark datasets (MS MARCO, Natural Questions, BEIR) and found that RRF-based fusion consistently outperformed either sparse or dense retrieval alone, with gains of 5-20% in nDCG@10 depending on the dataset. The improvements were most pronounced on queries containing rare terms or domain-specific vocabulary.
The BEIR benchmark (Thakur et al., 2021) revealed that dense retrievers (pure vector search) underperformed BM25 on out-of-domain datasets, particularly for entity-heavy queries. Hybrid approaches that combined both signals showed more robust cross-domain performance.
Weaviate's internal benchmarks on production workloads reported that hybrid search with auto-tuned alpha (the weight between sparse and dense signals) improved relevance by 5-15% across their customer base, with the largest gains on technical documentation and support ticket corpora.
Anthropic's RAG research noted that retrieval quality is the single largest driver of RAG answer quality. A 10% improvement in retrieval recall can translate to a 15-25% improvement in end-to-end answer accuracy, making the hybrid search investment highly leveraged.
These findings align with our production experience: hybrid search is not a marginal optimization --- it is a step-change in retrieval quality for any system where query types are diverse.
Dynamic Weighting
Not every query benefits equally from each approach. Consider adapting your strategy based on query characteristics:
| Query Type | Recommended Approach | Why |
|---|---|---|
| Conceptual questions | Semantic-heavy (70/30) | "How does authentication work?" benefits from understanding concepts |
| Specific terms/codes | Keyword-heavy (30/70) | "Error JWT-401" needs exact matching |
| General questions | Balanced hybrid (50/50) | "How do I reset password?" needs both |
| Multi-part queries | Hybrid + rerank | "Cancel Pro plan and get refund" has multiple intents |
You can implement dynamic weighting by analyzing the query before search:
```typescript
function getSearchWeights(query: string): { semantic: number; keyword: number } {
  // Keep the acronym/code check case-sensitive so [A-Z]{2,} catches "JWT" or
  // "API" without matching every ordinary lowercase word
  const hasSpecificTerms =
    /[A-Z]{2,}|\d{3,}/.test(query) || /error\s*code/i.test(query);
  const isConceptual = /^(what|how|why|explain)/i.test(query);

  if (hasSpecificTerms) return { semantic: 0.3, keyword: 0.7 };
  if (isConceptual) return { semantic: 0.7, keyword: 0.3 };
  return { semantic: 0.5, keyword: 0.5 };
}
```
For a more sophisticated approach, train a lightweight classifier on your query logs to predict the optimal weight per query. Even a simple logistic regression on query features (length, presence of numbers, question words, capitalized terms) can meaningfully improve results.
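As a starting point for such a classifier, here is a hypothetical feature extractor. The feature set and the `query_features` name are illustrative, not a prescribed design; the output vector would feed a logistic regression trained on your own query logs:

```python
import re

def query_features(query: str) -> list[float]:
    """Hypothetical features for predicting semantic-vs-keyword weighting."""
    words = query.split()
    return [
        float(len(words)),                                # query length
        float(bool(re.search(r"\d{3,}", query))),         # digit runs: codes, IDs
        float(bool(re.search(r"\b[A-Z]{2,}\b", query))),  # acronyms like JWT, API
        float(bool(re.match(r"(?i)(what|how|why|explain)\b", query))),  # question words
    ]
```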
Practical Optimization Tips
After running hybrid search in production, here are our key learnings:
1. Tune the fusion parameters: The default 50/50 split works well, but test with your actual queries. Some domains benefit from different ratios. Build an evaluation set of 100+ queries with known relevant documents and measure Precision@K at different weight combinations.
2. Cache embeddings aggressively: Query embedding generation is the slowest part of semantic search (typically 50-100ms per call to an external API). Cache recent queries to avoid redundant computation. A simple LRU cache with 10,000 entries covers most repeated queries.
3. Maintain your indexes: Full-text indexes need periodic maintenance. Run REINDEX during low-traffic periods to maintain performance. For pgvector HNSW indexes, monitor recall by periodically testing against exact (sequential scan) results.
4. Monitor result quality: Track which results users actually click or which answers resolve issues. Use this data to tune your approach. Implement A/B testing between different fusion strategies.
5. Consider query expansion: Before searching, expand queries to cover synonyms and related terms. This especially helps the keyword branch catch vocabulary mismatches.
6. Pre-filter before search: Apply metadata filters (e.g., tenant ID, document category, date range) before vector comparison. This reduces the candidate set and speeds up both search branches. In pgvector, partial indexes on common filter columns can make this nearly free.
7. Use appropriate chunk sizes: If you find keyword search returning too many false positives, your chunks may be too long. If semantic search returns irrelevant results, your chunks may be too short. The 500-1000 token range with 100-200 token overlap is a strong starting point, but measure and adjust for your content. For more on chunking and embedding strategies, see our vector search guide.
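The embedding cache from tip 2 can be sketched in a few lines. This is a minimal LRU implementation with an injected embed function (so it works with any provider); the class name and the lowercase-normalization choice are illustrative, not a prescribed design:

```python
from collections import OrderedDict
from typing import Callable

class EmbeddingCache:
    """LRU cache for query embeddings; avoids redundant API calls."""

    def __init__(self, embed_fn: Callable[[str], list[float]], maxsize: int = 10_000):
        self.embed_fn = embed_fn
        self.maxsize = maxsize
        self.cache: OrderedDict[str, list[float]] = OrderedDict()
        self.hits = self.misses = 0

    def get(self, query: str) -> list[float]:
        key = query.strip().lower()  # light normalization improves the hit rate
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        vec = self.embed_fn(query)
        self.cache[key] = vec
        if len(self.cache) > self.maxsize:
            self.cache.popitem(last=False)  # evict least recently used
        return vec
```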
Migrating to Hybrid Search
If you are currently running pure vector search and want to add keyword matching, the migration path depends on your database:
If you use pgvector (PostgreSQL): You already have full-text search built in. Add a tsvector column, create a GIN index, and run both queries in a single round-trip to the database. This is the simplest path and the one we took at Chatsy --- read about our migration from Pinecone to pgvector for lessons learned.
If you use Weaviate: Enable the built-in BM25 module alongside your vector index. Weaviate's hybrid search operator handles fusion internally with a configurable alpha parameter.
If you use Pinecone: Use sparse-dense vectors. Pinecone supports storing both dense (semantic) and sparse (keyword) vector representations, with server-side fusion.
If you use Qdrant: Combine dense vectors with sparse vectors using Qdrant's native sparse vector support. Run both queries and fuse results client-side with RRF.
Conclusion
Pure semantic search revolutionized information retrieval, but it's not the complete answer. Hybrid search combines the conceptual understanding of embeddings with the precision of keyword matching, delivering significantly better results for real-world queries.
The implementation is straightforward: run both searches in parallel, fuse with RRF, and optionally rerank the top results. The math is simple, the latency overhead is minimal, and the accuracy improvements are substantial.
At Chatsy, hybrid search is the default for all AI agents. Your chatbots automatically benefit from both approaches, optimized through thousands of hours of production experience.
For more on building effective RAG systems, check out our technical guides on vector search, query expansion, and our pgvector migration story.
Frequently Asked Questions
What is hybrid search?
Hybrid search combines semantic (vector) search with keyword (BM25) matching in parallel, then merges results using Reciprocal Rank Fusion. It captures the conceptual understanding of embeddings while maintaining the precision of exact term matching for identifiers, codes, and proper nouns.
How does hybrid search differ from keyword search?
Keyword search matches exact terms only and misses synonyms, typos, and paraphrases. Hybrid search runs both keyword and semantic search together, so you get precise matches for terms like "error E-1234" plus conceptual matches for questions like "how to cancel my subscription" --- even when your docs say "terminate membership."
Is hybrid search better than vector search alone?
In benchmarks, hybrid search improved precision by 17% and recall by 14% over semantic-only search. Vector search alone struggles with exact matches (codes, proper nouns, technical identifiers); hybrid search adds keyword matching to fix those gaps while keeping semantic understanding for conceptual queries. Research across BEIR benchmark datasets confirms these gains are consistent across domains.
How do I implement hybrid search?
Run semantic and keyword searches in parallel (e.g., pgvector for vectors, PostgreSQL full-text for keywords), combine results with Reciprocal Rank Fusion (k=60), and optionally re-rank the top candidates with a cross-encoder. Dynamic weighting lets you shift toward keywords for exact-match queries or toward semantics for conceptual questions. See the implementation sections above for complete code examples.
What is the performance impact of hybrid search?
Hybrid search adds minimal latency --- parallel execution keeps P95 latency around 95ms vs 85ms for semantic-only. Adding AI reranking increases latency to approximately 280ms but significantly improves relevance. Cache query embeddings and maintain full-text indexes to keep performance optimal.
What is Reciprocal Rank Fusion (RRF)?
RRF is an algorithm for combining ranked lists from different retrieval systems. Instead of trying to normalize incompatible scores, it uses only rank positions: each document's score is the sum of 1 / (k + rank) across all lists. Documents appearing in multiple lists get boosted. The k parameter (typically 60) controls score decay. RRF is simple to implement, requires no training, and performs competitively with more complex learned fusion methods.
When should I use pure vector search instead of hybrid?
Use pure vector search when your queries are entirely conversational and conceptual, your corpus contains no technical identifiers or codes, and you need sub-50ms latency. For most production RAG systems with diverse query types, hybrid search is the better default.