Vector Search: How AI Chatbots Find Answers
Vector search powers modern AI chatbots. Learn how it works, why it beats keyword search, and how chatbots understand what you mean.

When you ask an AI chatbot "How do I cancel?", how does it know you mean your subscription and not a meeting? The answer is vector search---a technology that understands meaning, not just keywords.
TL;DR:
- Vector search converts text into numerical embeddings that capture meaning, so "cancel my plan" matches "terminate subscription" even without shared keywords --- solving the biggest limitation of traditional keyword search.
- For production use, combine vector search with keyword search (hybrid search) to get both semantic understanding and exact-match precision, then rerank results for best relevance.
- Optimal setup: chunk documents into 500-1000 tokens with 100-200 token overlap, use a quality embedding model (e.g., OpenAI text-embedding-3), and store vectors in a specialized database like pgvector.
The Problem with Keyword Search
Traditional search matches words. Ask "How do I cancel?" and it looks for documents containing "cancel."
But what if your help docs say "terminate subscription" or "end your plan"? Keyword search misses these completely.
| You Ask | Doc Contains | Keyword Match? |
|---|---|---|
| "cancel" | "cancel" | Yes |
| "cancel" | "terminate" | No |
| "cancel" | "end subscription" | No |
| "cancel my plan" | "how to cancel" | Yes |
| "stop my subscription" | "cancel plan" | No |
This is why old chatbots felt so frustrating---slight wording differences broke everything.
Enter Vector Search
Vector search converts text into embeddings---numerical representations that capture meaning. Instead of matching characters, it places text in a high-dimensional mathematical space where proximity corresponds to semantic similarity.
How Embeddings Are Generated
An embedding model is a neural network --- typically a transformer architecture --- trained on massive text corpora. During training, the model learns to map text into a fixed-length vector such that semantically similar inputs land near each other in vector space.
Here is what happens step by step when you embed a sentence:
- Tokenization: The input text is split into subword tokens. "How do I cancel my subscription?" might become ["How", "do", "I", "cancel", "my", "sub", "##scription", "?"].
- Token embedding lookup: Each token is mapped to a learned vector from a vocabulary table.
- Contextual encoding: The transformer passes token vectors through multiple self-attention layers, allowing each token to incorporate context from every other token. The word "cancel" in "cancel my subscription" will have a different internal representation than "cancel" in "cancel the meeting" because the surrounding tokens differ.
- Pooling: The final layer outputs one vector per token. These are collapsed into a single fixed-length vector for the whole input, typically via mean pooling (averaging all token vectors) or by taking the special [CLS] token output.
- Normalization: The resulting vector is often L2-normalized so that cosine similarity reduces to a simple dot product, which is faster to compute.
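The pooling and normalization steps above can be sketched in a few lines of numpy. The token vectors here are made-up toy values (real models produce hundreds of dimensions per token), so this only illustrates the mechanics:

```python
import numpy as np

# Hypothetical per-token outputs from the transformer's final layer:
# 5 tokens, each a 4-dimensional contextual vector (toy values).
token_vectors = np.array([
    [0.2, 0.4, -0.1, 0.3],
    [0.1, 0.5, -0.2, 0.2],
    [0.3, 0.3,  0.0, 0.4],
    [0.2, 0.6, -0.1, 0.1],
    [0.1, 0.4, -0.3, 0.2],
])

# Mean pooling: average all token vectors into one fixed-length vector
sentence_vector = token_vectors.mean(axis=0)

# L2 normalization: scale to unit length so cosine similarity
# reduces to a plain dot product
embedding = sentence_vector / np.linalg.norm(sentence_vector)

print(embedding.shape)            # one vector for the whole sentence
print(np.linalg.norm(embedding))  # unit length after normalization
```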
The output is a list of floating-point numbers --- the embedding:
"How do I cancel?" -> [0.23, -0.45, 0.67, 0.12, ...] (1536 dimensions)
"Terminate my subscription" -> [0.21, -0.43, 0.69, 0.14, ...]
"What's the weather?" -> [-0.56, 0.34, -0.12, 0.78, ...]
Notice how similar meanings have similar numbers? That is the core insight.
Understanding Vector Dimensions
Each number in an embedding vector represents a learned feature of the text. A 1536-dimensional embedding means the model encodes 1536 distinct aspects of meaning. You can think of dimensions loosely as axes in a coordinate system:
- Some dimensions might capture topic (business vs. technology vs. sports).
- Others might encode sentiment, formality, or specificity.
- Most dimensions encode abstract features that do not map to any single human concept.
Higher-dimensional embeddings capture more nuance but require more storage and slow down comparisons. A single 1536-dimensional vector of 32-bit floats consumes 6 KB of memory. At one million documents, that is roughly 6 GB just for the vectors.
| Dimensions | Storage per Vector | Storage for 1M Vectors | Tradeoff |
|---|---|---|---|
| 384 | 1.5 KB | ~1.5 GB | Fast, less nuance |
| 768 | 3 KB | ~3 GB | Balanced |
| 1536 | 6 KB | ~6 GB | Good nuance, moderate cost |
| 3072 | 12 KB | ~12 GB | Maximum nuance, highest cost |
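The table's figures come straight from dimensions times bytes per float. A quick sanity check (the function name is illustrative; the figures exclude index overhead and round GB decimally):

```python
def vector_storage_bytes(dimensions: int, bytes_per_float: int = 4) -> int:
    """Raw storage for one embedding: dimensions x bytes per 32-bit float."""
    return dimensions * bytes_per_float

for dims in (384, 768, 1536, 3072):
    per_vec = vector_storage_bytes(dims)
    per_million = per_vec * 1_000_000
    print(f"{dims:>5} dims: {per_vec / 1024:.1f} KB/vector, "
          f"~{per_million / 1e9:.1f} GB per 1M vectors")
```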
How Similarity Search Works
To find relevant content, the system compares the query vector against all stored document vectors using a distance metric. The most common metric is cosine similarity.
Cosine Similarity: The Math (Simplified)
Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have a cosine similarity of 1.0 (identical meaning); orthogonal vectors score 0.0 (unrelated); opposite vectors score -1.0.
The formula for two vectors A and B:
cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)
Where A . B is the dot product (multiply corresponding elements, then sum) and ||A|| is the magnitude (square root of the sum of squared elements).
Worked example with tiny 4-dimensional vectors:
A = [0.5, 0.3, 0.8, 0.1] ("cancel my subscription")
B = [0.4, 0.35, 0.75, 0.15] ("terminate my plan")
C = [0.1, -0.6, 0.2, 0.9] ("weather forecast today")
Dot product A . B = (0.5*0.4) + (0.3*0.35) + (0.8*0.75) + (0.1*0.15)
= 0.20 + 0.105 + 0.60 + 0.015
= 0.92
||A|| = sqrt(0.25 + 0.09 + 0.64 + 0.01) = sqrt(0.99) = 0.995
||B|| = sqrt(0.16 + 0.1225 + 0.5625 + 0.0225) = sqrt(0.8675) = 0.931
cosine_similarity(A, B) = 0.92 / (0.995 * 0.931) = 0.993 (very similar)
Dot product A . C = (0.5*0.1) + (0.3*-0.6) + (0.8*0.2) + (0.1*0.9)
= 0.05 - 0.18 + 0.16 + 0.09
= 0.12
||C|| = sqrt(0.01 + 0.36 + 0.04 + 0.81) = sqrt(1.22) = 1.105
cosine_similarity(A, C) = 0.12 / (0.995 * 1.105) = 0.109 (not similar)
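The arithmetic above can be checked with a few lines of plain Python:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

A = [0.5, 0.3, 0.8, 0.1]     # "cancel my subscription"
B = [0.4, 0.35, 0.75, 0.15]  # "terminate my plan"
C = [0.1, -0.6, 0.2, 0.9]    # "weather forecast today"

print(round(cosine_similarity(A, B), 3))  # 0.993 (very similar)
print(round(cosine_similarity(A, C), 3))  # 0.109 (not similar)
```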
The retrieval process then ranks all documents by their cosine similarity to the query and returns the top-K results:
Question Vector: [0.23, -0.45, 0.67, ...]
|
Compare to all document vectors
|
Return: "How to cancel your subscription" (similarity: 0.94)
"Ending your plan early" (similarity: 0.89)
"Refund policy" (similarity: 0.72)
Other distance metrics include Euclidean distance (L2) and dot product. Cosine similarity is the safe default when vectors may not be normalized; for L2-normalized vectors, dot product produces the same ranking and is faster to compute.
Vector Search vs Keyword Search: A Concrete Example
To make the difference tangible, consider the same query run against both systems on a knowledge base containing 500 help articles.
Query: "I got charged twice on my credit card"
Keyword search results (BM25)
BM25 looks for documents containing the terms "charged", "twice", "credit", and "card". It ranks by term frequency and inverse document frequency:
| Rank | Document Title | Why It Matched |
|---|---|---|
| 1 | "Credit Card Payment Methods" | Contains "credit card" frequently |
| 2 | "Charged Sales Tax FAQ" | Contains "charged" |
| 3 | "Credit Card Update Instructions" | Contains "credit card" |
| 4 | "Twice-yearly Billing Cycle" | Contains "twice" |
| 5 | "Credit Limit Information" | Contains "credit" |
None of these are what the user needs. The actual relevant article --- "Duplicate Payment and Refund Policy" --- does not contain the phrase "charged twice" or "credit card," so it is invisible to keyword search.
Vector search results
Vector search embeds the query and finds semantically similar documents regardless of vocabulary:
| Rank | Document Title | Similarity |
|---|---|---|
| 1 | "Duplicate Payment and Refund Policy" | 0.94 |
| 2 | "How to Request a Refund for Double Billing" | 0.91 |
| 3 | "Billing Dispute Resolution Process" | 0.87 |
| 4 | "Payment Error Troubleshooting" | 0.83 |
| 5 | "Credit Card Payment Methods" | 0.71 |
The top results directly address the user's problem. "Duplicate payment," "double billing," and "billing dispute" are all semantically close to "charged twice" even though they share few keywords.
This example illustrates why vector search is essential for any system where users phrase questions in their own words rather than using the exact terminology in your documentation. For cases where you need both semantic understanding and exact-match precision (like error codes or product SKUs), hybrid search combines the two approaches --- read our deep dive on hybrid search.
Why Vector Search Is Better for AI Applications
Understands Synonyms
"cancel," "terminate," "end," and "stop" cluster together in vector space.
Handles Paraphrasing
"I want my money back" finds "refund policy" even without shared words.
Language-Agnostic
Modern embeddings work across languages---a Spanish question can match English docs.
Typo-Tolerant
"How do I cancle my subcription" still works because the meaning is captured.
The Technical Details
Embedding Models
Popular models for text embeddings:
| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose |
| OpenAI text-embedding-3-small | 1536 | Cost-efficient general use |
| Cohere embed-v3 | 1024 | Multilingual |
| Voyage-3 | 1024 | Long documents |
| BGE-large | 1024 | Open source option |
| nomic-embed-text-v1.5 | 768 | Open source, Matryoshka support |
Higher dimensions = more nuance, but slower search and higher storage cost. Many newer models support Matryoshka embeddings, allowing you to truncate vectors to fewer dimensions at query time to trade accuracy for speed.
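Matryoshka truncation is mechanically simple: keep the leading components and re-normalize. A minimal sketch, assuming the embedding model was actually trained with Matryoshka representation learning (the helper name is my own; a random unit vector stands in for a real embedding):

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components, then re-normalize to unit length.
    Only meaningful for Matryoshka-trained embedding models."""
    truncated = np.asarray(vec, dtype=np.float32)[:dims]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a real 1536-dimensional embedding
full = np.random.default_rng(0).normal(size=1536).astype(np.float32)
full /= np.linalg.norm(full)

short = truncate_embedding(full, 512)
print(short.shape)            # 3x fewer dimensions to store and compare
print(np.linalg.norm(short))  # still unit length
```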
Chunking Strategy
Before embedding, documents are split into chunks:
Full Document (5000 words)
|
Chunk 1 (500 words): "Subscription Management..."
Chunk 2 (500 words): "Cancellation Policy..."
Chunk 3 (500 words): "Refund Process..."
Why chunk?
- LLMs have context limits
- Smaller chunks = more precise matches
- Better relevance scoring
Optimal chunk size: 500-1000 tokens with 100-200 token overlap. The overlap ensures that information spanning a chunk boundary is not lost.
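The sliding-window idea behind overlapping chunks can be sketched as follows. This is a simplified illustration using whitespace-split words as a stand-in for real tokenizer tokens; the function name and defaults are my own:

```python
def chunk_tokens(tokens, chunk_size=800, overlap=150):
    """Split a token list into chunks of `chunk_size`, each sharing
    `overlap` tokens with the previous chunk so boundary-spanning
    information appears in at least one chunk intact."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks

words = ("lorem " * 2000).split()   # stand-in for a tokenized document
chunks = chunk_tokens(words)
print(len(chunks), [len(c) for c in chunks])
```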
Choosing a Vector Database
Storing and searching vectors at scale requires specialized infrastructure. Here is how the leading options compare:
| Feature | pgvector | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|
| Type | PostgreSQL extension | Managed SaaS | Open source / cloud | Open source / cloud |
| Hosting | Self-managed or managed Postgres | Fully managed | Self-hosted or Weaviate Cloud | Self-hosted or Qdrant Cloud |
| Max Dimensions | 16,000 (2,000 for HNSW/IVFFlat indexes) | 20,000 | Unlimited | Unlimited |
| Index Types | IVFFlat, HNSW | Proprietary | HNSW | HNSW, scalar/product quantization |
| Filtering | Full SQL WHERE clauses | Metadata filters | GraphQL-based filters | Payload filters with indexed fields |
| Hybrid Search | Combine with pg full-text search | Sparse-dense vectors | Built-in BM25 + vector | Sparse vectors + dense vectors |
| Scaling | Vertical (or Citus for horizontal) | Automatic horizontal | Automatic horizontal | Horizontal sharding |
| Pricing | Free (open source) | Pay per vector + query | Free self-hosted; cloud tiers | Free self-hosted; cloud tiers |
| Best For | Teams already on Postgres; cost control | Quick start; zero ops | Feature-rich search apps | High-performance, low-latency needs |
pgvector is the right choice if you already run PostgreSQL. You get vector search without adding another database to your stack. It handles millions of vectors well on a single node, and you retain full SQL capabilities for filtering, joins, and transactions. At Chatsy, we migrated from Pinecone to pgvector and reduced our infrastructure cost while gaining operational simplicity.
Pinecone is the easiest managed option. If you want zero infrastructure management and are willing to pay for it, Pinecone handles sharding, replication, and scaling automatically. The tradeoff is vendor lock-in and cost at scale.
Weaviate stands out for built-in hybrid search and a rich module ecosystem (auto-vectorization, generative search). It is a strong choice for teams building search-heavy applications that need more than raw vector similarity.
Qdrant excels in raw query performance and offers advanced quantization options that reduce memory usage by 4-32x with minimal accuracy loss. It is well-suited for latency-sensitive applications with large vector collections.
Performance at Scale
Vector search performance depends heavily on your indexing strategy. The two dominant approaches in production are IVFFlat and HNSW.
IVFFlat (Inverted File with Flat Compression)
IVFFlat partitions vectors into clusters using k-means. At query time, it searches only the nearest clusters rather than scanning every vector.
- Build time: Fast. Indexing 1M vectors typically takes minutes.
- Query time: Moderate. Accuracy depends on nprobe (the number of clusters searched).
- Memory: Low. Stores original vectors without a graph structure.
- Tradeoff: You must choose nprobe at query time. Too low = missed results. Too high = slow queries.
```sql
-- Create an IVFFlat index in pgvector
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);  -- number of clusters

-- Tune search accuracy at query time
SET ivfflat.probes = 10;  -- search 10 of 1000 clusters
```
HNSW (Hierarchical Navigable Small World)
HNSW builds a multi-layer graph where each node connects to its approximate nearest neighbors. Queries traverse the graph from top (sparse) layers to bottom (dense) layers, narrowing in on the nearest vectors.
- Build time: Slow. Indexing 1M vectors can take hours.
- Query time: Fast. Consistently sub-millisecond for well-tuned indexes.
- Memory: High. The graph structure adds significant overhead (roughly 2-4x the raw vector size).
- Tradeoff: Best accuracy-speed ratio in production, but requires more memory and longer build times.
```sql
-- Create an HNSW index in pgvector
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Tune search accuracy at query time
SET hnsw.ef_search = 100;
```
Benchmarks
Performance numbers vary by hardware, dataset, and configuration. The following are representative benchmarks from a single-node PostgreSQL instance (8 vCPU, 32 GB RAM) with 1 million 1536-dimensional vectors:
| Metric | IVFFlat (probes=10) | IVFFlat (probes=50) | HNSW (ef=100) | HNSW (ef=400) |
|---|---|---|---|---|
| Recall@10 | 85% | 95% | 98% | 99.5% |
| P95 Latency | 8 ms | 35 ms | 5 ms | 18 ms |
| Index Build Time | 4 min | 4 min | 90 min | 90 min |
| Index Size on Disk | 6.1 GB | 6.1 GB | 9.8 GB | 9.8 GB |
For most production workloads, HNSW with ef_search=100 offers the best balance of recall and latency. Use IVFFlat when you need faster index builds (e.g., frequently updated datasets) or have tight memory constraints.
At higher scales (10M+ vectors), consider:
- Quantization: Reduce vector size from 32-bit to 8-bit or binary, cutting memory 4-32x with 1-5% recall loss.
- Partitioning: Shard vectors by tenant or category so each query scans a smaller index.
- Pre-filtering: Apply metadata filters before vector comparison to reduce the candidate set.
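The quantization idea is straightforward to illustrate. A minimal sketch of symmetric int8 scalar quantization (the function names are my own; production systems usually quantize per dimension and keep the original vectors for reranking):

```python
import numpy as np

def quantize_int8(vectors):
    """Map float32 values to int8 with a single global scale factor.
    Cuts memory 4x; recall loss is typically small for search."""
    scale = float(np.abs(vectors).max()) / 127.0
    return np.round(vectors / scale).astype(np.int8), scale

def dequantize(quantized, scale):
    """Approximate reconstruction of the original floats."""
    return quantized.astype(np.float32) * scale

rng = np.random.default_rng(42)
vecs = rng.normal(size=(1000, 1536)).astype(np.float32)

q, scale = quantize_int8(vecs)
print(vecs.nbytes // q.nbytes)  # memory reduction factor
```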
Implementing Vector Search
Python Example: Embed and Query with pgvector
This example shows the complete flow: generating an embedding, storing it, and querying it.
```python
from openai import OpenAI
import psycopg2
from pgvector.psycopg2 import register_vector  # adapts Python lists/arrays to the vector type
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    """Generate an embedding for a text string."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# --- Store embeddings in pgvector ---
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()

# Enable pgvector, register the type adapter, and create the table
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(conn)
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        embedding vector(1536)
    );
""")

# Insert documents with their embeddings
docs = [
    "How to cancel your subscription",
    "Refund policy for annual plans",
    "Pricing and plan comparison",
    "Updating your payment method",
]
for doc in docs:
    doc_embedding = embed(doc)
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s)",
        (doc, np.array(doc_embedding))
    )
conn.commit()

# Create an HNSW index for fast search
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
""")
conn.commit()

# --- Query ---
query = "I want my money back"
query_embedding = np.array(embed(query))

cur.execute("""
    SELECT content, 1 - (embedding <=> %s) AS similarity
    FROM documents
    ORDER BY embedding <=> %s
    LIMIT 3;
""", (query_embedding, query_embedding))

for content, similarity in cur.fetchall():
    print(f"{similarity:.3f}  {content}")

# Output:
# 0.891  Refund policy for annual plans
# 0.834  How to cancel your subscription
# 0.712  Pricing and plan comparison
```
JavaScript/TypeScript Example
```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool({ connectionString: "postgresql://localhost/mydb" });

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    input: text,
    model: "text-embedding-3-small",
  });
  return response.data[0].embedding;
}

async function search(query: string, limit = 5) {
  const queryEmbedding = await embed(query);

  // pgvector uses <=> for cosine distance (1 - similarity)
  const result = await pool.query(
    `SELECT content, 1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit]
  );
  return result.rows;
}

// Usage
const results = await search("I want my money back");
console.log(results);
// [
//   { content: "Refund policy for annual plans", similarity: 0.891 },
//   { content: "How to cancel your subscription", similarity: 0.834 },
//   ...
// ]
```
Production Considerations
For real applications:
- Use a vector database (not in-memory arrays) --- see the comparison table above.
- Implement caching for common queries. Embedding the same query repeatedly wastes time and money.
- Add hybrid search for exact matches on codes, identifiers, and proper nouns.
- Use async/batch operations for bulk embedding. OpenAI supports batching up to 2048 inputs per request.
- Monitor embedding costs. At OpenAI's pricing, embedding 1M tokens with text-embedding-3-small costs roughly $0.02, but this adds up with large document sets and frequent re-indexing.
- Version your embeddings. If you switch models or update model versions, old and new embeddings are incompatible --- you must re-embed your entire corpus.
Limitations of Pure Vector Search
Vector search is not perfect:
1. Exact Match Failures
"Error code E-1234" might not match "E-1234 error" well because embeddings focus on semantic meaning, not exact strings.
2. Rare Terms
Uncommon product names or technical terms may not embed well if they were underrepresented in the model's training data.
3. Negation Confusion
"I don't want to cancel" and "I want to cancel" have similar embeddings despite opposite meanings. Embedding models encode topic more strongly than logical operators.
4. Context Window Limits
Embedding models have maximum input lengths (typically 512 to 8192 tokens). Text beyond the limit is truncated, potentially losing critical information.
The Solution: Hybrid Search
Combine vector and keyword search:
User Question
|
+------------------+------------------+
| Vector Search | Keyword Search |
| (meaning) | (exact terms) |
+------------------+------------------+
| |
+--------+-----------+
|
Combine & Rerank
|
Final Results
This gets the best of both worlds---semantic understanding AND exact matching. For a thorough walkthrough of implementation and benchmarks, read Hybrid Search Explained: Best of Both Worlds.
How Chatsy Uses Vector Search
Our retrieval-augmented generation pipeline:
- Query Expansion: Generate synonyms and related queries
- Hybrid Search: Vector + keyword across all queries
- Reciprocal Rank Fusion: Combine results intelligently
- Reranking: Use a cross-encoder for final relevance scoring
- Context Assembly: Select best chunks for the LLM
This multi-stage approach delivers 94%+ relevant answer rates. If you are evaluating whether to use RAG or fine-tune a model for your chatbot, see our comparison of RAG vs fine-tuning for chatbots.
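The Reciprocal Rank Fusion step in the pipeline above can be sketched in a few lines. The document IDs here are hypothetical; k=60 is the constant from the original RRF formulation:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank))
    across every list it appears in, so documents ranked well by
    multiple retrievers rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical top results from each retriever
vector_hits  = ["refund-policy", "cancel-guide", "billing-dispute"]
keyword_hits = ["billing-dispute", "refund-policy", "payment-methods"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
# ['refund-policy', 'billing-dispute', 'cancel-guide', 'payment-methods']
```

Documents found by both retrievers ("refund-policy", "billing-dispute") outrank those found by only one, which is exactly the behavior hybrid search wants.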
Key Takeaways
- Vector search understands meaning, not just words
- Embeddings are numerical representations generated by transformer models that encode semantic similarity
- Cosine similarity measures the angle between vectors to determine relevance
- HNSW indexing provides the best speed-accuracy tradeoff for production workloads
- Hybrid search combines vector and keyword approaches for comprehensive retrieval
- pgvector offers a strong, cost-effective starting point for teams already on PostgreSQL
- Chunking matters --- 500-1000 tokens with overlap keeps precision high
Try It Today
Chatsy handles all this complexity for you. Upload your docs, and we automatically:
- Chunk content optimally
- Generate embeddings
- Enable hybrid search
- Apply query expansion
- Rerank results
Want more? Read about hybrid search and query expansion.
Frequently Asked Questions
What is vector search?
Vector search is a technique that finds relevant content by meaning rather than exact keywords. It converts text into numerical embeddings (vectors) using transformer-based AI models and compares them for similarity using metrics like cosine similarity---so "cancel my plan" matches "terminate subscription" even without shared words. It powers modern AI chatbots, retrieval-augmented generation systems, and semantic search.
How does vector search differ from keyword search?
Keyword search matches exact words: "cancel" only finds documents containing "cancel." Vector search understands meaning: "I want my money back" finds "refund policy" even with no shared terms. Vector search handles synonyms, paraphrasing, and typos; keyword search does not. For production, combine both in hybrid search for best results.
What are embeddings?
Embeddings are numerical representations of text produced by transformer-based AI models. Each piece of text becomes a list of numbers (e.g., 1024 or 3072 dimensions) that capture its meaning. Similar meanings produce similar vectors---"cancel" and "terminate" have close embeddings. The model generates these by processing text through self-attention layers and pooling the output into a fixed-length vector. Embeddings enable similarity search across large document collections.
What are the main use cases for vector search?
Vector search powers AI chatbots, semantic search in help centers, document retrieval for RAG systems, and recommendation engines. It excels when users phrase questions differently than your documentation (synonyms, paraphrasing) or when you need meaning-based matching across multilingual or long-form content.
Is vector search faster than keyword search?
Vector search can be fast with proper indexing (HNSW delivers sub-10ms P95 latency at 1M vectors) but typically requires more compute and memory than simple keyword lookup. For production, use a specialized vector database (pgvector, Pinecone, Weaviate, or Qdrant) and combine with keyword search in hybrid retrieval---then rerank for best relevance and performance.
Which vector database should I choose?
If you already use PostgreSQL, start with pgvector --- it avoids adding a new database to your stack and handles millions of vectors on a single node. For zero-ops managed infrastructure, consider Pinecone. For built-in hybrid search features, look at Weaviate. For maximum query performance with advanced quantization, try Qdrant. Read about our migration from Pinecone to pgvector for a real-world comparison.
What is the difference between IVFFlat and HNSW indexing?
IVFFlat partitions vectors into clusters and searches only nearby clusters at query time. It builds quickly but requires tuning the number of clusters to probe. HNSW builds a multi-layer graph connecting approximate nearest neighbors, delivering faster queries and higher recall at the cost of longer build times and more memory. For most production workloads, HNSW is the better default.