Chatsy

Vector Search: How AI Chatbots Find Answers

Vector search powers modern AI chatbots. Learn how it works, why it beats keyword search, and how chatbots understand what you mean.

Chatsy Team
Engineering
January 12, 2026 · Updated: March 30, 2026
20 min read

When you ask an AI chatbot "How do I cancel?", how does it know you mean your subscription and not a meeting? The answer is vector search---a technology that understands meaning, not just keywords.

TL;DR:

  • Vector search converts text into numerical embeddings that capture meaning, so "cancel my plan" matches "terminate subscription" even without shared keywords --- solving the biggest limitation of traditional keyword search.
  • For production use, combine vector search with keyword search (hybrid search) to get both semantic understanding and exact-match precision, then rerank results for best relevance.
  • Optimal setup: chunk documents into 500-1000 tokens with 100-200 token overlap, use a quality embedding model (e.g., OpenAI text-embedding-3), and store vectors in a specialized database like pgvector.

The Problem with Keyword Search

Traditional search matches words. Ask "How do I cancel?" and it looks for documents containing "cancel."

But what if your help docs say "terminate subscription" or "end your plan"? Keyword search misses these completely.

| You Ask | Doc Contains | Keyword Match? |
|---|---|---|
| "cancel" | "cancel" | Yes |
| "cancel" | "terminate" | No |
| "cancel" | "end subscription" | No |
| "cancel my plan" | "how to cancel" | Yes |
| "stop my subscription" | "cancel plan" | No |

This is why old chatbots felt so frustrating---slight wording differences broke everything.

How Vector Search Works

Vector search converts text into embeddings---numerical representations that capture meaning. Instead of matching characters, it places text in a high-dimensional mathematical space where proximity corresponds to semantic similarity.

How Embeddings Are Generated

An embedding model is a neural network --- typically a transformer architecture --- trained on massive text corpora. During training, the model learns to map text into a fixed-length vector such that semantically similar inputs land near each other in vector space.

Here is what happens step by step when you embed a sentence:

  1. Tokenization: The input text is split into subword tokens. "How do I cancel my subscription?" might become ["How", "do", "I", "cancel", "my", "sub", "##scription", "?"].
  2. Token embedding lookup: Each token is mapped to a learned vector from a vocabulary table.
  3. Contextual encoding: The transformer passes token vectors through multiple self-attention layers, allowing each token to incorporate context from every other token. The word "cancel" in "cancel my subscription" will have a different internal representation than "cancel" in "cancel the meeting" because the surrounding tokens differ.
  4. Pooling: The final layer outputs one vector per token. These are collapsed into a single fixed-length vector for the whole input, typically via mean pooling (averaging all token vectors) or using the special [CLS] token output.
  5. Normalization: The resulting vector is often L2-normalized so that cosine similarity reduces to a simple dot product, which is faster to compute.
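Steps 4 and 5 are easy to make concrete. The sketch below uses random vectors as stand-ins for real transformer token outputs and assumes mean pooling:

```python
import numpy as np

# Stand-in for transformer output: one vector per token
# (a real model produces these from its self-attention layers).
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(8, 8))  # 8 tokens, 8 dimensions

# Step 4: mean pooling collapses per-token vectors into one.
sentence_vector = token_vectors.mean(axis=0)

# Step 5: L2 normalization, so cosine similarity reduces to a dot product.
sentence_vector /= np.linalg.norm(sentence_vector)

print(sentence_vector.shape)              # (8,)
print(round(float(np.linalg.norm(sentence_vector)), 6))  # 1.0
```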

The output is a list of floating-point numbers --- the embedding:

"How do I cancel?" -> [0.23, -0.45, 0.67, 0.12, ...]  (1536 dimensions)
"Terminate my subscription" -> [0.21, -0.43, 0.69, 0.14, ...]
"What's the weather?" -> [-0.56, 0.34, -0.12, 0.78, ...]

Notice how similar meanings have similar numbers? That is the core insight.

Understanding Vector Dimensions

Each number in an embedding vector represents a learned feature of the text. A 1536-dimensional embedding means the model encodes 1536 distinct aspects of meaning. You can think of dimensions loosely as axes in a coordinate system:

  • Some dimensions might capture topic (business vs. technology vs. sports).
  • Others might encode sentiment, formality, or specificity.
  • Most dimensions encode abstract features that do not map to any single human concept.

Higher-dimensional embeddings capture more nuance but cost more to store and compare. A single 1536-dimensional vector of 32-bit floats consumes 6 KB of memory. At one million documents, that is roughly 6 GB just for the vectors.
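The storage arithmetic is easy to verify:

```python
dims = 1536
bytes_per_float = 4                 # 32-bit float

per_vector = dims * bytes_per_float # bytes for one embedding
total = per_vector * 1_000_000      # bytes for one million embeddings

print(per_vector / 1024, "KB per vector")          # 6.0 KB per vector
print(round(total / 1e9, 3), "GB for 1M vectors")  # 6.144 GB for 1M vectors
```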

| Dimensions | Storage per Vector | Storage for 1M Vectors | Tradeoff |
|---|---|---|---|
| 384 | 1.5 KB | ~1.5 GB | Fast, less nuance |
| 768 | 3 KB | ~3 GB | Balanced |
| 1536 | 6 KB | ~6 GB | Good nuance, moderate cost |
| 3072 | 12 KB | ~12 GB | Maximum nuance, highest cost |

How Similarity Search Works

To find relevant content, the system compares the query vector against all stored document vectors using a distance metric. The most common metric is cosine similarity.

Cosine Similarity: The Math (Simplified)

Cosine similarity measures the angle between two vectors, ignoring their magnitude. Two vectors pointing in the same direction have a cosine similarity of 1.0 (identical meaning); orthogonal vectors score 0.0 (unrelated); opposite vectors score -1.0.

The formula for two vectors A and B:

cosine_similarity(A, B) = (A . B) / (||A|| * ||B||)

Where A . B is the dot product (multiply corresponding elements, then sum) and ||A|| is the magnitude (square root of the sum of squared elements).

Worked example with tiny 4-dimensional vectors:

A = [0.5, 0.3, 0.8, 0.1]   ("cancel my subscription")
B = [0.4, 0.35, 0.75, 0.15] ("terminate my plan")
C = [0.1, -0.6, 0.2, 0.9]  ("weather forecast today")

Dot product A . B = (0.5*0.4) + (0.3*0.35) + (0.8*0.75) + (0.1*0.15)
                  = 0.20 + 0.105 + 0.60 + 0.015
                  = 0.92

||A|| = sqrt(0.25 + 0.09 + 0.64 + 0.01) = sqrt(0.99) = 0.995
||B|| = sqrt(0.16 + 0.1225 + 0.5625 + 0.0225) = sqrt(0.8675) = 0.931

cosine_similarity(A, B) = 0.92 / (0.995 * 0.931) = 0.993  (very similar)

Dot product A . C = (0.5*0.1) + (0.3*-0.6) + (0.8*0.2) + (0.1*0.9)
                  = 0.05 - 0.18 + 0.16 + 0.09
                  = 0.12

||C|| = sqrt(0.01 + 0.36 + 0.04 + 0.81) = sqrt(1.22) = 1.105

cosine_similarity(A, C) = 0.12 / (0.995 * 1.105) = 0.109  (not similar)
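The same worked example can be checked in a few lines of NumPy:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

A = [0.5, 0.3, 0.8, 0.1]     # "cancel my subscription"
B = [0.4, 0.35, 0.75, 0.15]  # "terminate my plan"
C = [0.1, -0.6, 0.2, 0.9]    # "weather forecast today"

print(round(cosine_similarity(A, B), 3))  # 0.993  (very similar)
print(round(cosine_similarity(A, C), 3))  # 0.109  (not similar)
```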

The retrieval process then ranks all documents by their cosine similarity to the query and returns the top-K results:

Question Vector: [0.23, -0.45, 0.67, ...]
                    |
Compare to all document vectors
                    |
Return: "How to cancel your subscription" (similarity: 0.94)
        "Ending your plan early" (similarity: 0.89)
        "Refund policy" (similarity: 0.72)

Other distance metrics include Euclidean distance (L2) and the raw dot product. Cosine similarity is the safer default when vectors are not normalized; once vectors are L2-normalized, the dot product produces the same ranking and is faster to compute.
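A quick demonstration of why normalization matters: after L2-normalizing, the plain dot product equals the cosine similarity, so the division can be skipped at query time.

```python
import numpy as np

a = np.array([0.5, 0.3, 0.8, 0.1])
b = np.array([0.4, 0.35, 0.75, 0.15])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Normalize once, at index time...
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

# ...then the plain dot product gives the same score.
print(bool(np.isclose(np.dot(a_n, b_n), cosine)))  # True
```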

Vector Search vs Keyword Search: A Concrete Example

To make the difference tangible, consider the same query run against both systems on a knowledge base containing 500 help articles.

Query: "I got charged twice on my credit card"

Keyword search results (BM25)

BM25 looks for documents containing the terms "charged", "twice", "credit", and "card". It ranks by term frequency and inverse document frequency:

| Rank | Document Title | Why It Matched |
|---|---|---|
| 1 | "Credit Card Payment Methods" | Contains "credit card" frequently |
| 2 | "Charged Sales Tax FAQ" | Contains "charged" |
| 3 | "Credit Card Update Instructions" | Contains "credit card" |
| 4 | "Twice-yearly Billing Cycle" | Contains "twice" |
| 5 | "Credit Limit Information" | Contains "credit" |

None of these are what the user needs. The actual relevant article --- "Duplicate Payment and Refund Policy" --- does not contain the phrase "charged twice" or "credit card," so it is invisible to keyword search.

Vector search results

Vector search embeds the query and finds semantically similar documents regardless of vocabulary:

| Rank | Document Title | Similarity |
|---|---|---|
| 1 | "Duplicate Payment and Refund Policy" | 0.94 |
| 2 | "How to Request a Refund for Double Billing" | 0.91 |
| 3 | "Billing Dispute Resolution Process" | 0.87 |
| 4 | "Payment Error Troubleshooting" | 0.83 |
| 5 | "Credit Card Payment Methods" | 0.71 |

The top results directly address the user's problem. "Duplicate payment," "double billing," and "billing dispute" are all semantically close to "charged twice" even though they share few keywords.

This example illustrates why vector search is essential for any system where users phrase questions in their own words rather than using the exact terminology in your documentation. For cases where you need both semantic understanding and exact-match precision (like error codes or product SKUs), hybrid search combines the two approaches --- read our deep dive on hybrid search.

Why Vector Search Is Better for AI Applications

Understands Synonyms

"cancel," "terminate," "end," and "stop" cluster together in vector space.

Handles Paraphrasing

"I want my money back" finds "refund policy" even without shared words.

Language-Agnostic

Modern embeddings work across languages---a Spanish question can match English docs.

Typo-Tolerant

"How do I cancle my subcription" still works because the meaning is captured.

The Technical Details

Embedding Models

Popular models for text embeddings:

| Model | Dimensions | Best For |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose |
| OpenAI text-embedding-3-small | 1536 | Cost-efficient general use |
| Cohere embed-v3 | 1024 | Multilingual |
| Voyage-3 | 1024 | Long documents |
| BGE-large | 1024 | Open source option |
| nomic-embed-text-v1.5 | 768 | Open source, Matryoshka support |

Higher dimensions = more nuance, but slower search and higher storage cost. Many newer models support Matryoshka embeddings, allowing you to truncate vectors to fewer dimensions at query time to trade accuracy for speed.
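Truncating a Matryoshka embedding is just slicing plus re-normalization. The vector below is random, standing in for a real model output; note this only preserves accuracy for models trained with Matryoshka representation learning:

```python
import numpy as np

# Stand-in for a 768-dimensional model output.
full = np.random.default_rng(1).normal(size=768)
full /= np.linalg.norm(full)

# Keep only the first 256 dimensions, then re-normalize
# so cosine scores remain comparable.
short = full[:256].copy()
short /= np.linalg.norm(short)

print(short.shape, round(float(np.linalg.norm(short)), 3))  # (256,) 1.0
```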

Chunking Strategy

Before embedding, documents are split into chunks:

Full Document (5000 words)
    |
Chunk 1 (500 words): "Subscription Management..."
Chunk 2 (500 words): "Cancellation Policy..."
Chunk 3 (500 words): "Refund Process..."

Why chunk?

  • LLMs have context limits
  • Smaller chunks = more precise matches
  • Better relevance scoring

Optimal chunk size: 500-1000 tokens with 100-200 token overlap. The overlap ensures that information spanning a chunk boundary is not lost.
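A minimal sliding-window chunker, sketched here with whitespace words standing in for tokenizer tokens (a production version would count tokens with the embedding model's own tokenizer):

```python
def chunk(words: list[str], size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split a word list into overlapping windows of `size` words."""
    step = size - overlap
    return [words[i:i + size] for i in range(0, max(len(words) - overlap, 1), step)]

doc = ["word"] * 1200
chunks = chunk(doc, size=500, overlap=100)
print([len(c) for c in chunks])  # [500, 500, 400]
```

Each chunk repeats the last 100 words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.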

Choosing a Vector Database

Storing and searching vectors at scale requires specialized infrastructure. Here is how the leading options compare:

| Feature | pgvector | Pinecone | Weaviate | Qdrant |
|---|---|---|---|---|
| Type | PostgreSQL extension | Managed SaaS | Open source / cloud | Open source / cloud |
| Hosting | Self-managed or managed Postgres | Fully managed | Self-hosted or Weaviate Cloud | Self-hosted or Qdrant Cloud |
| Max Dimensions | Unlimited | 20,000 | Unlimited | Unlimited |
| Index Types | IVFFlat, HNSW | Proprietary | HNSW | HNSW, scalar/product quantization |
| Filtering | Full SQL WHERE clauses | Metadata filters | GraphQL-based filters | Payload filters with indexed fields |
| Hybrid Search | Combine with pg full-text search | Sparse-dense vectors | Built-in BM25 + vector | Sparse vectors + dense vectors |
| Scaling | Vertical (or Citus for horizontal) | Automatic horizontal | Automatic horizontal | Horizontal sharding |
| Pricing | Free (open source) | Pay per vector + query | Free self-hosted; cloud tiers | Free self-hosted; cloud tiers |
| Best For | Teams already on Postgres; cost control | Quick start; zero ops | Feature-rich search apps | High-performance, low-latency needs |

pgvector is the right choice if you already run PostgreSQL. You get vector search without adding another database to your stack. It handles millions of vectors well on a single node, and you retain full SQL capabilities for filtering, joins, and transactions. At Chatsy, we migrated from Pinecone to pgvector and reduced our infrastructure cost while gaining operational simplicity.

Pinecone is the easiest managed option. If you want zero infrastructure management and are willing to pay for it, Pinecone handles sharding, replication, and scaling automatically. The tradeoff is vendor lock-in and cost at scale.

Weaviate stands out for built-in hybrid search and a rich module ecosystem (auto-vectorization, generative search). It is a strong choice for teams building search-heavy applications that need more than raw vector similarity.

Qdrant excels in raw query performance and offers advanced quantization options that reduce memory usage by 4-32x with minimal accuracy loss. It is well-suited for latency-sensitive applications with large vector collections.

Performance at Scale

Vector search performance depends heavily on your indexing strategy. The two dominant approaches in production are IVFFlat and HNSW.

IVFFlat (Inverted File with Flat Compression)

IVFFlat partitions vectors into clusters using k-means. At query time, it searches only the nearest clusters rather than scanning every vector.

  • Build time: Fast. Indexing 1M vectors typically takes minutes.
  • Query time: Moderate. Accuracy depends on nprobe (number of clusters searched).
  • Memory: Low. Stores original vectors without a graph structure.
  • Tradeoff: You must choose nprobe at query time. Too low = missed results. Too high = slow queries.
```sql
-- Create an IVFFlat index in pgvector
CREATE INDEX ON chunks
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 1000);  -- number of clusters

-- Tune search accuracy at query time
SET ivfflat.probes = 10;  -- search 10 of 1000 clusters
```

HNSW (Hierarchical Navigable Small World)

HNSW builds a multi-layer graph where each node connects to its approximate nearest neighbors. Queries traverse the graph from top (sparse) layers to bottom (dense) layers, narrowing in on the nearest vectors.

  • Build time: Slow. Indexing 1M vectors can take hours.
  • Query time: Fast. Single-digit millisecond P95 latency for well-tuned indexes.
  • Memory: High. The graph structure adds significant overhead (roughly 2-4x the raw vector size).
  • Tradeoff: Best accuracy-speed ratio in production, but requires more memory and longer build times.
```sql
-- Create an HNSW index in pgvector
CREATE INDEX ON chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 200);

-- Tune search accuracy at query time
SET hnsw.ef_search = 100;
```

Benchmarks

Performance numbers vary by hardware, dataset, and configuration. The following are representative benchmarks from a single-node PostgreSQL instance (8 vCPU, 32 GB RAM) with 1 million 1536-dimensional vectors:

| Metric | IVFFlat (probes=10) | IVFFlat (probes=50) | HNSW (ef=100) | HNSW (ef=400) |
|---|---|---|---|---|
| Recall@10 | 85% | 95% | 98% | 99.5% |
| P95 Latency | 8 ms | 35 ms | 5 ms | 18 ms |
| Index Build Time | 4 min | 4 min | 90 min | 90 min |
| Index Size on Disk | 6.1 GB | 6.1 GB | 9.8 GB | 9.8 GB |

For most production workloads, HNSW with ef_search=100 offers the best balance of recall and latency. Use IVFFlat when you need faster index builds (e.g., frequently updated datasets) or have tight memory constraints.

At higher scales (10M+ vectors), consider:

  • Quantization: Reduce vector size from 32-bit to 8-bit or binary, cutting memory 4-32x with 1-5% recall loss.
  • Partitioning: Shard vectors by tenant or category so each query scans a smaller index.
  • Pre-filtering: Apply metadata filters before vector comparison to reduce the candidate set.
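Scalar quantization, the first of these, can be sketched in a few lines. This is a simplified version; production systems typically calibrate scales per segment or per dimension:

```python
import numpy as np

# Stand-in for one stored embedding.
vec = np.random.default_rng(2).normal(size=1536).astype(np.float32)

# Map the observed float range onto the int8 range [-127, 127].
scale = float(np.abs(vec).max()) / 127.0
quantized = np.round(vec / scale).astype(np.int8)

# Dequantize to approximate the original vector.
restored = quantized.astype(np.float32) * scale

print(vec.nbytes, "->", quantized.nbytes, "bytes")  # 6144 -> 1536 bytes (4x smaller)
print("max error:", float(np.abs(vec - restored).max()) <= scale / 2 + 1e-6)
```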

Python Example: Embed and Query with pgvector

This example shows the complete flow: generating an embedding, storing it, and querying it.

```python
from openai import OpenAI
import psycopg2
import numpy as np

client = OpenAI()

def embed(text: str) -> list[float]:
    """Generate an embedding for a text string."""
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors.
    (Shown for reference; the database computes similarity below.)"""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# --- Store embeddings in pgvector ---
conn = psycopg2.connect("postgresql://localhost/mydb")
cur = conn.cursor()

# Enable pgvector and create the table
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        content TEXT NOT NULL,
        embedding vector(1536)
    );
""")

# Insert documents with their embeddings.
# pgvector accepts the vector's text form, e.g. '[0.1, 0.2, ...]'.
docs = [
    "How to cancel your subscription",
    "Refund policy for annual plans",
    "Pricing and plan comparison",
    "Updating your payment method",
]
for doc in docs:
    cur.execute(
        "INSERT INTO documents (content, embedding) VALUES (%s, %s::vector)",
        (doc, str(embed(doc))),
    )
conn.commit()

# Create an HNSW index for fast search
cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 200);
""")
conn.commit()

# --- Query ---
query = "I want my money back"
query_embedding = str(embed(query))

cur.execute("""
    SELECT content, 1 - (embedding <=> %s::vector) AS similarity
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT 3;
""", (query_embedding, query_embedding))

for content, similarity in cur.fetchall():
    print(f"{similarity:.3f}  {content}")

# Output:
# 0.891  Refund policy for annual plans
# 0.834  How to cancel your subscription
# 0.712  Pricing and plan comparison
```

JavaScript/TypeScript Example

```typescript
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI();
const pool = new Pool({ connectionString: "postgresql://localhost/mydb" });

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    input: text,
    model: "text-embedding-3-small",
  });
  return response.data[0].embedding;
}

async function search(query: string, limit = 5) {
  const queryEmbedding = await embed(query);

  // pgvector uses <=> for cosine distance (1 - similarity)
  const result = await pool.query(
    `SELECT content, 1 - (embedding <=> $1::vector) AS similarity
     FROM documents
     ORDER BY embedding <=> $1::vector
     LIMIT $2`,
    [JSON.stringify(queryEmbedding), limit]
  );
  return result.rows;
}

// Usage
const results = await search("I want my money back");
console.log(results);
// [
//   { content: "Refund policy for annual plans", similarity: 0.891 },
//   { content: "How to cancel your subscription", similarity: 0.834 },
//   ...
// ]
```

Production Considerations

For real applications:

  • Use a vector database (not in-memory arrays) --- see the comparison table above.
  • Implement caching for common queries. Embedding the same query repeatedly wastes time and money.
  • Add hybrid search for exact matches on codes, identifiers, and proper nouns.
  • Use async/batch operations for bulk embedding. OpenAI supports batching up to 2048 inputs per request.
  • Monitor embedding costs. At OpenAI's pricing, embedding 1M tokens with text-embedding-3-small costs roughly $0.02, but this adds up with large document sets and frequent re-indexing.
  • Version your embeddings. If you switch models or update model versions, old and new embeddings are incompatible --- you must re-embed your entire corpus.

Limitations of Vector Search

Vector search is not perfect:

1. Exact Match Failures

"Error code E-1234" might not match "E-1234 error" well because embeddings focus on semantic meaning, not exact strings.

2. Rare Terms

Uncommon product names or technical terms may not embed well if they were underrepresented in the model's training data.

3. Negation Confusion

"I don't want to cancel" and "I want to cancel" have similar embeddings despite opposite meanings. Embedding models encode topic more strongly than logical operators.

4. Context Window Limits

Embedding models have maximum input lengths (typically 512 to 8192 tokens). Text beyond the limit is truncated, potentially losing critical information.

The Fix: Hybrid Search

Combine vector and keyword search:

User Question
    |
+------------------+------------------+
| Vector Search    | Keyword Search   |
| (meaning)        | (exact terms)    |
+------------------+------------------+
    |                    |
    +--------+-----------+
             |
    Combine & Rerank
             |
    Final Results

This gets the best of both worlds---semantic understanding AND exact matching. For a thorough walkthrough of implementation and benchmarks, read Hybrid Search Explained: Best of Both Worlds.

Inside Chatsy's Retrieval Pipeline

Our retrieval-augmented generation pipeline:

  1. Query Expansion: Generate synonyms and related queries
  2. Hybrid Search: Vector + keyword across all queries
  3. Reciprocal Rank Fusion: Combine results intelligently
  4. Reranking: Use a cross-encoder for final relevance scoring
  5. Context Assembly: Select best chunks for the LLM

This multi-stage approach delivers 94%+ relevant answer rates. If you are evaluating whether to use RAG or fine-tune a model for your chatbot, see our comparison of RAG vs fine-tuning for chatbots.
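Step 3, Reciprocal Rank Fusion, is simple enough to show in full: each document scores the sum of 1/(k + rank) over every result list it appears in, with k = 60 as the constant from the original RRF paper:

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from the two retrievers
vector_hits = ["refund-policy", "cancel-guide", "pricing"]
keyword_hits = ["cancel-guide", "error-codes", "refund-policy"]

print(rrf([vector_hits, keyword_hits]))
# ['cancel-guide', 'refund-policy', 'error-codes', 'pricing']
```

Documents that rank well in both lists ("cancel-guide") float to the top, while documents found by only one retriever still survive into the fused ranking.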

Key Takeaways

  1. Vector search understands meaning, not just words
  2. Embeddings are numerical representations generated by transformer models that encode semantic similarity
  3. Cosine similarity measures the angle between vectors to determine relevance
  4. HNSW indexing provides the best speed-accuracy tradeoff for production workloads
  5. Hybrid search combines vector and keyword approaches for comprehensive retrieval
  6. pgvector offers a strong, cost-effective starting point for teams already on PostgreSQL
  7. Chunking matters --- 500-1000 tokens with overlap keeps precision high

Try It Today

Chatsy handles all this complexity for you. Upload your docs, and we automatically:

  • Chunk content optimally
  • Generate embeddings
  • Enable hybrid search
  • Apply query expansion
  • Rerank results

Experience Smart Search ->


Want more? Read about hybrid search and query expansion.


Frequently Asked Questions

What is vector search?

Vector search is a technique that finds relevant content by meaning rather than exact keywords. It converts text into numerical embeddings (vectors) using transformer-based AI models and compares them for similarity using metrics like cosine similarity---so "cancel my plan" matches "terminate subscription" even without shared words. It powers modern AI chatbots, retrieval-augmented generation systems, and semantic search.

How is vector search different from keyword search?

Keyword search matches exact words: "cancel" only finds documents containing "cancel." Vector search understands meaning: "I want my money back" finds "refund policy" even with no shared terms. Vector search handles synonyms, paraphrasing, and typos; keyword search does not. For production, combine both in hybrid search for best results.

What are embeddings?

Embeddings are numerical representations of text produced by transformer-based AI models. Each piece of text becomes a list of numbers (e.g., 1024 or 3072 dimensions) that capture its meaning. Similar meanings produce similar vectors---"cancel" and "terminate" have close embeddings. The model generates these by processing text through self-attention layers and pooling the output into a fixed-length vector. Embeddings enable similarity search across large document collections.

What is vector search used for?

Vector search powers AI chatbots, semantic search in help centers, document retrieval for RAG systems, and recommendation engines. It excels when users phrase questions differently than your documentation (synonyms, paraphrasing) or when you need meaning-based matching across multilingual or long-form content.

Is vector search fast?

Vector search can be fast with proper indexing (HNSW delivers sub-10ms P95 latency at 1M vectors) but typically requires more compute and memory than simple keyword lookup. For production, use a specialized vector database (pgvector, Pinecone, Weaviate, or Qdrant) and combine with keyword search in hybrid retrieval---then rerank for best relevance and performance.

Which vector database should I choose?

If you already use PostgreSQL, start with pgvector --- it avoids adding a new database to your stack and handles millions of vectors on a single node. For zero-ops managed infrastructure, consider Pinecone. For built-in hybrid search features, look at Weaviate. For maximum query performance with advanced quantization, try Qdrant. Read about our migration from Pinecone to pgvector for a real-world comparison.

What is the difference between IVFFlat and HNSW indexing?

IVFFlat partitions vectors into clusters and searches only nearby clusters at query time. It builds quickly but requires tuning the number of clusters to probe. HNSW builds a multi-layer graph connecting approximate nearest neighbors, delivering faster queries and higher recall at the cost of longer build times and more memory. For most production workloads, HNSW is the better default.


#ai #vector-search #embeddings #semantic-search #technical-guide