Building for Scale: Handling Millions of Docs
How we scaled our RAG system from 50 to 2 million documents using pgvector partitioning, background jobs, and response caching — cutting costs 84%.

When we launched Chatsy, our first customer had 50 documents. Today, our largest enterprise customer has over 2 million. Here's how we built a system that scales seamlessly across that range.
TL;DR:
- Scaling a RAG system means solving three bottlenecks simultaneously: document ingestion, vector search, and AI generation.
- Partitioned indexes by tenant keep vector search at a consistent ~50ms regardless of total document count, and background job processing handles ingestion at 50 docs/sec.
- Response caching, streaming, and smart model routing cut API response time from 8s to 2s and infrastructure costs from $5,000 to $800/month.
- Key lesson: partition early, background everything, cache aggressively, and stream responses to minimize perceived latency.
The Challenge
Scaling a RAG (Retrieval-Augmented Generation) system is uniquely challenging because you're scaling three things simultaneously:
- Document ingestion — Processing and embedding documents
- Vector search — Finding relevant content quickly
- AI generation — Producing high-quality responses
Each has different scaling characteristics and bottlenecks. A naive implementation that works at 1,000 documents will fall apart at 100,000. And what works at 100K will grind to a halt at 2M. The strategies you need shift at each order of magnitude.
Architecture Overview
Our production architecture separates concerns into three layers: request handling, data persistence, and background processing. Each layer scales independently.
```
┌─────────────────────────────────────────────────────┐
│                   Load Balancer                     │
│              (health checks, SSL term)              │
└─────────────────────────────────────────────────────┘
                         │
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   API Pod    │  │   API Pod    │  │   API Pod    │
│ (stateless)  │  │ (stateless)  │  │ (stateless)  │
└──────────────┘  └──────────────┘  └──────────────┘
       │                 │                 │
       └─────────────────┼─────────────────┘
                         │
       ┌─────────────────┼─────────────────┐
       ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  PostgreSQL  │  │    Redis     │  │   Trigger    │
│  + pgvector  │  │    Cache     │  │    .dev      │
└──────────────┘  └──────────────┘  └──────────────┘
```
Why This Topology Works
- API pods are stateless. Any pod can handle any request. Scaling horizontally means adding pods behind the load balancer.
- PostgreSQL with pgvector handles both relational data and vector embeddings in one database, eliminating the need to synchronize a separate vector store.
- Redis sits between the API and both PostgreSQL and OpenAI, caching embeddings, search results, and generated responses.
- Trigger.dev manages all background work: document processing, batch embedding, scheduled cleanup, and webhook delivery.
The key architectural decision was keeping everything in PostgreSQL rather than running a separate vector database. This simplifies operations, reduces infrastructure cost, and lets us use standard database tooling for backups, migrations, and monitoring. Our migration from Pinecone to pgvector covers the reasoning in depth.
Scaling Document Ingestion
The Ingestion Bottleneck
Document processing is CPU and memory intensive:
- PDF parsing and text extraction
- Content cleaning and normalization
- Intelligent chunking with overlap
- Embedding generation (API calls)
- Database writes with vector indexing
A single 100-page PDF can take 30+ seconds to process. When a customer uploads 500 documents at once, that's over 4 hours of sequential processing. Users expect their content to be searchable in minutes, not hours.
Our Solution: Background Jobs with Trigger.dev
We use Trigger.dev for reliable, scalable background processing:
```typescript
export const processDocument = task({
  id: "process-document",
  retry: { maxAttempts: 3 },
  run: async ({ documentId }) => {
    // 1. Extract text
    const text = await extractText(documentId);

    // 2. Chunk intelligently
    const chunks = await chunkText(text, {
      maxTokens: 512,
      overlap: 50,
    });

    // 3. Generate embeddings in batches
    for (const batch of chunk(chunks, 100)) {
      const embeddings = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: batch.map((c) => c.text),
      });

      // 4. Store in database
      await prisma.chunk.createMany({
        data: batch.map((c, i) => ({
          documentId,
          content: c.text,
          embedding: embeddings.data[i].embedding,
        })),
      });
    }
  },
});
```
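The `chunkText` step is where most of the retrieval quality is won or lost. A minimal sketch of overlap chunking, using word counts as a rough stand-in for tokens (the real implementation uses a proper tokenizer):

```typescript
interface Chunk {
  text: string;
}

// Split text into overlapping chunks. Words approximate tokens here;
// production code would count real tokens (e.g. with a tokenizer library).
export function chunkText(
  text: string,
  opts: { maxTokens: number; overlap: number }
): Chunk[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: Chunk[] = [];
  // Advance by chunk size minus overlap so consecutive chunks share context.
  const step = opts.maxTokens - opts.overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push({ text: words.slice(start, start + opts.maxTokens).join(" ") });
    if (start + opts.maxTokens >= words.length) break; // final chunk reached
  }
  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing answers that happen to sit at a boundary.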
This approach gives us:
- Horizontal scaling: Add more workers as needed
- Reliability: Automatic retries on failure
- Visibility: Track progress in real-time
Batch Processing for Bulk Uploads
When customers import hundreds of documents at once, we use a batch orchestration task that fans out work across multiple workers:
```typescript
export const batchImport = task({
  id: "batch-import",
  run: async ({ chatbotId, documentIds }) => {
    // Process in parallel batches of 10
    const batches = chunk(documentIds, 10);
    let processed = 0;
    for (const batch of batches) {
      await Promise.all(
        batch.map((docId) =>
          processDocument.triggerAndWait({ documentId: docId })
        )
      );
      // Update cumulative progress for the UI
      processed += batch.length;
      await updateImportProgress(chatbotId, {
        processed,
        total: documentIds.length,
      });
    }
  },
});
```
At 10 concurrent workers, a 500-document upload completes in roughly 25 minutes instead of 4 hours. The progress updates feed into a real-time UI so customers can see exactly where their import stands.
Queue Management and Backpressure
At scale, you need to handle bursts without overwhelming downstream services. Our queue strategy uses three priority levels:
- High priority -- Real-time document updates (customer edits an article, expects immediate re-indexing)
- Normal priority -- Standard uploads and imports
- Low priority -- Scheduled re-indexing, cleanup tasks, and bulk migrations
When the embedding API rate limit approaches, we apply backpressure by slowing the dequeue rate rather than failing jobs. Trigger.dev's built-in concurrency controls handle this cleanly.
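The backpressure logic reduces to a small function: given how much rate-limit headroom remains, decide how long to pause before dequeuing the next job. A sketch (the thresholds and delays here are illustrative, not our production values):

```typescript
// Delay before pulling the next job, scaled by how close we are to the
// embedding API's rate limit. Work is never rejected; it only slows down.
export function dequeueDelayMs(
  remainingRequests: number,
  limitPerMinute: number
): number {
  const headroom = remainingRequests / limitPerMinute; // 1.0 = fully free
  if (headroom > 0.5) return 0;      // plenty of headroom: full speed
  if (headroom > 0.2) return 250;    // getting close: gentle slowdown
  if (headroom > 0.05) return 1_000; // nearly throttled: back off hard
  return 5_000;                      // at the limit: crawl, don't fail
}
```

The key property is that the function is monotone: less headroom never produces a shorter delay, so the queue drains smoothly instead of oscillating between bursts and stalls.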
Scaling Vector Search
The Search Latency Problem
As document count grows, search latency increases. Without partitioning, on a single flat index:
- 10K chunks: ~20ms
- 100K chunks: ~50ms
- 1M chunks: ~200ms
- 10M chunks: ~2000ms (unacceptable for real-time chat)
The reason is straightforward: IVFFlat indexes scan a fixed percentage of clusters. More vectors means more data to scan per query.
Our Solution: Partitioned Indexes
We partition vectors by tenant (chatbot), a strategy refined during our migration from Pinecone to pgvector:
```sql
-- Create partitioned table
CREATE TABLE chunks (
  id UUID PRIMARY KEY,
  chatbot_id UUID NOT NULL,
  content TEXT,
  embedding vector(1536)
) PARTITION BY HASH (chatbot_id);

-- Create partitions
CREATE TABLE chunks_p0 PARTITION OF chunks
  FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 16 partitions

-- Index each partition
CREATE INDEX ON chunks_p0 USING ivfflat (embedding vector_cosine_ops);
```
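Writing out 16 near-identical partition statements by hand is error-prone, so the repetitive DDL can be generated in one PL/pgSQL block; a sketch, assuming the `chunks` table above:

```sql
-- Generate all 16 partitions plus a per-partition IVFFlat index
DO $$
BEGIN
  FOR i IN 0..15 LOOP
    EXECUTE format(
      'CREATE TABLE chunks_p%s PARTITION OF chunks FOR VALUES WITH (MODULUS 16, REMAINDER %s)',
      i, i);
    EXECUTE format(
      'CREATE INDEX ON chunks_p%s USING ivfflat (embedding vector_cosine_ops)',
      i);
  END LOOP;
END $$;
```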
Now queries only scan relevant partitions:
```sql
SELECT content, embedding <=> $1 AS distance
FROM chunks
WHERE chatbot_id = $2  -- Only scans relevant partition
ORDER BY distance
LIMIT 10;
```
Result: Consistent ~50ms queries regardless of total document count.
Database Sharding Strategy
At 2M+ documents, even partitioned indexes benefit from careful tuning. Our sharding approach follows three principles:
- Tenant isolation via hash partitioning. Each chatbot's data lands in one of 16 hash partitions. This prevents large tenants from degrading search for small ones.
- Index tuning per partition. The IVFFlat `lists` parameter scales with the number of vectors in each partition. We run a nightly job that checks partition sizes and rebuilds indexes when the vector count crosses a threshold (e.g., 100K vectors triggers an increase from 100 to 250 lists).
- Connection pooling. PgBouncer sits between the API pods and PostgreSQL, keeping the connection count manageable even under high concurrency. Without it, 50 concurrent API pods would exhaust PostgreSQL's max connections.
```
┌──────────┐     ┌────────────┐     ┌────────────────┐
│ API Pods │ ──▶ │ PgBouncer  │ ──▶ │  PostgreSQL    │
│  (50+)   │     │ (pool: 20) │     │ (max_conn: 50) │
└──────────┘     └────────────┘     └────────────────┘
```
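The nightly index-tuning job boils down to a small heuristic. A sketch following the common pgvector guidance (`lists` ≈ rows/1000 below ~1M vectors, `sqrt(rows)` above; the rebuild threshold here is illustrative):

```typescript
// Pick an IVFFlat `lists` value from a partition's row count, using the
// widely cited pgvector heuristic: rows/1000 below ~1M vectors, sqrt above.
export function targetLists(rowCount: number): number {
  if (rowCount < 1_000_000) return Math.max(10, Math.round(rowCount / 1000));
  return Math.round(Math.sqrt(rowCount));
}

// Rebuild only when the current setting has drifted more than 2x from the
// target, since reindexing a large partition is expensive.
export function needsRebuild(currentLists: number, rowCount: number): boolean {
  const target = targetLists(rowCount);
  return target / currentLists > 2 || currentLists / target > 2;
}
```

Hysteresis in `needsRebuild` matters: without the 2x band, a partition sitting near a threshold would trigger a rebuild every night.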
HNSW vs IVFFlat: When to Switch
We started with IVFFlat because it's faster to build and works well at moderate scale. As partition sizes grow, HNSW becomes attractive for its better recall-latency trade-off:
| Factor | IVFFlat | HNSW |
|---|---|---|
| Build time | Fast (minutes) | Slow (hours at 1M+) |
| Query latency | Good with tuning | Consistently fast |
| Memory usage | Lower | Higher (~2x) |
| Insert performance | Fast | Moderate |
| Best for | < 500K vectors/partition | > 500K vectors/partition |
For most of our tenants, IVFFlat is the right choice. We reserve HNSW for enterprise customers with 500K+ chunks where the marginal query latency improvement justifies the memory cost.
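Switching a large partition over is a single index swap. A sketch using pgvector's HNSW build parameters (the index names and parameter values here are illustrative, not from our schema):

```sql
-- Build the HNSW index alongside the old one, then drop IVFFlat.
-- m and ef_construction trade build time and memory for recall.
CREATE INDEX CONCURRENTLY chunks_p0_hnsw
  ON chunks_p0 USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

DROP INDEX CONCURRENTLY chunks_p0_ivfflat;
```

Building `CONCURRENTLY` keeps the partition queryable during the hours-long HNSW build, which is what makes the migration viable for a live enterprise tenant.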
The Caching Layer
Caching is the single biggest cost reducer in our stack. We cache at three levels:
Level 1: Embedding Cache (Redis)
Generating embeddings costs money and adds latency. When the same text chunk is re-indexed (e.g., a document is re-imported without changes), we skip the API call:
```typescript
async function getEmbedding(text: string): Promise<number[]> {
  const cacheKey = `emb:${hash(text)}`;
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const result = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  const embedding = result.data[0].embedding;

  await redis.setex(cacheKey, 86400 * 7, JSON.stringify(embedding)); // 7 days
  return embedding;
}
```
This alone cut our embedding API costs by 35% because many customers re-import documents during testing and setup.
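The `hash` used in these cache keys is just a content hash. A sketch with Node's built-in crypto module (the JSON-stringify fallback is our convention for non-string inputs such as embedding vectors):

```typescript
import { createHash } from "node:crypto";

// Stable cache-key component. Non-string inputs (e.g. an embedding vector)
// are JSON-stringified before hashing; sha256 keeps keys short and
// collision-resistant regardless of input size.
export function hash(input: unknown): string {
  const text = typeof input === "string" ? input : JSON.stringify(input);
  return createHash("sha256").update(text).digest("hex").slice(0, 32);
}
```

Hashing the content rather than, say, a document ID is what makes the cache survive re-imports: identical text always produces an identical key.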
Level 2: Search Result Cache (Redis)
Identical queries to the same chatbot often return the same results. We cache the retrieved chunks for a short TTL:
```typescript
const searchCacheKey = `search:${chatbotId}:${hash(queryEmbedding)}`;
const cachedResults = await redis.get(searchCacheKey);
if (cachedResults) return JSON.parse(cachedResults);

const results = await vectorSearch(chatbotId, queryEmbedding);
await redis.setex(searchCacheKey, 300, JSON.stringify(results)); // 5 min
```
The short TTL (5 minutes) ensures freshness while still catching the common pattern of multiple users asking the same question within a short window.
Level 3: Response Cache (Redis)
The most impactful cache. Full LLM responses for identical question + context combinations:
```typescript
const responseCacheKey = `resp:${hash(question + contextChunks)}`;
const cached = await redis.get(responseCacheKey);
if (cached) return cached;

const response = await generateResponse(question, contextChunks);
await redis.setex(responseCacheKey, 3600, response); // 1 hour
return response;
```
Hit rates vary by use case: FAQ-heavy chatbots see 40-60% cache hits. Support bots with unique account-specific questions see 10-15%. Even 10% saves meaningful cost at scale.
Scaling AI Generation
The Generation Cost Problem
LLM API calls are:
- Expensive (~$0.01-0.10 per request)
- Slow (1-10 seconds)
- Rate-limited
Our Solutions
1. Streaming Responses
Don't wait for the full response -- stream tokens to the client:
```typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  controller.enqueue(chunk.choices[0]?.delta?.content || "");
}
```
Streaming cuts perceived latency dramatically. The user sees the first token in ~300ms instead of waiting 2-5 seconds for the complete response. This makes a bigger difference to user satisfaction than raw speed improvements.
2. Model Routing
Use cheaper models for simple questions:
```typescript
const complexity = await classifyComplexity(question);
const model = complexity === "simple" ? "gpt-4o-mini" : "gpt-4o";
```
In practice, 60-70% of support questions are straightforward enough for GPT-4o-mini. Routing saves roughly 50% on LLM costs compared to using GPT-4o for everything.
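The `classifyComplexity` step can be surprisingly cheap. A heuristic-first sketch (the keywords and thresholds are ours, for illustration; a real implementation might add an LLM-based classifier for the ambiguous middle ground):

```typescript
type Complexity = "simple" | "complex";

// Cheap lexical heuristics: short, single-intent questions go to the small
// model. Anything ambiguous is treated as complex so a hard question is
// never routed to the cheaper model by mistake.
export function classifyComplexity(question: string): Complexity {
  const q = question.trim().toLowerCase();
  const words = q.split(/\s+/).filter(Boolean);
  const multiStep = /\b(and|then|compare|versus|vs\.?|why|explain)\b/.test(q);
  if (words.length <= 12 && !multiStep) return "simple";
  return "complex";
}
```

Erring toward "complex" costs a little money; erring toward "simple" costs answer quality, which is why the default branch returns "complex".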
3. Rate Limit Management
At scale, you'll hit OpenAI rate limits. We handle this with a token bucket implementation that queues requests when approaching limits and falls back to a secondary model provider if the primary is throttled.
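A minimal token bucket of the kind described (capacity and refill rate here are illustrative):

```typescript
// Token bucket: `capacity` tokens, refilled at `ratePerSec`. acquire()
// returns how many milliseconds the caller should wait before sending
// (0 means send immediately).
export class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private capacity: number,
    private ratePerSec: number,
    now = Date.now()
  ) {
    this.tokens = capacity;
    this.last = now;
  }

  acquire(now = Date.now()): number {
    // Refill based on elapsed time, capped at capacity.
    const elapsed = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.ratePerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0; // under the limit: send now
    }
    // Over the limit: wait until one token has refilled.
    return Math.ceil(((1 - this.tokens) / this.ratePerSec) * 1000);
  }
}
```

Because `acquire` returns a wait time instead of throwing, callers queue naturally: a request that can't go now simply sleeps and retries, which is what turns a hard rate limit into backpressure.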
Results
| Metric | Before | After |
|---|---|---|
| Ingestion Speed | 1 doc/sec | 50 docs/sec |
| Search Latency (P99) | 2000ms | 80ms |
| API Response Time | 8s | 2s |
| Monthly Infrastructure | $5,000 | $800 |
Monitoring and Alerting
You cannot optimize what you cannot measure. Our monitoring setup tracks four categories of metrics:
Infrastructure Metrics
| Metric | Tool | Alert Threshold |
|---|---|---|
| API response time (P50, P95, P99) | Application logs | P99 > 3s |
| Database connection pool usage | PgBouncer stats | > 80% utilization |
| Redis memory usage | Redis INFO | > 75% of max memory |
| Background job queue depth | Trigger.dev dashboard | > 1,000 pending jobs |
Business Metrics
| Metric | What It Tells You |
|---|---|
| Docs processed per minute | Ingestion pipeline health |
| Search latency by tenant | Whether a large tenant is degrading performance |
| Cache hit rate by level | Whether your caching strategy is effective |
| LLM cost per conversation | Whether model routing is working |
Alerting Strategy
We use a tiered alerting approach to avoid fatigue:
- P1 (page immediately): Search latency P99 > 3s, API error rate > 5%, database connection exhaustion
- P2 (Slack alert, respond in 1 hour): Ingestion queue depth > 5,000, cache hit rate drops below 20%, LLM error rate > 2%
- P3 (daily review): Cost anomalies, slow partition index builds, elevated retry rates
The single most useful alert we added was on per-tenant search latency. When a customer imports a massive document set, their partition's index can become under-tuned. The alert catches this before the customer notices.
Real Benchmarks
Here is what our production numbers look like across different scale tiers:
| Scale Tier | Docs | Chunks | Avg Search (P50) | Search (P99) | Ingestion Rate |
|---|---|---|---|---|---|
| Starter | < 1K | < 10K | 8ms | 25ms | 5 docs/sec |
| Growth | 1K-50K | 10K-500K | 15ms | 45ms | 20 docs/sec |
| Scale | 50K-500K | 500K-5M | 25ms | 65ms | 50 docs/sec |
| Enterprise | 500K-2M+ | 5M-20M+ | 35ms | 80ms | 50 docs/sec |
These numbers hold because each tenant's vectors are isolated in their own partition. A single enterprise tenant with 2M documents does not affect the search latency of a starter tenant with 500 documents.
Key Takeaways
- Partition early -- Retrofitting partitioning is painful. Design for tenant isolation from day one.
- Background everything -- Never block the user on heavy operations. Document processing, embedding generation, and index rebuilds all belong in background jobs.
- Cache at every layer -- Embedding cache, search cache, and response cache each target a different bottleneck. Combined, they cut costs by 60-70%.
- Stream responses -- Perceived latency matters as much as actual latency. First-token time of 300ms feels instant even if the full response takes 3 seconds.
- Monitor per-tenant -- Aggregate metrics hide problems. A single large tenant can degrade their own experience without affecting averages.
- Right-size your indexes -- IVFFlat works for most tenants. Reserve HNSW for the largest partitions where recall-latency trade-offs justify the memory cost.
- Plan for connection pooling -- PgBouncer (or equivalent) is not optional at scale. Fifty API pods opening direct connections will exhaust PostgreSQL.
Building for scale isn't about handling today's load -- it's about handling tomorrow's without rewriting everything. Our adoption of techniques like hybrid search plays a key role in that scalability.
Frequently Asked Questions
When should you plan for scale?
Plan for scale from the start if you expect significant growth. Retrofitting partitioning is painful — Chatsy's architecture uses tenant-partitioned indexes from day one, which keeps vector search at ~50ms regardless of document count. If you're building a RAG system that could grow from hundreds to millions of documents, design for partitioning, background jobs, and caching from the beginning.
What are the biggest scaling challenges for RAG systems?
The three main bottlenecks are document ingestion (CPU and memory intensive — a single 100-page PDF can take 30+ seconds), vector search latency (which degrades from ~20ms at 10K docs to ~2000ms at 10M docs without partitioning), and AI generation cost and latency (LLM calls are expensive and slow). Each requires different solutions: background jobs for ingestion, partitioned indexes for search, and caching plus model routing for generation.
What infrastructure is required for scale?
Chatsy's scaled architecture uses load-balanced API pods, PostgreSQL with pgvector for partitioned vector storage, Redis for response caching, and Trigger.dev for background job processing. The key is horizontal scaling of workers for ingestion (50 docs/sec), partitioned tables so queries only scan relevant tenant data, and Redis to cache LLM responses and avoid redundant API calls.
What are the cost implications of scaling?
Proper scaling can dramatically reduce costs. Chatsy cut monthly infrastructure from $5,000 to $800 by implementing response caching, streaming, and smart model routing — reducing API response time from 8s to 2s. Response caching reuses results for identical questions; model routing uses cheaper models (e.g., GPT-4o-mini) for simple queries. The trade-off is upfront engineering investment for long-term savings.
How do you monitor systems at scale?
You can't optimize what you can't measure. Monitor ingestion throughput (docs/sec), search latency (P99), API response time, and infrastructure costs. Chatsy's results show the value: tracking these metrics revealed the ingestion bottleneck (1 doc/sec → 50 docs/sec with background jobs), search latency spike (2000ms → 80ms with partitioning), and API cost (8s → 2s with caching and routing).