Building for Scale: How We Handle Millions of Documents
The architecture and optimizations that allow our platform to scale seamlessly from startup to enterprise.

When we launched Chatsy, our first customer had 50 documents. Today, our largest enterprise customer has over 2 million. Here's how we built a system that scales seamlessly across that range.
The Challenge
Scaling a RAG (Retrieval-Augmented Generation) system is uniquely challenging because you're scaling three things simultaneously:
- Document ingestion: Processing and embedding documents
- Vector search: Finding relevant content quickly
- AI generation: Producing high-quality responses
Each has different scaling characteristics and bottlenecks.
Architecture Overview
```
                 ┌───────────────────────────┐
                 │       Load Balancer       │
                 └─────────────┬─────────────┘
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │   API Pod   │     │   API Pod   │     │   API Pod   │
    └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
           └───────────────────┼───────────────────┘
           ┌───────────────────┼───────────────────┐
           ▼                   ▼                   ▼
    ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
    │ PostgreSQL  │     │    Redis    │     │   Trigger   │
    │ + pgvector  │     │    Cache    │     │    .dev     │
    └─────────────┘     └─────────────┘     └─────────────┘
```
Scaling Document Ingestion
The Problem
Document processing is CPU and memory intensive:
- PDF parsing
- Text extraction
- Chunking
- Embedding generation
A single 100-page PDF can take 30+ seconds to process.
Our Solution: Background Jobs with Trigger.dev
We use Trigger.dev for reliable, scalable background processing:
```typescript
export const processDocument = task({
  id: "process-document",
  retry: { maxAttempts: 3 },
  run: async ({ documentId }) => {
    // 1. Extract text
    const text = await extractText(documentId);

    // 2. Chunk intelligently
    const chunks = await chunkText(text, {
      maxTokens: 512,
      overlap: 50,
    });

    // 3. Generate embeddings in batches
    for (const batch of chunk(chunks, 100)) {
      const embeddings = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: batch.map(c => c.text),
      });

      // 4. Store in database
      await prisma.chunk.createMany({
        data: batch.map((c, i) => ({
          documentId,
          content: c.text,
          embedding: embeddings.data[i].embedding,
        })),
      });
    }
  },
});
```
This approach gives us:
- Horizontal scaling: Add more workers as needed
- Reliability: Automatic retries on failure
- Visibility: Track progress in real-time
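To illustrate the "never block the user" side of this, here is a minimal sketch of how an upload endpoint might hand work off to the task above using Trigger.dev v3's `tasks.trigger`. The route shape, import path, and response body are assumptions for the example, not our production code:

```typescript
import { tasks } from "@trigger.dev/sdk/v3";
import type { processDocument } from "@/trigger/process-document"; // hypothetical path

// Hypothetical upload endpoint: record the document, then enqueue the heavy
// processing and return immediately instead of blocking the request.
export async function POST(req: Request) {
  const { documentId } = await req.json();

  // Enqueue the background task; parsing and embedding happen on the workers.
  await tasks.trigger<typeof processDocument>("process-document", { documentId });

  return Response.json({ status: "processing" });
}
```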
Scaling Vector Search
The Problem
As document count grows, search latency increases:
- 10K docs: ~20ms
- 100K docs: ~50ms
- 1M docs: ~200ms
- 10M docs: ~2000ms (unacceptable!)
Our Solution: Partitioned Indexes
We partition vectors by tenant (chatbot):
```sql
-- Create partitioned table
-- (on a partitioned table, the primary key must include the partition key)
CREATE TABLE chunks (
  id UUID NOT NULL,
  chatbot_id UUID NOT NULL,
  content TEXT,
  embedding vector(1536),
  PRIMARY KEY (id, chatbot_id)
) PARTITION BY HASH (chatbot_id);

-- Create partitions
CREATE TABLE chunks_p0 PARTITION OF chunks
  FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 16 partitions

-- Index each partition
CREATE INDEX ON chunks_p0 USING ivfflat (embedding vector_cosine_ops);
```
Now a query for a given chatbot scans only its partition:
```sql
SELECT content, embedding <=> $1 AS distance
FROM chunks
WHERE chatbot_id = $2  -- Only scans the matching partition
ORDER BY distance
LIMIT 10;
```
Result: Consistent ~50ms queries regardless of total document count.
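On the application side, the same query can be issued through Prisma's raw SQL escape hatch. The sketch below is illustrative only: the `searchChunks` name, the return shape, and the assumption that query embeddings arrive as plain number arrays are ours, not part of the codebase shown above.

```typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Hypothetical helper: partition-pruned similarity search for one chatbot.
async function searchChunks(chatbotId: string, queryEmbedding: number[], limit = 10) {
  // pgvector expects a vector literal such as '[0.1,0.2,...]'
  const vector = `[${queryEmbedding.join(",")}]`;

  return prisma.$queryRaw<{ content: string; distance: number }[]>`
    SELECT content, embedding <=> ${vector}::vector AS distance
    FROM chunks
    WHERE chatbot_id = ${chatbotId}::uuid
    ORDER BY distance
    LIMIT ${limit}
  `;
}
```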
Scaling AI Generation
The Problem
LLM API calls are:
- Expensive (~$0.01-0.10 per request)
- Slow (1-10 seconds)
- Rate-limited
Our Solutions
1. Response Caching
```typescript
const cacheKey = hash(question + contextChunks);
const cached = await redis.get(cacheKey);
if (cached) return cached;

const response = await generateResponse(question, contextChunks);
await redis.setex(cacheKey, 3600, response);
return response;
```
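The `hash` helper is not shown above; a minimal version, assuming Node's built-in `crypto` module, might look like this:

```typescript
import { createHash } from "node:crypto";

// Assumed helper: derive a stable cache key from the question plus the
// retrieved context. Serializing the chunks (e.g. with JSON.stringify)
// keeps the key sensitive to exactly which chunks were retrieved.
function hash(input: string): string {
  return createHash("sha256").update(input).digest("hex");
}
```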
2. Streaming Responses
Don't wait for the full response; stream tokens to the client:
```typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  controller.enqueue(chunk.choices[0]?.delta?.content || "");
}
```
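The `controller` above comes from the surrounding ReadableStream. Here is a sketch of how the pieces might be wired into a route handler, assuming the same pre-configured `openai` client as in the earlier snippets; the route shape is hypothetical:

```typescript
// Hypothetical route handler: wrap the OpenAI stream in a ReadableStream so
// tokens reach the client as they arrive.
export async function POST(req: Request) {
  const { messages } = await req.json();
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      const completion = await openai.chat.completions.create({
        model: "gpt-4o",
        messages,
        stream: true,
      });

      for await (const chunk of completion) {
        // Encode each token delta before enqueueing it on the response stream.
        controller.enqueue(encoder.encode(chunk.choices[0]?.delta?.content || ""));
      }
      controller.close();
    },
  });

  return new Response(stream, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```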
3. Model Routing
Use cheaper models for simple questions:
```typescript
const complexity = await classifyComplexity(question);
const model = complexity === "simple" ? "gpt-4o-mini" : "gpt-4o";
```
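`classifyComplexity` is left undefined in the snippet. One inexpensive way to implement it, sketched here with a hypothetical prompt and the same `openai` client as above, is to let a small model label the question itself:

```typescript
// Assumed helper: ask a cheap model for a one-word complexity label.
async function classifyComplexity(question: string): Promise<"simple" | "complex"> {
  const result = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content:
          "Classify the user's question as 'simple' (factual, single-step) or " +
          "'complex' (multi-step, reasoning-heavy). Reply with exactly one word.",
      },
      { role: "user", content: question },
    ],
    max_tokens: 5,
  });

  const label = result.choices[0]?.message?.content?.trim().toLowerCase();
  // Default to the stronger model when the label is ambiguous.
  return label === "simple" ? "simple" : "complex";
}
```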
Results
| Metric | Before | After |
|---|---|---|
| Ingestion Speed | 1 doc/sec | 50 docs/sec |
| Search Latency (P99) | 2000ms | 80ms |
| API Response Time | 8s | 2s |
| Monthly Infrastructure Cost | $5,000 | $800 |
Key Takeaways
- Partition early: Retrofitting partitioning is painful
- Background everything: Never block the user on heavy operations
- Cache aggressively: LLM calls are expensive; reuse results
- Stream responses: Perceived latency matters as much as actual latency
- Monitor everything: You can't optimize what you can't measure
Building for scale isn't about handling today's load; it's about handling tomorrow's without rewriting everything.