
Building for Scale: How We Handle Millions of Documents

The architecture and optimizations that allow our platform to scale seamlessly from startup to enterprise.

Sarah Rodriguez
CTO
November 28, 2024
4 min read

When we launched Chatsy, our first customer had 50 documents. Today, our largest enterprise customer has over 2 million. Here's how we built a system that scales seamlessly across that range.

The Challenge

Scaling a RAG (Retrieval-Augmented Generation) system is uniquely challenging because you're scaling three things simultaneously:

  1. Document ingestion — Processing and embedding documents
  2. Vector search — Finding relevant content quickly
  3. AI generation — Producing high-quality responses

Each has different scaling characteristics and bottlenecks.
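
To make those three stages concrete, here is a minimal end-to-end sketch. It is illustrative only (not our production code); the client setup and helper names are assumptions, while the model names match the snippets later in this post.

typescript
// Minimal RAG sketch: ingest, search, generate. Storage and error handling elided.
import OpenAI from "openai";

const openai = new OpenAI();

// 1. Document ingestion: embed a chunk of text so it can be stored for retrieval.
async function embedChunk(text: string): Promise<number[]> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });
  return data[0].embedding;
}

// 2. Vector search: rank stored chunks by cosine distance to the question embedding.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] ** 2;
    normB += b[i] ** 2;
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// 3. AI generation: answer the question using the retrieved chunks as context.
async function answer(question: string, context: string[]): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: `Answer using only this context:\n${context.join("\n")}` },
      { role: "user", content: question },
    ],
  });
  return completion.choices[0].message.content ?? "";
}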

Architecture Overview

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│                    Load Balancer                     │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
        ā–¼                 ā–¼                 ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│   API Pod    │  │   API Pod    │  │   API Pod    │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
        │                 │                 │
        ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                          │
        ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
        ā–¼                 ā–¼                 ā–¼
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”  ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│  PostgreSQL  │  │    Redis     │  │   Trigger    │
│  + pgvector  │  │    Cache     │  │    .dev      │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜  ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Scaling Document Ingestion

The Problem

Document processing is CPU and memory intensive:

  • PDF parsing
  • Text extraction
  • Chunking
  • Embedding generation

A single 100-page PDF can take 30+ seconds to process.

Our Solution: Background Jobs with Trigger.dev

We use Trigger.dev for reliable, scalable background processing:

typescript
export const processDocument = task({
  id: "process-document",
  retry: { maxAttempts: 3 },
  run: async ({ documentId }) => {
    // 1. Extract text
    const text = await extractText(documentId);

    // 2. Chunk intelligently
    const chunks = await chunkText(text, {
      maxTokens: 512,
      overlap: 50,
    });

    // 3. Generate embeddings in batches
    for (const batch of chunk(chunks, 100)) {
      const embeddings = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: batch.map(c => c.text),
      });

      // 4. Store in database
      await prisma.chunk.createMany({
        data: batch.map((chunk, i) => ({
          documentId,
          content: chunk.text,
          embedding: embeddings.data[i].embedding,
        })),
      });
    }
  },
});
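
The chunkText helper isn't shown in this post; a minimal sliding-window version might look like the sketch below, using a rough 4-characters-per-token estimate (a real implementation would use a tokenizer).

typescript
// Hypothetical chunkText: split text into overlapping windows by estimated token count.
function chunkText(
  text: string,
  { maxTokens, overlap }: { maxTokens: number; overlap: number }
): { text: string }[] {
  const charsPerToken = 4; // rough estimate; swap in a tokenizer for accuracy
  const windowSize = maxTokens * charsPerToken;
  const step = (maxTokens - overlap) * charsPerToken;
  const chunks: { text: string }[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ text: text.slice(start, start + windowSize) });
  }
  return chunks;
}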

This approach gives us:

  • Horizontal scaling: Add more workers as needed
  • Reliability: Automatic retries on failure
  • Visibility: Track progress in real-time
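
On the calling side, the upload endpoint just enqueues the task and returns right away. A sketch (the import path and payload shape here are assumptions):

typescript
// Sketch: enqueue the Trigger.dev task from an upload route and return immediately.
// Assumes the processDocument task defined above lives at this (hypothetical) path.
import { processDocument } from "@/trigger/process-document";

export async function POST(req: Request) {
  const { documentId } = await req.json();
  await processDocument.trigger({ documentId }); // queues the job; does not wait for it
  return Response.json({ documentId, status: "processing" });
}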

Scaling Vector Search

The Problem

As document count grows, search latency increases:

  • 10K docs: ~20ms
  • 100K docs: ~50ms
  • 1M docs: ~200ms
  • 10M docs: ~2000ms (unacceptable!)

Our Solution: Partitioned Indexes

We partition vectors by tenant (chatbot):

sql
-- Create partitioned table
-- (the partition key must be part of the primary key)
CREATE TABLE chunks (
  id UUID,
  chatbot_id UUID NOT NULL,
  content TEXT,
  embedding vector(1536),
  PRIMARY KEY (id, chatbot_id)
) PARTITION BY HASH (chatbot_id);

-- Create partitions
CREATE TABLE chunks_p0 PARTITION OF chunks
  FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 16 partitions

-- Index each partition
CREATE INDEX ON chunks_p0
  USING ivfflat (embedding vector_cosine_ops);

Now queries only scan relevant partitions:

sql
SELECT content, embedding <=> $1 AS distance
FROM chunks
WHERE chatbot_id = $2  -- Only scans relevant partition
ORDER BY distance
LIMIT 10;

Result: Consistent ~50ms queries regardless of total document count.
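
From the application, the same query can be issued as a parameterized raw query. A sketch, assuming Prisma and a pgvector column; the helper name is ours:

typescript
import { PrismaClient } from "@prisma/client";

const prisma = new PrismaClient();

// Hypothetical helper: similarity search scoped to one chatbot so Postgres
// can prune down to a single hash partition.
export async function searchChunks(chatbotId: string, queryEmbedding: number[], limit = 10) {
  const embedding = `[${queryEmbedding.join(",")}]`; // pgvector text literal
  return prisma.$queryRaw<{ content: string; distance: number }[]>`
    SELECT content, embedding <=> ${embedding}::vector AS distance
    FROM chunks
    WHERE chatbot_id = ${chatbotId}::uuid
    ORDER BY distance
    LIMIT ${limit}
  `;
}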

Scaling AI Generation

The Problem

LLM API calls are:

  • Expensive (~$0.01-0.10 per request)
  • Slow (1-10 seconds)
  • Rate-limited

Our Solutions

1. Response Caching

typescript
const cacheKey = hash(question + contextChunks);
const cached = await redis.get(cacheKey);
if (cached) return cached;

const response = await generateResponse(question, contextChunks);
await redis.setex(cacheKey, 3600, response);
return response;
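
One subtlety: hash needs a stable serialization of the retrieved chunks (concatenating an array onto a string does not give one). A possible helper using Node's crypto module; the name cacheKeyFor and the chunk shape are assumptions:

typescript
import { createHash } from "crypto";

// Hypothetical cacheKeyFor: derive a stable Redis key from the question
// and the IDs of the retrieved chunks.
function cacheKeyFor(question: string, contextChunks: { id: string }[]): string {
  const payload = JSON.stringify({
    question,
    chunkIds: contextChunks.map((c) => c.id).sort(),
  });
  return "answer:" + createHash("sha256").update(payload).digest("hex");
}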

2. Streaming Responses

Don't wait for the full response; stream tokens to the client as they arrive:

typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  controller.enqueue(chunk.choices[0]?.delta?.content || "");
}
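
The controller above comes from a ReadableStream. A route handler wiring it up might look like this sketch (assuming the same openai client as the other snippets):

typescript
// Sketch: stream tokens back to the browser from a route handler.
export async function POST(req: Request) {
  const { messages } = await req.json();

  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  });

  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        controller.enqueue(encoder.encode(chunk.choices[0]?.delta?.content || ""));
      }
      controller.close();
    },
  });

  return new Response(body, {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}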

3. Model Routing

Use cheaper models for simple questions:

typescript
const complexity = await classifyComplexity(question);
const model = complexity === "simple" ? "gpt-4o-mini" : "gpt-4o";
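
classifyComplexity isn't shown in this post; one cheap way to implement it is to ask the small model to label the question before routing. A sketch (assuming the same openai client as above):

typescript
// Hypothetical classifyComplexity: label the question with the cheap model,
// then route generation accordingly.
async function classifyComplexity(question: string): Promise<"simple" | "complex"> {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: 'Label the user question as "simple" or "complex". Reply with one word.',
      },
      { role: "user", content: question },
    ],
  });
  const label = completion.choices[0].message.content?.trim().toLowerCase();
  return label === "complex" ? "complex" : "simple";
}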

Results

Metric                  Before      After
Ingestion Speed         1 doc/sec   50 docs/sec
Search Latency (P99)    2000ms      80ms
API Response Time       8s          2s
Monthly Infrastructure  $5,000      $800

Key Takeaways

  1. Partition early — Retrofitting partitioning is painful
  2. Background everything — Never block the user on heavy operations
  3. Cache aggressively — LLM calls are expensive; reuse results
  4. Stream responses — Perceived latency matters as much as actual latency
  5. Monitor everything — You can't optimize what you can't measure

Building for scale isn't about handling today's load — it's about handling tomorrow's without rewriting everything.


Tags: #architecture #scale #performance #engineering
