When we launched Chatsy, our first customer had 50 documents. Today, our largest enterprise customer has over 2 million. Here's how we built a system that scales seamlessly across that range.

The Challenge

Scaling a RAG (Retrieval-Augmented Generation) system is uniquely challenging because you're scaling three things simultaneously:

Document ingestion — Processing and embedding documents
Vector search — Finding relevant content quickly
AI generation — Producing high-quality responses

Each has different scaling characteristics and bottlenecks.

Architecture Overview

┌─────────────────────────────────────────────────────┐
│                    Load Balancer                     │
└─────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   API Pod    │  │   API Pod    │  │   API Pod    │
└──────────────┘  └──────────────┘  └──────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  PostgreSQL  │  │    Redis     │  │   Trigger    │
│  + pgvector  │  │    Cache     │  │    .dev      │
└──────────────┘  └──────────────┘  └──────────────┘

Scaling Document Ingestion

The Problem

Document processing is CPU and memory intensive:

PDF parsing
Text extraction
Chunking
Embedding generation

A single 100-page PDF can take 30+ seconds to process.

Our Solution: Background Jobs with Trigger.dev

We use Trigger.dev for reliable, scalable background processing:

typescript
export const processDocument = task({
  id: "process-document",
  retry: { maxAttempts: 3 },
  run: async ({ documentId }) => {
    // 1. Extract text
    const text = await extractText(documentId);
    
    // 2. Chunk intelligently
    const chunks = await chunkText(text, {
      maxTokens: 512,
      overlap: 50,
    });
    
    // 3. Generate embeddings in batches
    for (const batch of chunk(chunks, 100)) {
      const embeddings = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: batch.map(c => c.text),
      });
      
      // 4. Store in database
      await prisma.chunk.createMany({
        data: batch.map((chunk, i) => ({
          documentId,
          content: chunk.text,
          embedding: embeddings.data[i].embedding,
        })),
      });
    }
  },
});

This approach gives us:

Horizontal scaling: Add more workers as needed
Reliability: Automatic retries on failure
Visibility: Track progress in real-time

Scaling Vector Search

The Problem

As document count grows, search latency increases:

10K docs: ~20ms
100K docs: ~50ms
1M docs: ~200ms
10M docs: ~2000ms (unacceptable!)

Our Solution: Partitioned Indexes

We partition vectors by tenant (chatbot):

sql
-- Create partitioned table
CREATE TABLE chunks (
    id UUID PRIMARY KEY,
    chatbot_id UUID NOT NULL,
    content TEXT,
    embedding vector(1536)
) PARTITION BY HASH (chatbot_id);

-- Create partitions
CREATE TABLE chunks_p0 PARTITION OF chunks
    FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 16 partitions

-- Index each partition
CREATE INDEX ON chunks_p0 
    USING ivfflat (embedding vector_cosine_ops);

Now queries only scan relevant partitions:

sql
SELECT content, embedding <=> $1 AS distance
FROM chunks
WHERE chatbot_id = $2  -- Only scans relevant partition
ORDER BY distance
LIMIT 10;

Result: Consistent ~50ms queries regardless of total document count.

Scaling AI Generation

The Problem

LLM API calls are:

Expensive (~$0.01-0.10 per request)
Slow (1-10 seconds)
Rate-limited

Our Solutions

1. Response Caching

typescript
const cacheKey = hash(question + contextChunks);
const cached = await redis.get(cacheKey);
if (cached) return cached;

const response = await generateResponse(question, contextChunks);
await redis.setex(cacheKey, 3600, response);
return response;

2. Streaming Responses Don't wait for the full response — stream tokens to the client:

typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  controller.enqueue(chunk.choices[0]?.delta?.content || "");
}

3. Model Routing Use cheaper models for simple questions:

typescript
const complexity = await classifyComplexity(question);
const model = complexity === "simple" ? "gpt-4o-mini" : "gpt-4o";

Results

Metric	Before	After
Ingestion Speed	1 doc/sec	50 docs/sec
Search Latency (P99)	2000ms	80ms
API Response Time	8s	2s
Monthly Infrastructure	$5,000	$800

Key Takeaways

Partition early — Retrofitting partitioning is painful
Background everything — Never block the user on heavy operations
Cache aggressively — LLM calls are expensive; reuse results
Stream responses — Perceived latency matters as much as actual latency
Monitor everything — You can't optimize what you can't measure

Building for scale isn't about handling today's load — it's about handling tomorrow's without rewriting everything.

See It In Action →

Building for Scale: How We Handle Millions of Documents

The Challenge

Architecture Overview

Scaling Document Ingestion

The Problem

Our Solution: Background Jobs with Trigger.dev

Scaling Vector Search

The Problem

Our Solution: Partitioned Indexes

Scaling AI Generation

The Problem

Our Solutions

Results

Key Takeaways

Related Articles

Why We Migrated from Pinecone to pgvector: A 97% Cost Reduction Story

12 AI Chatbot Metrics You Should Track (And Why)

RAG vs Fine-Tuning: Which is Right for Your AI Chatbot?

Ready to try Chatsy?