Chatsy

Building for Scale: Handling Millions of Docs

How we scaled our RAG system from 50 to 2 million documents using pgvector partitioning, background jobs, and response caching — cutting costs 84%.

Chatsy Team
Engineering
November 28, 2024 (Updated: January 15, 2026)
14 min read

When we launched Chatsy, our first customer had 50 documents. Today, our largest enterprise customer has over 2 million. Here's how we built a system that scales seamlessly across that range.

TL;DR:

  • Scaling a RAG system means solving three bottlenecks simultaneously: document ingestion, vector search, and AI generation.
  • Partitioned indexes by tenant keep vector search at a consistent ~50ms regardless of total document count, and background job processing handles ingestion at 50 docs/sec.
  • Response caching, streaming, and smart model routing cut API response time from 8s to 2s and infrastructure costs from $5,000 to $800/month.
  • Key lesson: partition early, background everything, cache aggressively, and stream responses to minimize perceived latency.

The Challenge

Scaling a RAG (Retrieval-Augmented Generation) system is uniquely challenging because you're scaling three things simultaneously:

  1. Document ingestion — Processing and embedding documents
  2. Vector search — Finding relevant content quickly
  3. AI generation — Producing high-quality responses

Each has different scaling characteristics and bottlenecks. A naive implementation that works at 1,000 documents will fall apart at 100,000. And what works at 100K will grind to a halt at 2M. The strategies you need shift at each order of magnitude.

Architecture Overview

Our production architecture separates concerns into three layers: request handling, data persistence, and background processing. Each layer scales independently.

┌─────────────────────────────────────────────────────┐
│                    Load Balancer                     │
│              (health checks, SSL term)              │
└─────────────────────────────────────────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│   API Pod    │  │   API Pod    │  │   API Pod    │
│  (stateless) │  │  (stateless) │  │  (stateless) │
└──────────────┘  └──────────────┘  └──────────────┘
        │                 │                 │
        └─────────────────┼─────────────────┘
                          │
        ┌─────────────────┼─────────────────┐
        ▼                 ▼                 ▼
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  PostgreSQL  │  │    Redis     │  │   Trigger    │
│  + pgvector  │  │    Cache     │  │    .dev      │
└──────────────┘  └──────────────┘  └──────────────┘

Why This Topology Works

  • API pods are stateless. Any pod can handle any request. Scaling horizontally means adding pods behind the load balancer.
  • PostgreSQL with pgvector handles both relational data and vector embeddings in one database, eliminating the need to synchronize a separate vector store.
  • Redis sits between the API and both PostgreSQL and OpenAI, caching embeddings, search results, and generated responses.
  • Trigger.dev manages all background work: document processing, batch embedding, scheduled cleanup, and webhook delivery.

The key architectural decision was keeping everything in PostgreSQL rather than running a separate vector database. This simplifies operations, reduces infrastructure cost, and lets us use standard database tooling for backups, migrations, and monitoring. Our migration from Pinecone to pgvector covers the reasoning in depth.

Scaling Document Ingestion

The Ingestion Bottleneck

Document processing is CPU and memory intensive:

  • PDF parsing and text extraction
  • Content cleaning and normalization
  • Intelligent chunking with overlap
  • Embedding generation (API calls)
  • Database writes with vector indexing

A single 100-page PDF can take 30+ seconds to process. When a customer uploads 500 documents at once, that's over 4 hours of sequential processing. Users expect their content to be searchable in minutes, not hours.

Our Solution: Background Jobs with Trigger.dev

We use Trigger.dev for reliable, scalable background processing:

```typescript
export const processDocument = task({
  id: "process-document",
  retry: { maxAttempts: 3 },
  run: async ({ documentId }) => {
    // 1. Extract text
    const text = await extractText(documentId);

    // 2. Chunk intelligently
    const chunks = await chunkText(text, {
      maxTokens: 512,
      overlap: 50,
    });

    // 3. Generate embeddings in batches
    for (const batch of chunk(chunks, 100)) {
      const embeddings = await openai.embeddings.create({
        model: "text-embedding-3-small",
        input: batch.map((c) => c.text),
      });

      // 4. Store in database
      await prisma.chunk.createMany({
        data: batch.map((chunk, i) => ({
          documentId,
          content: chunk.text,
          embedding: embeddings.data[i].embedding,
        })),
      });
    }
  },
});
```

This approach gives us:

  • Horizontal scaling: Add more workers as needed
  • Reliability: Automatic retries on failure
  • Visibility: Track progress in real-time

Batch Processing for Bulk Uploads

When customers import hundreds of documents at once, we use a batch orchestration task that fans out work across multiple workers:

```typescript
export const batchImport = task({
  id: "batch-import",
  run: async ({ chatbotId, documentIds }) => {
    // Process in parallel batches of 10
    const batches = chunk(documentIds, 10);

    for (const batch of batches) {
      await Promise.all(
        batch.map((docId) =>
          processDocument.triggerAndWait({ documentId: docId })
        )
      );

      // Update progress for the UI
      await updateImportProgress(chatbotId, {
        processed: batch.length,
        total: documentIds.length,
      });
    }
  },
});
```

At 10 concurrent workers, a 500-document upload completes in roughly 25 minutes instead of 4 hours. The progress updates feed into a real-time UI so customers can see exactly where their import stands.
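Both snippets above lean on a `chunk` helper that isn't part of any standard library. A minimal version (assuming lodash-style semantics, where the last group may be shorter) looks like this:

```typescript
// Split an array into consecutive groups of at most `size` elements.
// The final group may be shorter when the length isn't a multiple of `size`.
function chunk<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}
```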

Queue Management and Backpressure

At scale, you need to handle bursts without overwhelming downstream services. Our queue strategy uses three priority levels:

  1. High priority -- Real-time document updates (customer edits an article, expects immediate re-indexing)
  2. Normal priority -- Standard uploads and imports
  3. Low priority -- Scheduled re-indexing, cleanup tasks, and bulk migrations

When the embedding API rate limit approaches, we apply backpressure by slowing the dequeue rate rather than failing jobs. Trigger.dev's built-in concurrency controls handle this cleanly.

The Search Latency Problem

As document count grows, search latency increases. Without partitioning, on a single flat index:

  • 10K chunks: ~20ms
  • 100K chunks: ~50ms
  • 1M chunks: ~200ms
  • 10M chunks: ~2000ms (unacceptable for real-time chat)

The reason is straightforward: IVFFlat indexes scan a fixed percentage of clusters. More vectors means more data to scan per query.
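In pgvector, that scan fraction is controlled by the `ivfflat.probes` setting: each query visits `probes` of the index's `lists` clusters, so raising it improves recall at the cost of scanning more rows. The values below are illustrative, not our production settings:

```sql
-- Index built with 100 cluster centroids
CREATE INDEX ON chunks USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

-- Each query scans `probes` of those 100 clusters;
-- higher probes = better recall, more data scanned
SET ivfflat.probes = 10;
```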

Our Solution: Partitioned Indexes

We partition vectors by tenant (chatbot), a strategy refined during our migration from Pinecone to pgvector:

```sql
-- Create partitioned table
-- (PostgreSQL requires the primary key to include the partition key)
CREATE TABLE chunks (
  id UUID NOT NULL,
  chatbot_id UUID NOT NULL,
  content TEXT,
  embedding vector(1536),
  PRIMARY KEY (id, chatbot_id)
) PARTITION BY HASH (chatbot_id);

-- Create partitions
CREATE TABLE chunks_p0 PARTITION OF chunks
  FOR VALUES WITH (MODULUS 16, REMAINDER 0);
-- ... repeat for 16 partitions

-- Index each partition
CREATE INDEX ON chunks_p0 USING ivfflat (embedding vector_cosine_ops);
```

Now queries only scan relevant partitions:

```sql
SELECT content, embedding <=> $1 AS distance
FROM chunks
WHERE chatbot_id = $2  -- Only scans the relevant partition
ORDER BY distance
LIMIT 10;
```

Result: Consistent ~50ms queries regardless of total document count.

Database Sharding Strategy

At 2M+ documents, even partitioned indexes benefit from careful tuning. Our sharding approach follows three principles:

  1. Tenant isolation via hash partitioning. Each chatbot's data lands in one of 16 hash partitions. This prevents large tenants from degrading search for small ones.
  2. Index tuning per partition. The IVFFlat lists parameter should scale with the number of vectors in each partition. We run a nightly job that checks partition sizes and rebuilds indexes when the vector count crosses a threshold (e.g., 100K vectors triggers an increase from 100 to 250 lists).
  3. Connection pooling. PgBouncer sits between the API pods and PostgreSQL, keeping the connection count manageable even under high concurrency. Without it, 50 concurrent API pods would exhaust PostgreSQL's max connections.
┌──────────┐     ┌────────────┐     ┌────────────────┐
│ API Pods │ ──▶ │ PgBouncer  │ ──▶ │  PostgreSQL    │
│ (50+)    │     │ (pool: 20) │     │  (max_conn: 50)│
└──────────┘     └────────────┘     └────────────────┘
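The nightly re-tuning job from principle 2 above boils down to a count check plus an index rebuild. Sketched in SQL (the index name is hypothetical, and the values mirror the 100K/250-lists example; pgvector's IVFFlat accepts `lists` only at build time, so changing it requires a rebuild):

```sql
-- Check how many vectors a partition holds
SELECT count(*) FROM chunks_p0;

-- If the count crossed the threshold, rebuild with more lists
DROP INDEX IF EXISTS chunks_p0_embedding_idx;
CREATE INDEX chunks_p0_embedding_idx ON chunks_p0
  USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 250);
```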

HNSW vs IVFFlat: When to Switch

We started with IVFFlat because it's faster to build and works well at moderate scale. As partition sizes grow, HNSW becomes attractive for its better recall-latency trade-off:

| Factor | IVFFlat | HNSW |
|---|---|---|
| Build time | Fast (minutes) | Slow (hours at 1M+) |
| Query latency | Good with tuning | Consistently fast |
| Memory usage | Lower | Higher (~2x) |
| Insert performance | Fast | Moderate |
| Best for | < 500K vectors/partition | > 500K vectors/partition |

For most of our tenants, IVFFlat is the right choice. We reserve HNSW for enterprise customers with 500K+ chunks where the marginal query latency improvement justifies the memory cost.
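Switching a large partition to HNSW is a per-partition decision; with pgvector the index type is chosen at creation time. The `m` and `ef_construction` values below are pgvector's defaults, not a tuned recommendation:

```sql
-- HNSW index for a high-volume partition (500K+ vectors)
CREATE INDEX ON chunks_p3 USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```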

The Caching Layer

Caching is the single biggest cost reducer in our stack. We cache at three levels:

Level 1: Embedding Cache (Redis)

Generating embeddings costs money and adds latency. When the same text chunk is re-indexed (e.g., a document is re-imported without changes), we skip the API call:

```typescript
async function getEmbedding(text: string): Promise<number[]> {
  const cacheKey = `emb:${hash(text)}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const result = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
  });

  const embedding = result.data[0].embedding;
  await redis.setex(cacheKey, 86400 * 7, JSON.stringify(embedding)); // 7 days

  return embedding;
}
```

This alone cut our embedding API costs by 35% because many customers re-import documents during testing and setup.
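The `hash` helper used in these cache keys is assumed rather than shown; any stable digest works, since the key only needs to be deterministic for identical input. A minimal version using Node's built-in crypto module:

```typescript
import { createHash } from "node:crypto";

// Deterministic cache-key component: identical text always
// yields the same 64-character hex digest
function hash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}
```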

Level 2: Search Result Cache (Redis)

Identical queries to the same chatbot often return the same results. We cache the retrieved chunks for a short TTL:

```typescript
const searchCacheKey = `search:${chatbotId}:${hash(queryEmbedding)}`;

const cachedResults = await redis.get(searchCacheKey);
if (cachedResults) return JSON.parse(cachedResults);

const results = await vectorSearch(chatbotId, queryEmbedding);
await redis.setex(searchCacheKey, 300, JSON.stringify(results)); // 5 min
```

The short TTL (5 minutes) ensures freshness while still catching the common pattern of multiple users asking the same question within a short window.

Level 3: Response Cache (Redis)

The most impactful cache. Full LLM responses for identical question + context combinations:

```typescript
const responseCacheKey = `resp:${hash(question + contextChunks)}`;

const cached = await redis.get(responseCacheKey);
if (cached) return cached;

const response = await generateResponse(question, contextChunks);
await redis.setex(responseCacheKey, 3600, response); // 1 hour

return response;
```

Hit rates vary by use case: FAQ-heavy chatbots see 40-60% cache hits. Support bots with unique account-specific questions see 10-15%. Even 10% saves meaningful cost at scale.

Scaling AI Generation

The Generation Cost Problem

LLM API calls are:

  • Expensive (~$0.01-0.10 per request)
  • Slow (1-10 seconds)
  • Rate-limited

Our Solutions

1. Streaming Responses

Don't wait for the full response -- stream tokens to the client:

```typescript
const stream = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  stream: true,
});

for await (const chunk of stream) {
  controller.enqueue(chunk.choices[0]?.delta?.content || "");
}
```

Streaming cuts perceived latency dramatically. The user sees the first token in ~300ms instead of waiting 2-5 seconds for the complete response. This makes a bigger difference to user satisfaction than raw speed improvements.

2. Model Routing

Use cheaper models for simple questions:

```typescript
const complexity = await classifyComplexity(question);
const model = complexity === "simple" ? "gpt-4o-mini" : "gpt-4o";
```

In practice, 60-70% of support questions are straightforward enough for GPT-4o-mini. Routing saves roughly 50% on LLM costs compared to using GPT-4o for everything.
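`classifyComplexity` is doing the real work in that snippet. A cheap heuristic first pass can avoid a classifier call entirely for obvious cases -- the keyword list and length cutoff below are purely illustrative assumptions, not our production rules:

```typescript
type Complexity = "simple" | "complex";

// Hypothetical first-pass heuristic: route short, single-question
// messages to the cheap model; escalate anything that looks multi-step
function classifyComplexityHeuristic(question: string): Complexity {
  const complexSignals = ["compare", "why", "explain", "difference", "troubleshoot"];
  const words = question.trim().split(/\s+/).length;
  const hasSignal = complexSignals.some((s) =>
    question.toLowerCase().includes(s)
  );
  return words > 30 || hasSignal ? "complex" : "simple";
}
```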

3. Rate Limit Management

At scale, you'll hit OpenAI rate limits. We handle this with a token bucket implementation that queues requests when approaching limits and falls back to a secondary model provider if the primary is throttled.
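The token bucket itself fits in a few lines; this sketch uses illustrative capacity and refill values and omits the queueing and fallback-provider logic:

```typescript
// Minimal token bucket: holds up to `capacity` tokens, refilled
// continuously at `refillPerSec`. tryRemove() returns false when the
// bucket is empty, signalling the caller to queue the request or
// fall back to a secondary provider.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private capacity: number,
    private refillPerSec: number,
    now: number = Date.now()
  ) {
    this.tokens = capacity;
    this.lastRefill = now;
  }

  tryRemove(count = 1, now: number = Date.now()): boolean {
    // Refill proportionally to elapsed time, capped at capacity
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = now;

    if (this.tokens < count) return false;
    this.tokens -= count;
    return true;
  }
}
```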

Results

| Metric | Before | After |
|---|---|---|
| Ingestion Speed | 1 doc/sec | 50 docs/sec |
| Search Latency (P99) | 2000ms | 80ms |
| API Response Time | 8s | 2s |
| Monthly Infrastructure | $5,000 | $800 |

Monitoring and Alerting

You cannot optimize what you cannot measure. Our monitoring setup tracks four categories of metrics:

Infrastructure Metrics

| Metric | Tool | Alert Threshold |
|---|---|---|
| API response time (P50, P95, P99) | Application logs | P99 > 3s |
| Database connection pool usage | PgBouncer stats | > 80% utilization |
| Redis memory usage | Redis INFO | > 75% of max memory |
| Background job queue depth | Trigger.dev dashboard | > 1,000 pending jobs |

Business Metrics

| Metric | What It Tells You |
|---|---|
| Docs processed per minute | Ingestion pipeline health |
| Search latency by tenant | Whether a large tenant is degrading performance |
| Cache hit rate by level | Whether your caching strategy is effective |
| LLM cost per conversation | Whether model routing is working |

Alerting Strategy

We use a tiered alerting approach to avoid fatigue:

  • P1 (page immediately): Search latency P99 > 3s, API error rate > 5%, database connection exhaustion
  • P2 (Slack alert, respond in 1 hour): Ingestion queue depth > 5,000, cache hit rate drops below 20%, LLM error rate > 2%
  • P3 (daily review): Cost anomalies, slow partition index builds, elevated retry rates

The single most useful alert we added was on per-tenant search latency. When a customer imports a massive document set, their partition's index can become under-tuned. The alert catches this before the customer notices.
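Per-tenant latency is cheap to compute if query timings are logged alongside the tenant id. The table and column names below are hypothetical, but `percentile_cont` is standard PostgreSQL:

```sql
-- P99 search latency per chatbot over the last hour
SELECT chatbot_id,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99_ms
FROM search_query_logs
WHERE created_at > now() - interval '1 hour'
GROUP BY chatbot_id
ORDER BY p99_ms DESC;
```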

Real Benchmarks

Here is what our production numbers look like across different scale tiers:

| Scale Tier | Docs | Chunks | Avg Search (P50) | Search (P99) | Ingestion Rate |
|---|---|---|---|---|---|
| Starter | < 1K | < 10K | 8ms | 25ms | 5 docs/sec |
| Growth | 1K-50K | 10K-500K | 15ms | 45ms | 20 docs/sec |
| Scale | 50K-500K | 500K-5M | 25ms | 65ms | 50 docs/sec |
| Enterprise | 500K-2M+ | 5M-20M+ | 35ms | 80ms | 50 docs/sec |

These numbers hold because each tenant's vectors are isolated in their own partition. A single enterprise tenant with 2M documents does not affect the search latency of a starter tenant with 500 documents.

Key Takeaways

  1. Partition early -- Retrofitting partitioning is painful. Design for tenant isolation from day one.
  2. Background everything -- Never block the user on heavy operations. Document processing, embedding generation, and index rebuilds all belong in background jobs.
  3. Cache at every layer -- Embedding cache, search cache, and response cache each target a different bottleneck. Combined, they cut costs by 60-70%.
  4. Stream responses -- Perceived latency matters as much as actual latency. First-token time of 300ms feels instant even if the full response takes 3 seconds.
  5. Monitor per-tenant -- Aggregate metrics hide problems. A single large tenant can degrade their own experience without affecting averages.
  6. Right-size your indexes -- IVFFlat works for most tenants. Reserve HNSW for the largest partitions where recall-latency trade-offs justify the memory cost.
  7. Plan for connection pooling -- PgBouncer (or equivalent) is not optional at scale. Fifty API pods opening direct connections will exhaust PostgreSQL.

Building for scale isn't about handling today's load -- it's about handling tomorrow's without rewriting everything. Our adoption of techniques like hybrid search plays a key role in that scalability.



Frequently Asked Questions

When should you plan for scale?

Plan for scale from the start if you expect significant growth. Retrofitting partitioning is painful — Chatsy's architecture uses tenant-partitioned indexes from day one, which keeps vector search at ~50ms regardless of document count. If you're building a RAG system that could grow from hundreds to millions of documents, design for partitioning, background jobs, and caching from the beginning.

What are the biggest scaling challenges for RAG systems?

The three main bottlenecks are document ingestion (CPU and memory intensive — a single 100-page PDF can take 30+ seconds), vector search latency (which degrades from ~20ms at 10K docs to ~2000ms at 10M docs without partitioning), and AI generation cost and latency (LLM calls are expensive and slow). Each requires different solutions: background jobs for ingestion, partitioned indexes for search, and caching plus model routing for generation.

What infrastructure is required for scale?

Chatsy's scaled architecture uses load-balanced API pods, PostgreSQL with pgvector for partitioned vector storage, Redis for response caching, and Trigger.dev for background job processing. The key is horizontal scaling of workers for ingestion (50 docs/sec), partitioned tables so queries only scan relevant tenant data, and Redis to cache LLM responses and avoid redundant API calls.

What are the cost implications of scaling?

Proper scaling can dramatically reduce costs. Chatsy cut monthly infrastructure from $5,000 to $800 by implementing response caching, streaming, and smart model routing — reducing API response time from 8s to 2s. Response caching reuses results for identical questions; model routing uses cheaper models (e.g., GPT-4o-mini) for simple queries. The trade-off is upfront engineering investment for long-term savings.

How do you monitor systems at scale?

You can't optimize what you can't measure. Monitor ingestion throughput (docs/sec), search latency (P99), API response time, and infrastructure costs. Chatsy's results show the value: tracking these metrics revealed the ingestion bottleneck (1 doc/sec → 50 docs/sec with background jobs), search latency spike (2000ms → 80ms with partitioning), and API cost (8s → 2s with caching and routing).


#architecture #scale #performance #engineering