A client came to us with a straightforward-sounding requirement: ingest a corpus of 15,000 internal policy documents, make them searchable via natural language, and surface relevant excerpts inside their existing web portal. Three weeks later, we had a pipeline processing 500+ documents per day with 99.5% uptime. Here's exactly how we built it.
The Problem With Naive RAG
The first instinct with any retrieval-augmented generation (RAG) system is to chunk documents, embed them, and throw them into a vector store. We've seen this approach fail in production more times than we can count. The issues are predictable: chunks that split mid-sentence lose context, embeddings from different document types cluster poorly, and without a queue layer, a burst of uploads will overwhelm your embedding API rate limits.
For this client, documents ranged from 2-page memos to 200-page compliance manuals. A fixed chunk size of 512 tokens would shred the manuals into meaningless fragments while leaving the memos as single oversized chunks. We needed a smarter approach.
Architecture Overview
S3 Upload → SQS Queue → Lambda Processor
    → Document Parser (Textract / pdfplumber)
    → Semantic Chunker
    → OpenAI Embeddings (with retry + backoff)
    → Pinecone Upsert
    → PostgreSQL Metadata Store

Query API → Pinecone Search → GPT-4o Synthesis → Response
Semantic Chunking Over Fixed Windows
Instead of splitting by token count, we used a sentence-boundary chunker that respects paragraph structure. The algorithm works in two passes: first, split on paragraph breaks and sentence boundaries; second, merge adjacent chunks until they approach a target size (we used 400 tokens), but never merge across section headings.
This produced chunks that were semantically coherent — a chunk about "termination clauses" stayed together rather than being split across two embedding vectors. Retrieval precision improved noticeably in our internal evals compared to the fixed-window baseline.
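The two-pass merge described above can be sketched as follows. This is a minimal sketch, not the production code: it approximates token counts by whitespace word count (the real pipeline would use a tokenizer), uses a hypothetical `is_heading` heuristic for section boundaries, and omits the sentence-level splitting of oversized paragraphs for brevity.

```python
import re

TARGET_TOKENS = 400  # target chunk size from the article


def is_heading(block: str) -> bool:
    # Heuristic: a short single line with no terminal punctuation.
    return len(block) < 80 and "\n" not in block and not block.rstrip().endswith((".", "?", "!"))


def approx_tokens(text: str) -> int:
    # Rough proxy; production code would count real tokens.
    return len(text.split())


def semantic_chunks(document: str) -> list[str]:
    # Pass 1: split on paragraph breaks.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", document) if b.strip()]

    # Pass 2: merge adjacent blocks toward the target size,
    # but never merge across a section heading.
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if is_heading(block):
            if current:
                chunks.append(current)
            current = block  # a heading always starts a fresh chunk
            continue
        merged = (current + "\n\n" + block).strip() if current else block
        if approx_tokens(merged) <= TARGET_TOKENS:
            current = merged
        else:
            if current:
                chunks.append(current)
            current = block
    if current:
        chunks.append(current)
    return chunks
```

A heading flushes whatever chunk is in progress, so a chunk never straddles two sections even when both are well under the target size.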
We also stored the surrounding context (the previous and next chunk IDs) in PostgreSQL so the query layer could fetch adjacent chunks when a retrieved chunk appeared to be mid-thought.
Queue Architecture: Why SQS Saved Us
The client's document upload pattern was bursty — nothing for hours, then 200 documents uploaded at once after a policy review meeting. Without a queue, this would have hammered the OpenAI embeddings API and triggered rate limit errors.
We put SQS between the S3 upload trigger and the Lambda processor; each message represents one document. Lambda concurrency was capped at 10 to stay within OpenAI's tier-1 rate limits (3,000 RPM for text-embedding-3-small). A dead-letter queue catches anything that fails after 3 retries, and a CloudWatch alarm pages us if the DLQ depth exceeds 5.
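Even with concurrency capped, individual embedding calls can still hit transient rate-limit errors, which is where the retry + backoff from the pipeline diagram comes in. A minimal sketch of that wrapper — `with_backoff` is a hypothetical helper, and the embeddings call in the docstring is just an example of what it would wrap:

```python
import random
import time


def with_backoff(call, max_retries: int = 3, base_delay: float = 1.0):
    """Retry `call` with exponential backoff plus jitter.

    `call` wraps the actual request, e.g.
    lambda: client.embeddings.create(model="text-embedding-3-small", input=batch)
    """
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted retries: let the message go to the DLQ
            # Delay doubles each attempt; jitter spreads out concurrent Lambdas.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Letting the final failure propagate is deliberate: SQS redelivers the message, and after 3 failed deliveries it lands in the dead-letter queue described above.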
The result: a burst of 200 uploads queues cleanly and processes over ~20 minutes without a single rate limit error. The client sees documents appear in search within 30 minutes of upload, which met their SLA.
Handling PDF Extraction Failures
About 8% of the document corpus was scanned PDFs — images of text, not actual text layers. AWS Textract handles these, but it's slower (5–30 seconds per page) and costs more. We added a pre-processing step that checks for a text layer using pdfplumber; if the extracted text is below a character-per-page threshold, we route to Textract instead.
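The routing decision itself is a small function. A sketch, with the per-page character counts separated out so the logic is testable — the threshold value here is illustrative and would be tuned on your corpus, and the commented pdfplumber snippet shows where the counts would come from:

```python
CHARS_PER_PAGE_THRESHOLD = 100  # illustrative; tune against known scanned PDFs


def needs_ocr(page_char_counts: list[int], threshold: int = CHARS_PER_PAGE_THRESHOLD) -> bool:
    """True if the PDF lacks a usable text layer and should route to Textract."""
    if not page_char_counts:
        return True
    return sum(page_char_counts) / len(page_char_counts) < threshold


# With pdfplumber, the counts would come from something like:
#   with pdfplumber.open(path) as pdf:
#       counts = [len(page.extract_text() or "") for page in pdf.pages]
```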
Textract jobs are asynchronous, so we used a separate SQS queue with a polling Lambda that checks job status every 30 seconds. Once complete, the extracted text flows into the same chunking pipeline as native PDFs.
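The polling loop is simple enough to sketch. `poll_textract` is a hypothetical helper with the boto3 call injected as a callable so the control flow is visible (and testable) on its own; the real status lookup would be along the lines of `textract.get_document_text_detection(JobId=jid)["JobStatus"]`:

```python
import time


def poll_textract(get_status, job_id: str, interval: float = 30.0, max_polls: int = 120):
    """Poll an async Textract job until it reaches a terminal state.

    `get_status` abstracts the boto3 call, e.g.
    lambda jid: textract.get_document_text_detection(JobId=jid)["JobStatus"]
    """
    for _ in range(max_polls):
        status = get_status(job_id)
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(interval)  # 30s in our pipeline
    return "TIMED_OUT"
```

In production this loop lives across invocations of the polling Lambda rather than inside one long-running call, but the state machine is the same.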
The Query Layer
The query API is a Next.js route handler. It takes a natural language question, embeds it with the same model used for the documents (text-embedding-3-small), queries Pinecone for the top-8 chunks, fetches adjacent context chunks from PostgreSQL where needed, and passes the assembled context to GPT-4o with a system prompt instructing it to cite the source document and page number.
We added a confidence filter: if the top Pinecone result has a cosine similarity below 0.72, we return a "no relevant documents found" response rather than hallucinating an answer from weak context. This threshold was tuned against a test set of 50 known-good and 50 known-bad queries.
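The filter is a one-liner once the Pinecone results are in hand. A sketch in Python (the query layer itself is TypeScript, but the logic is identical), assuming matches arrive as dicts with a `score` field as in Pinecone's JSON responses:

```python
SIMILARITY_FLOOR = 0.72  # tuned against 50 known-good / 50 known-bad queries


def filter_matches(matches: list[dict]) -> list[dict]:
    """Drop the entire result set if even the best match is weak.

    An empty return triggers the "no relevant documents found" response
    instead of letting GPT-4o synthesize from weak context.
    """
    if not matches or matches[0]["score"] < SIMILARITY_FLOOR:
        return []
    return matches
```

Note that the floor is checked only against the top match: if the best hit clears it, weaker tail results are kept, since adjacent-chunk context can still make them useful.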
Observability
Every pipeline stage emits structured logs to CloudWatch with a correlation ID that ties together the S3 upload event, SQS message, Lambda invocation, Pinecone upsert, and PostgreSQL write. When something fails, we can trace the exact document through every stage in under a minute.
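The log shape that makes this tracing work is one JSON object per line, keyed by the correlation ID. A minimal sketch — the stage names and field names here are illustrative, not our exact schema:

```python
import json
import logging
import sys
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("pipeline")


def new_correlation_id() -> str:
    # Minted once at the S3 upload event, then threaded through the
    # SQS message attributes and every downstream stage.
    return str(uuid.uuid4())


def stage_event(stage: str, correlation_id: str, **fields) -> str:
    # One JSON object per line, so CloudWatch Logs Insights can
    # filter and join all stages on correlation_id.
    return json.dumps({"stage": stage, "correlation_id": correlation_id, **fields})


def log_stage(stage: str, correlation_id: str, **fields) -> None:
    log.info(stage_event(stage, correlation_id, **fields))


# Example:
# cid = new_correlation_id()
# log_stage("s3_upload", cid, bucket="policy-docs", key="memo-101.pdf")
# log_stage("pinecone_upsert", cid, chunk_count=42)
```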
We track three key metrics in a Datadog dashboard: pipeline throughput (documents/hour), embedding latency (p50/p95), and query latency (p50/p95). Alerts fire if p95 query latency exceeds 4 seconds or if the DLQ depth rises above 5.
What We'd Do Differently
The main thing we'd change is the embedding model selection process. We went with text-embedding-3-small for cost reasons, but for a legal/compliance document corpus, text-embedding-3-large's higher dimensionality (3072 vs 1536) would likely improve retrieval precision on nuanced queries. The cost difference at 500 documents/day is about $15/month — worth it in hindsight.
We'd also invest earlier in an evaluation harness. We built our test set of 100 queries manually in week 3. Having it in week 1 would have let us make better chunking decisions earlier.
Building something similar? We've done this across multiple industries and document types.