Everyone's building RAG. Most of it doesn't work past the demo. After shipping Kontrakku — a contract review product running on a production RAG pipeline processing thousands of legal documents daily — here's what I actually learned.
Why RAG Fails in Production
The toy version is seductive: embed some documents, do a cosine similarity search, shove the top-k chunks into a prompt. It works on your 10-document test set. It falls apart when your corpus hits 50,000 clauses, your users ask ambiguous questions, and your retrieval returns chunks that are technically similar but contextually useless.
The problems cluster into three areas: retrieval quality, chunk design, and answer grounding. Fix all three or fix none — they compound.
Chunk Design Is Everything
Fixed-size chunking (split every 512 tokens) is a trap. Legal documents aren't uniform — a single clause can be 20 tokens or 400. What matters is semantic coherence per chunk.
What worked for us:
- Hierarchical chunking — keep a parent document summary + child clause chunks. Retrieve at the clause level, but include the parent summary in context.
- Overlap matters — 15-20% token overlap between adjacent chunks. Clauses reference each other constantly.
- Metadata injection — every chunk carries the document title, section header, and clause number. This alone improved answer precision by ~30%.
The chunk is the unit of retrieval. If your chunks don't make sense in isolation, your answers won't either.
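To make the three ideas above concrete, here is a minimal sketch rather than our production code. The `Chunk` class, its field names, the `split_with_overlap` helper, and the example document title are all hypothetical, and tokens are plain strings instead of real tokenizer output.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str
    section: str
    clause_no: str
    text: str
    parent_summary: str  # assumption: one summary stored per parent document

    def for_embedding(self) -> str:
        # Metadata injection: prefix the clause text with where it came from.
        return f"{self.doc_title} | {self.section} | Clause {self.clause_no}\n{self.text}"

    def for_prompt(self) -> str:
        # Hierarchical context: the clause is retrieved, the parent summary rides along.
        return f"Document summary: {self.parent_summary}\n{self.for_embedding()}"


def split_with_overlap(tokens: list[str], size: int, overlap_ratio: float = 0.15) -> list[list[str]]:
    """Split an over-long clause into windows that share ~overlap_ratio of their tokens."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = round(size * overlap_ratio)
    step = max(1, size - overlap)
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the clause
    return windows


# Hypothetical example values, for illustration only.
chunk = Chunk(
    doc_title="Master Services Agreement",
    section="Limitation of Liability",
    clause_no="47",
    text="The total liability of either party...",
    parent_summary="Services agreement between a vendor and a customer.",
)
print(chunk.for_prompt())
```

Short clauses stay whole as a single chunk. The overlap helper is only needed for clauses that are longer than the size you are willing to embed.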
Retrieval: Beyond Cosine Similarity
Vanilla dense retrieval misses exact-match queries. A lawyer searching for "indemnification cap" wants the exact clause, not semantically adjacent paragraphs about liability.
We run a hybrid pipeline:
query
-> BM25 (sparse, keyword match) -> top-20
-> dense embedding search -> top-20
-> reciprocal rank fusion -> merged top-30
-> cross-encoder re-ranker -> final top-5
The cross-encoder re-ranker is the single most impactful addition. It's slower (~80ms extra latency) but improves relevance dramatically, because it reads the query and each chunk together in one pass rather than comparing embeddings that were computed independently.
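The fusion step in that pipeline is small enough to sketch in plain Python. This is an illustrative version, not our production code: the function name and the chunk IDs are made up, and `k = 60` is the constant commonly used for reciprocal rank fusion, not a value we tuned.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 30) -> list[str]:
    """Fuse ranked lists of chunk IDs: score(d) = sum over lists of 1 / (k + rank_of_d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Best fused score first; ties are broken by chunk ID so the output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


# Hypothetical result lists from the sparse and dense retrievers.
bm25_hits = ["c48", "c47", "c12"]
dense_hits = ["c47", "c90", "c48"]
merged = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(merged)
```

Because the fusion only looks at ranks, it never has to reconcile BM25 scores with cosine similarities, which live on different scales. Chunks that both retrievers agree on float to the top, and the re-ranker then sorts out the rest.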
Grounding: Don't Let the Model Hallucinate
With legal documents, hallucination isn't a UX problem — it's a liability problem. Every answer must cite the exact clause it came from.
The trick: include chunk identifiers in the prompt, instruct the model to cite by ID, then resolve citations back to source text in the API response. Users see the answer and the exact paragraph it was drawn from, highlighted in the original document.
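A minimal sketch of that round trip, assuming IDs follow the `CLAUSE-<number>` pattern. The function names, the response shape, and the sample model answer are hypothetical.

```python
import re

CITATION = re.compile(r"\[(CLAUSE-\d+)\]")


def build_context(clauses: dict[str, str]) -> str:
    """Render retrieved clauses as '[CLAUSE-ID] text' lines for the prompt."""
    return "\n".join(f"[{cid}] {text}" for cid, text in clauses.items())


def resolve_citations(answer: str, clauses: dict[str, str]) -> dict:
    """Map each cited ID back to its source text and flag IDs that were never provided."""
    cited = list(dict.fromkeys(CITATION.findall(answer)))  # dedupe, keep first-seen order
    return {
        "answer": answer,
        "citations": [{"id": cid, "text": clauses[cid]} for cid in cited if cid in clauses],
        "unknown_ids": [cid for cid in cited if cid not in clauses],
    }


clauses = {
    "CLAUSE-47": "The total liability of either party...",
    "CLAUSE-48": "Indemnification obligations shall survive...",
}
# Hypothetical model output; CLAUSE-99 was never in the context.
answer = "Liability is capped [CLAUSE-47]. See also [CLAUSE-99]."
response = resolve_citations(answer, clauses)
print(response["unknown_ids"])
```

A non-empty `unknown_ids` list is a cheap hallucination signal: the model cited something it was never shown, so the answer can be rejected or retried before a user sees it.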
system: You are a legal document analyst.
Answer only from the provided clauses.
Cite each claim using [CLAUSE-ID].
If the answer is not in the provided clauses, say "Not found in document."
context:
[CLAUSE-47] The total liability of either party...
[CLAUSE-48] Indemnification obligations shall survive...
Latency: Keeping It Under 2 Seconds
Our SLA is under 2s end-to-end. The pipeline breakdown:
- Query embedding: ~30ms (batched, cached for repeated queries)
- BM25 + dense retrieval: ~120ms
- Re-ranking: ~80ms
- LLM inference: ~900ms (streaming, so perceived latency is lower)
Streaming the LLM response was the biggest perceived-latency win. First token appears in ~400ms even if the full answer takes 1.2s.
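The arithmetic behind that budget is worth writing down once. The sketch below only adds up the approximate stage figures listed above. It assumes the stages run sequentially and ignores network and serialization overhead, and the dictionary keys are made-up names.

```python
SLA_MS = 2000

# Approximate per-stage figures from the breakdown above, in milliseconds.
stage_ms = {
    "query_embedding": 30,
    "bm25_plus_dense_retrieval": 120,
    "re_ranking": 80,
    "llm_inference": 900,
}

total_ms = sum(stage_ms.values())                      # 1130
headroom_ms = SLA_MS - total_ms                        # 870
pre_llm_ms = total_ms - stage_ms["llm_inference"]      # 230

print(f"total: {total_ms}ms, headroom: {headroom_ms}ms, before LLM starts: {pre_llm_ms}ms")
```

Under those assumptions, about 230ms passes before the model sees the prompt, so everything upstream of the LLM is already a small slice of the budget. The LLM call dominates, which is why streaming pays off more than shaving milliseconds from retrieval.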
What I'd Do Differently
Add re-ranking earlier. We bolted it on after launch, and retrofitting it was painful. If you're building RAG for anything beyond a toy, assume you'll need it from day one.
Also: evaluate constantly. We run 200 hand-labelled query-answer pairs through the pipeline on every deploy. RAG systems degrade silently: a model update, an embedding library bump, or a corpus change can each shift results without throwing an error. Automated evals catch it before your users do.
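A deploy gate along those lines can be very small. The sketch below is a simplification, not our harness: it checks retrieval only (did the expected clause show up in the top-k?), whereas scoring full answers needs more machinery. The function names, the threshold, the fake retriever, and the two test cases are all hypothetical.

```python
from typing import Callable


def hit_rate_at_k(
    cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results."""
    if not cases:
        raise ValueError("no eval cases")
    hits = sum(1 for query, expected in cases if expected in retrieve(query)[:k])
    return hits / len(cases)


def gate_deploy(score: float, threshold: float) -> None:
    """Abort the deploy (non-zero exit) when the eval score falls below the threshold."""
    if score < threshold:
        raise SystemExit(f"eval failed: hit rate {score:.2%} < {threshold:.2%}")


def fake_retrieve(query: str) -> list[str]:
    # Stand-in for the real pipeline, so the sketch runs on its own.
    index = {
        "indemnification cap": ["CLAUSE-47", "CLAUSE-48"],
        "governing law": ["CLAUSE-12"],
    }
    return index.get(query, [])


cases = [("indemnification cap", "CLAUSE-47"), ("governing law", "CLAUSE-03")]
score = hit_rate_at_k(cases, fake_retrieve)
print(f"hit rate@5: {score:.2%}")
```

In CI you would call `gate_deploy(score, threshold)` after computing the score, with the threshold set from a known-good baseline run, so a regression fails the build instead of reaching users.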
The full stack: pgvector for vector storage, BM25s for sparse retrieval, sentence-transformers for embeddings, cross-encoders for re-ranking, FastAPI serving the pipeline, Next.js on the frontend. Everything containerised with Docker, deployed on a single beefy VM — no Kubernetes until you need it.