RAG Pipeline Architecture: Lessons from Production
What actually works when you move beyond the tutorial
The Tutorial RAG vs Production RAG
Every RAG tutorial looks the same:
- Chunk documents
- Embed chunks
- Store in vector DB
- Query → retrieve → generate
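Condensed into code, the tutorial version really is a dozen lines. A minimal sketch of that pipeline, with a toy bag-of-words "embedding" and an in-memory list standing in for a real embedding model and vector DB (`toy_embed`, `cosine`, and `retrieve` are all illustrative names, not any library's API):

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: term counts. A real system uses a trained model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunk -> embed -> store (here: one "chunk" per document)
docs = ["RAG retrieves context before generating",
        "BM25 is a keyword ranking function",
        "Cross-encoders rerank retrieved chunks"]
store = [(d, toy_embed(d)) for d in docs]

def retrieve(query, k=2):
    q = toy_embed(query)
    ranked = sorted(store, key=lambda p: cosine(q, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

# Query -> retrieve -> generate: the retrieved context would be
# prepended to the LLM prompt for the generation step
context = "\n".join(retrieve("how does BM25 ranking work?"))
```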
In production, this falls apart spectacularly. Here's what I've learned building RAG systems at Intervues.
The Chunking Problem
Naive chunking (split every 500 tokens) destroys context. A code example split in half is useless. A table split across chunks loses its structure.
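The failure is easy to reproduce. Split a markdown document every N whitespace tokens and a fenced code block gets torn across two chunks, each holding an unmatched fence (toy example; the fence string is built with `"`" * 3` only so this snippet embeds cleanly in the post):

```python
def naive_chunk(text, max_tokens=8):
    """Split every max_tokens whitespace tokens, blind to structure."""
    tokens = text.split()
    return [' '.join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

fence = "`" * 3  # a markdown code fence
doc = (f"Install the package first.\n{fence}\npip install example\n"
       f"export EXAMPLE_KEY=abc123\n{fence}\nThen run the quickstart script.")

chunks = naive_chunk(doc)
# Opening and closing fences land in different chunks: [1, 1]
fence_counts = [c.count(fence) for c in chunks]
```

Either chunk on its own is broken markdown, and the second half of the command sequence has lost the sentence that explains it.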
What works better:
```mermaid
graph TD
    A[Document] --> B[Semantic Chunking]
    B --> C{Chunk Type?}
    C --> D[Code Block → Keep Whole]
    C --> E[Table → Keep Whole]
    C --> F[Paragraph → Split at Sentence Boundaries]
    C --> G[Header Section → Use as Metadata]
    D --> H[Embed with Context Window]
    E --> H
    F --> H
    G --> I[Attach as Metadata to Child Chunks]
```

Semantic Chunking Strategy
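The `semantic_chunk` function below leans on four helpers the post doesn't define: `split_by_headers`, `extract_special_blocks`, `split_sentences`, and `count_tokens`. Hypothetical minimal versions, assuming markdown input, whitespace tokenization, and code-fence detection only (table extraction omitted for brevity), might look like:

```python
import re

FENCE = '`' * 3  # markdown code fence, spelled out so this snippet embeds cleanly

def count_tokens(text):
    # Crude whitespace proxy; swap in a real tokenizer (e.g. tiktoken) in production
    return len(text.split())

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_by_headers(text):
    # Split markdown on '#' header lines -> [{'header': ..., 'content': ...}]
    sections, header, lines = [], '', []
    for line in text.splitlines():
        if line.startswith('#'):
            if header or any(l.strip() for l in lines):
                sections.append({'header': header, 'content': '\n'.join(lines)})
            header, lines = line.lstrip('#').strip(), []
        else:
            lines.append(line)
    sections.append({'header': header, 'content': '\n'.join(lines)})
    return sections

def extract_special_blocks(content):
    # Separate fenced code blocks from prose; a fuller version would detect tables
    parts = re.split(f'({FENCE}.*?{FENCE})', content, flags=re.S)
    return [{'type': 'code' if p.startswith(FENCE) else 'prose', 'text': p}
            for p in parts if p.strip()]
```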
```python
def semantic_chunk(text, max_tokens=500, overlap=50):
    """Chunk text respecting semantic boundaries."""
    sections = split_by_headers(text)
    chunks = []
    for section in sections:
        header = section['header']
        content = section['content']
        # Keep code blocks and tables intact
        blocks = extract_special_blocks(content)
        for block in blocks:
            if block['type'] in ('code', 'table'):
                chunks.append({
                    'text': block['text'],
                    'metadata': {'header': header, 'type': block['type']},
                })
            else:
                # Split prose at sentence boundaries
                sentences = split_sentences(block['text'])
                current_chunk = []
                current_tokens = 0
                for sentence in sentences:
                    sent_tokens = count_tokens(sentence)
                    if current_tokens + sent_tokens > max_tokens and current_chunk:
                        chunks.append({
                            'text': ' '.join(current_chunk),
                            'metadata': {'header': header, 'type': 'prose'},
                        })
                        # Overlap: keep last sentence
                        current_chunk = current_chunk[-1:]
                        current_tokens = count_tokens(current_chunk[0])
                    current_chunk.append(sentence)
                    current_tokens += sent_tokens
                if current_chunk:
                    chunks.append({
                        'text': ' '.join(current_chunk),
                        'metadata': {'header': header, 'type': 'prose'},
                    })
    return chunks
```

Retrieval: Hybrid is King
Pure vector similarity search has a recall ceiling. In practice, combining vector search with keyword search (BM25) consistently outperforms either alone.
| Method | Recall@10 | Precision@10 |
|---|---|---|
| Vector only (ada-002) | 72% | 68% |
| BM25 only | 65% | 71% |
| Hybrid (RRF fusion) | 84% | 79% |
| Hybrid + reranker | 89% | 85% |
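The RRF fusion behind the hybrid rows merges the two ranked lists without any score calibration between BM25 and vector search: each document scores the sum of 1/(k + rank) across the rankers that returned it. A minimal sketch (k=60 is the constant from the original RRF paper by Cormack et al.; `rrf_fuse` is an illustrative name):

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists of doc ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank + 1 makes positions 1-based, as in the RRF formula
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ['d3', 'd1', 'd7']  # ranked by cosine similarity
bm25_hits = ['d1', 'd9', 'd3']    # ranked by BM25 score
fused = rrf_fuse([vector_hits, bm25_hits])
# d1 and d3 appear in both lists, so they rise: ['d1', 'd3', 'd9', 'd7']
```

Because only ranks matter, neither retriever's raw scores need to be normalized, which is exactly why RRF is the default fusion choice in hybrid setups.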
The Reranking Step
After retrieval, run a cross-encoder reranker. It's more expensive than bi-encoder retrieval, but dramatically improves precision:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, chunks, top_k=5):
    # Score each (query, chunk) pair jointly, then keep the top_k chunks
    pairs = [(query, chunk['text']) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]
```

What I'd Do Differently
- Start with evaluation — build an eval set before optimizing anything
- Chunk by document structure, not token count
- Always use hybrid retrieval — vector + BM25 + reranker
- Cache aggressively — same query patterns repeat more than you'd think
- Monitor retrieval quality in production, not just generation quality
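On the caching point: even memoizing the embedding call alone covers much of the repeated-query traffic. A sketch using `functools.lru_cache`, where `embed_query` is a hypothetical stand-in for a real model or API call (the counter exists only to make the cache behavior visible):

```python
from functools import lru_cache

CALLS = {'n': 0}

@lru_cache(maxsize=10_000)
def embed_query(query: str):
    """Hypothetical embedding call; in production this hits a model or API."""
    CALLS['n'] += 1
    return tuple(float(len(w)) for w in query.split())  # toy vector

embed_query("how do I reset my password")
embed_query("how do I reset my password")  # served from cache
embed_query("billing question")

hits = embed_query.cache_info().hits  # two real calls, one cache hit
```

In practice you would normalize the query (lowercase, strip whitespace) before the lookup, and cache full retrieval results keyed on the normalized query as well.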
RAG is deceptively simple in concept and endlessly complex in practice. The gap between demo and production is where the real engineering lives.
Ayoush Chourasia · 2026-04-01