RAG Pipeline Architecture: Lessons from Production
What actually works when you move beyond the tutorial
The Tutorial RAG vs Production RAG
Every RAG tutorial looks the same:
- Chunk documents
- Embed chunks
- Store in vector DB
- Query → retrieve → generate
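Condensed into code, the tutorial version really is a dozen lines. A minimal sketch of that pipeline, with a toy bag-of-words "embedding" and an in-memory list standing in for a real embedding model and vector DB (`toy_embed`, `cosine`, and `retrieve` are all illustrative names, not any library's API):

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: term counts. A real system uses a trained model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Chunk -> embed -> store (here: one "chunk" per document)
docs = ["RAG retrieves context before generating",
        "BM25 is a keyword ranking function",
        "Cross-encoders rerank retrieved chunks"]
store = [(d, toy_embed(d)) for d in docs]

def retrieve(query, k=2):
    q = toy_embed(query)
    ranked = sorted(store, key=lambda p: cosine(q, p[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

# Query -> retrieve -> generate: the retrieved context would be
# prepended to the LLM prompt for the generation step
context = "\n".join(retrieve("how does BM25 ranking work?"))
```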
In production, this falls apart spectacularly. Here's what I've learned building RAG systems at Intervues.
The Chunking Problem
Naive chunking (split every 500 tokens) destroys context. A code example split in half is useless. A table split across chunks loses its structure.
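The failure is easy to reproduce. Split a markdown document every N whitespace tokens and a fenced code block gets torn across two chunks, each holding an unmatched fence (toy example; the fence string is built with `"`" * 3` only so this snippet embeds cleanly in the post):

```python
def naive_chunk(text, max_tokens=8):
    """Split every max_tokens whitespace tokens, blind to structure."""
    tokens = text.split()
    return [' '.join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

fence = "`" * 3  # a markdown code fence
doc = (f"Install the package first.\n{fence}\npip install example\n"
       f"export EXAMPLE_KEY=abc123\n{fence}\nThen run the quickstart script.")

chunks = naive_chunk(doc)
# Opening and closing fences land in different chunks: [1, 1]
fence_counts = [c.count(fence) for c in chunks]
```

Either chunk on its own is broken markdown, and the second half of the command sequence has lost the sentence that explains it.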
What works better:
```mermaid
graph TD
    A[Document] --> B[Semantic Chunking]
    B --> C{Chunk Type?}
    C --> D[Code Block → Keep Whole]
    C --> E[Table → Keep Whole]
    C --> F[Paragraph → Split at Sentence Boundaries]
    C --> G[Header Section → Use as Metadata]
    D --> H[Embed with Context Window]
    E --> H
    F --> H
    G --> I[Attach as Metadata to Child Chunks]
```

Semantic Chunking Strategy
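The `semantic_chunk` function below leans on four helpers the post doesn't define: `split_by_headers`, `extract_special_blocks`, `split_sentences`, and `count_tokens`. Hypothetical minimal versions, assuming markdown input, whitespace tokenization, and code-fence detection only (table extraction omitted for brevity), might look like:

```python
import re

FENCE = '`' * 3  # markdown code fence, spelled out so this snippet embeds cleanly

def count_tokens(text):
    # Crude whitespace proxy; swap in a real tokenizer (e.g. tiktoken) in production
    return len(text.split())

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_by_headers(text):
    # Split markdown on '#' header lines -> [{'header': ..., 'content': ...}]
    sections, header, lines = [], '', []
    for line in text.splitlines():
        if line.startswith('#'):
            if header or any(l.strip() for l in lines):
                sections.append({'header': header, 'content': '\n'.join(lines)})
            header, lines = line.lstrip('#').strip(), []
        else:
            lines.append(line)
    sections.append({'header': header, 'content': '\n'.join(lines)})
    return sections

def extract_special_blocks(content):
    # Separate fenced code blocks from prose; a fuller version would detect tables
    parts = re.split(f'({FENCE}.*?{FENCE})', content, flags=re.S)
    return [{'type': 'code' if p.startswith(FENCE) else 'prose', 'text': p}
            for p in parts if p.strip()]
```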
```python
def semantic_chunk(text, max_tokens=500, overlap=50):
    """Chunk text respecting semantic boundaries."""
    sections = split_by_headers(text)
    chunks = []
    for section in sections:
        header = section['header']
        content = section['content']
        # Keep code blocks and tables intact
        blocks = extract_special_blocks(content)
        for block in blocks:
            if block['type'] in ('code', 'table'):
                chunks.append({
                    'text': block['text'],
                    'metadata': {'header': header, 'type': block['type']},
                })
            else:
                # Split prose at sentence boundaries
                sentences = split_sentences(block['text'])
                current_chunk = []
                current_tokens = 0
                for sentence in sentences:
                    sent_tokens = count_tokens(sentence)
                    if current_tokens + sent_tokens > max_tokens and current_chunk:
                        chunks.append({
                            'text': ' '.join(current_chunk),
                            'metadata': {'header': header, 'type': 'prose'},
                        })
                        # Overlap: keep last sentence
                        current_chunk = current_chunk[-1:]
                        current_tokens = count_tokens(current_chunk[0])
                    current_chunk.append(sentence)
                    current_tokens += sent_tokens
                if current_chunk:
                    chunks.append({
                        'text': ' '.join(current_chunk),
                        'metadata': {'header': header, 'type': 'prose'},
                    })
    return chunks
```

Retrieval: Hybrid is King
Pure vector similarity search has a recall ceiling. In practice, combining vector search with keyword search (BM25) consistently outperforms either alone.
| Method | Recall@10 | Precision@10 |
|---|---|---|
| Vector only (ada-002) | 72% | 68% |
| BM25 only | 65% | 71% |
| Hybrid (RRF fusion) | 84% | 79% |
| Hybrid + reranker | 89% | 85% |
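The RRF fusion behind the hybrid rows merges the two ranked lists without any score calibration between BM25 and vector search: each document scores the sum of 1/(k + rank) across the rankers that returned it. A minimal sketch (k=60 is the constant from the original RRF paper by Cormack et al.; `rrf_fuse` is an illustrative name):

```python
def rrf_fuse(rankings, k=60):
    """Fuse multiple ranked lists of doc ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # rank + 1 makes positions 1-based, as in the RRF formula
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ['d3', 'd1', 'd7']  # ranked by cosine similarity
bm25_hits = ['d1', 'd9', 'd3']    # ranked by BM25 score
fused = rrf_fuse([vector_hits, bm25_hits])
# d1 and d3 appear in both lists, so they rise: ['d1', 'd3', 'd9', 'd7']
```

Because only ranks matter, neither retriever's raw scores need to be normalized, which is exactly why RRF is the default fusion choice in hybrid setups.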
The Reranking Step
After retrieval, run a cross-encoder reranker. It's more expensive than bi-encoder retrieval, but dramatically improves precision:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, chunks, top_k=5):
    # Score each (query, chunk) pair jointly, then keep the top_k chunks
    pairs = [(query, chunk['text']) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]
```

What I'd Do Differently
- Start with evaluation — build an eval set before optimizing anything
- Chunk by document structure, not token count
- Always use hybrid retrieval — vector + BM25 + reranker
- Cache aggressively — same query patterns repeat more than you'd think
- Monitor retrieval quality in production, not just generation quality
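On the caching point: even memoizing the embedding call alone covers much of the repeated-query traffic. A sketch using `functools.lru_cache`, where `embed_query` is a hypothetical stand-in for a real model or API call (the counter exists only to make the cache behavior visible):

```python
from functools import lru_cache

CALLS = {'n': 0}

@lru_cache(maxsize=10_000)
def embed_query(query: str):
    """Hypothetical embedding call; in production this hits a model or API."""
    CALLS['n'] += 1
    return tuple(float(len(w)) for w in query.split())  # toy vector

embed_query("how do I reset my password")
embed_query("how do I reset my password")  # served from cache
embed_query("billing question")

hits = embed_query.cache_info().hits  # two real calls, one cache hit
```

In practice you would normalize the query (lowercase, strip whitespace) before the lookup, and cache full retrieval results keyed on the normalized query as well.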
RAG is deceptively simple in concept and endlessly complex in practice. The gap between demo and production is where the real engineering lives.
Ayoush Chourasia · 2026-04-01