
AI/ML

RAG Pipeline Architecture: Lessons from Production

What actually works when you move beyond the tutorial

2026-04-01
rag · llm · architecture · production

The Tutorial RAG vs Production RAG

Every RAG tutorial looks the same:

  1. Chunk documents
  2. Embed chunks
  3. Store in vector DB
  4. Query → retrieve → generate
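The four steps above fit in a few lines, which is exactly why the tutorials stop there. A toy sketch, using a bag-of-words counter as a stand-in for real embeddings (every name here is illustrative, not a real library API):

```python
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a bag-of-words frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def naive_rag_retrieve(query, documents, k=2):
    """Steps 1-4 of the tutorial pipeline: chunk, embed, store, retrieve."""
    chunks = [c for doc in documents for c in doc.split("\n\n")]   # 1. chunk
    index = [(chunk, embed(chunk)) for chunk in chunks]            # 2-3. embed + store
    qv = embed(query)                                              # 4. query
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Swap in a real embedding model and a vector database and you have the demo everyone ships. The problems start with the inputs.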

In production, this falls apart spectacularly. Here's what I've learned building RAG systems at Intervues.

The Chunking Problem

Naive chunking (split every 500 tokens) destroys context. A code example split in half is useless. A table split across chunks loses its structure.
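To see the failure concretely, here is a deliberately naive fixed-size chunker (illustrative only) run over text that contains a small function:

```python
def naive_chunk(text, size=40):
    """Fixed-size chunking: split every `size` characters, boundaries be damned."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = "Some setup prose.\ndef add(a, b):\n    return a + b\nMore prose after."
chunks = naive_chunk(doc)
# The two-line function is sliced mid-body, so no single chunk
# contains a complete, usable code example.
```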

What works better:

graph TD
    A[Document] --> B[Semantic Chunking]
    B --> C{Chunk Type?}
    C --> D[Code Block → Keep Whole]
    C --> E[Table → Keep Whole]
    C --> F[Paragraph → Split at Sentence Boundaries]
    C --> G[Header Section → Use as Metadata]
    
    D --> H[Embed with Context Window]
    E --> H
    F --> H
    G --> I[Attach as Metadata to Child Chunks]

Semantic Chunking Strategy

def semantic_chunk(text, max_tokens=500):
    """Chunk text respecting semantic boundaries (one-sentence overlap between prose chunks)."""
    sections = split_by_headers(text)
    chunks = []
    
    for section in sections:
        header = section['header']
        content = section['content']
        
        # Keep code blocks and tables intact
        blocks = extract_special_blocks(content)
        
        for block in blocks:
            if block['type'] in ('code', 'table'):
                chunks.append({
                    'text': block['text'],
                    'metadata': {'header': header, 'type': block['type']},
                })
            else:
                # Split prose at sentence boundaries
                sentences = split_sentences(block['text'])
                current_chunk = []
                current_tokens = 0
                
                for sentence in sentences:
                    sent_tokens = count_tokens(sentence)
                    if current_tokens + sent_tokens > max_tokens and current_chunk:
                        chunks.append({
                            'text': ' '.join(current_chunk),
                            'metadata': {'header': header, 'type': 'prose'},
                        })
                        # Overlap: keep last sentence
                        current_chunk = current_chunk[-1:]
                        current_tokens = count_tokens(current_chunk[0])
                    current_chunk.append(sentence)
                    current_tokens += sent_tokens
                
                if current_chunk:
                    chunks.append({
                        'text': ' '.join(current_chunk),
                        'metadata': {'header': header, 'type': 'prose'},
                    })
    
    return chunks
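The prose-splitting loop is the subtle part of the function above. Here is the same overlap logic isolated into a minimal, runnable form, with a whitespace word count standing in for `count_tokens`:

```python
def chunk_sentences(sentences, max_tokens=10):
    """Greedy sentence packing with a one-sentence overlap between chunks."""
    chunks, current, tokens = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # whitespace token count stand-in
        if tokens + n > max_tokens and current:
            chunks.append(" ".join(current))
            current = current[-1:]              # overlap: carry last sentence forward
            tokens = len(current[0].split())
        current.append(sentence)
        tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk starts with the sentence that ended the previous one, so a retrieved chunk never opens mid-thought.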

Retrieval: Hybrid is King

Pure vector similarity search has a recall ceiling. In practice, combining vector search with keyword search (BM25) consistently outperforms either alone.

| Method                 | Recall@10 | Precision@10 |
|------------------------|-----------|--------------|
| Vector only (ada-002)  | 72%       | 68%          |
| BM25 only              | 65%       | 71%          |
| Hybrid (RRF fusion)    | 84%       | 79%          |
| Hybrid + reranker      | 89%       | 85%          |
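Reciprocal rank fusion (the "RRF fusion" row) is simple enough to implement directly. A sketch, assuming each retriever returns an ordered list of document IDs:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a vector ranking with a BM25 ranking:
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d5", "d3"]
fused = rrf_fuse([vector_hits, bm25_hits])
```

The constant `k=60` is the value from the original RRF paper; it damps the influence of top ranks so that a document appearing in both lists beats a document ranked first in only one.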

The Reranking Step

After retrieval, run a cross-encoder reranker. It's more expensive than bi-encoder retrieval, but dramatically improves precision:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, chunks, top_k=5):
    pairs = [(query, chunk['text']) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, score in ranked[:top_k]]

What I'd Do Differently

  1. Start with evaluation — build an eval set before optimizing anything
  2. Chunk by document structure, not token count
  3. Always use hybrid retrieval — vector + BM25 + reranker
  4. Cache aggressively — same query patterns repeat more than you'd think
  5. Monitor retrieval quality in production, not just generation quality
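Point 4 can be as simple as memoizing retrieval on a normalized query string. A minimal sketch, where the wrapped `retrieve` function and the TTL policy are placeholders for your own stack:

```python
import time

def make_cached_retriever(retrieve, ttl_seconds=300):
    """Wrap a retrieve(query) function with a normalized-key TTL cache."""
    cache = {}

    def cached(query):
        key = " ".join(query.lower().split())  # collapse case and whitespace
        hit = cache.get(key)
        if hit and time.monotonic() - hit[0] < ttl_seconds:
            return hit[1]
        result = retrieve(query)
        cache[key] = (time.monotonic(), result)
        return result

    return cached
```

Even this crude normalization catches the "same question, different capitalization" traffic; in production you'd likely also bound the cache size and invalidate on index updates.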

RAG is deceptively simple in concept and endlessly complex in practice. The gap between demo and production is where the real engineering lives.


Ayoush Chourasia · 2026-04-01