Agentic AI · 9 min read · Nov 20, 2024

RAG Architecture Patterns: What Works in Production

Retrieval-Augmented Generation sounds simple. Getting it to work reliably at scale requires careful attention to chunking, embeddings, reranking, and evaluation.

Pipeline stages: Ingestion → Chunking → Retrieval → Generation

The Naive RAG Pipeline

Most tutorials show you this basic flow: chunk documents → embed → store in vector DB → retrieve top-k → concatenate with prompt → generate. This works for demos. In production, it falls apart quickly.
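A minimal sketch of that flow, with vector_db, embed, and llm as generic placeholders rather than any specific SDK:

# Naive RAG: the tutorial version (vector_db, embed, llm are illustrative placeholders)
def naive_rag(question, documents, vector_db, embed, llm, chunk_size=500, k=5):
    # 1. Chunk by fixed character count (one of the problems discussed below)
    chunks = [doc[i:i + chunk_size] for doc in documents
              for i in range(0, len(doc), chunk_size)]

    # 2. Embed and store
    vector_db.add(embeddings=[embed(c) for c in chunks], texts=chunks)

    # 3. Retrieve top-k by vector similarity only
    context = vector_db.search(embed(question), k=k)

    # 4. Concatenate with the prompt and generate
    prompt = f"Answer using the context below.\n\n{context}\n\nQuestion: {question}"
    return llm(prompt)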

Common failure modes we've seen:

  • Chunks split mid-sentence – Losing context and coherence
  • Semantic similarity ≠ relevance – Related but unhelpful documents
  • No recency weighting – Outdated information ranked equally
  • Single retrieval pass – Missing documents that need query reformulation

Pattern 1: Semantic Chunking

Instead of splitting by fixed character count, chunk by semantic boundaries. This preserves meaning and context.

Chunking Strategies

  1. Sentence-based with overlap – Split on sentence boundaries, include a 2-3 sentence overlap (sketched below)
  2. Paragraph-level – Natural semantic units, works well for structured docs
  3. Hierarchical (parent-child) – Small chunks for retrieval, expand to parent for context
  4. Sliding window with merge – Overlapping windows, merge adjacent retrieved chunks
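As an illustration of strategy 1, here is a minimal sentence-based chunker with overlap. The regex splitter is a deliberate simplification; production systems usually use a proper sentence tokenizer.

# Sentence-based chunking with overlap (simplistic regex splitter, for illustration only)
import re

def sentence_chunks(text, max_sentences=5, overlap=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    step = max_sentences - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        window = sentences[start:start + max_sentences]
        if window:
            chunks.append(" ".join(window))
        if start + max_sentences >= len(sentences):
            break
    return chunks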

In practice, we often use hierarchical chunking: small chunks (200-400 tokens) for precise retrieval, but expand to the parent section (1000-2000 tokens) before passing to the LLM.
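A sketch of that parent-child expansion, assuming each small chunk stores a parent_id in its metadata and a separate doc_store holds the full sections (both names are illustrative, not a specific library):

# Hierarchical retrieval: search small chunks, return their parent sections
# (parent_id metadata and doc_store are assumptions for illustration)
def retrieve_with_parents(query_embedding, vector_db, doc_store, k=8):
    child_hits = vector_db.search(query_embedding, k=k)   # 200-400 token chunks

    # Deduplicate parents so the same section isn't sent to the LLM twice
    parent_ids = []
    for hit in child_hits:
        pid = hit.metadata["parent_id"]
        if pid not in parent_ids:
            parent_ids.append(pid)

    # Expand to the 1000-2000 token parent sections before generation
    return [doc_store.get(pid) for pid in parent_ids]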

Pattern 2: Hybrid Search

Vector search alone misses keyword-specific queries. Hybrid search combines vector similarity with traditional BM25/keyword matching.

# Hybrid retrieval with RRF (Reciprocal Rank Fusion)
vector_results = vector_db.search(query_embedding, k=20)
keyword_results = elasticsearch.search(query_text, k=20)

# Combine with RRF scoring
final_results = reciprocal_rank_fusion(
    [vector_results, keyword_results],
    k=60  # RRF constant
)

This is especially important for technical documentation where exact terms matter. "error code E2451" needs keyword match, not just semantic similarity to "error."
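The reciprocal_rank_fusion helper above isn't tied to any particular library. A minimal version, scoring each document as the sum of 1/(k + rank) across the result lists and assuming each result exposes an id, could look like this:

# Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank_of_d)
# (assumes each result object has an .id attribute; adjust to your store)
def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    docs = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
            docs[doc.id] = doc
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked]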

Pattern 3: Query Transformation

User queries are often vague or incomplete. Transform them before retrieval; a sketch of the multi-query variant follows the list:

  • Query Expansion – Generate multiple phrasings of the same question. "pricing" → "pricing", "cost", "subscription", "plans"
  • HyDE (Hypothetical Document Embeddings) – Generate a hypothetical answer and retrieve with it. Better matches for Q&A over documentation.
  • Step-Back Prompting – Ask a more general question first. "Why is X slow?" → "How does X work?"
  • Multi-Query – Break complex queries into sub-questions. Retrieve for each, merge results.
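Here is a sketch of the multi-query variant, reusing the RRF helper from Pattern 2. llm, embed, and vector_db are generic placeholders, and the decomposition prompt is just one way to phrase it:

# Multi-query retrieval: decompose, retrieve per sub-question, merge with RRF
# (llm, embed, and vector_db are generic placeholders, not a specific SDK)
def multi_query_retrieve(question, llm, embed, vector_db, k=10):
    prompt = (
        "Break the following question into 2-4 standalone sub-questions, "
        "one per line:\n" + question
    )
    sub_questions = [q.strip() for q in llm(prompt).splitlines() if q.strip()]

    # Retrieve for the original question plus each sub-question, then merge
    result_lists = [vector_db.search(embed(q), k=k) for q in [question] + sub_questions]
    return reciprocal_rank_fusion(result_lists)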

Pattern 4: Reranking

Vector similarity is a rough filter. Reranking with a cross-encoder dramatically improves precision. The two-stage approach:

from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (bi-encoder)
candidates = vector_db.search(query, k=100)  # Over-retrieve

# Stage 2: Precise reranking (cross-encoder)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
scores = reranker.predict([(query, doc.text) for doc in candidates])
top_k = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

Cross-encoders are 10-100x slower than bi-encoders, but they see the query and document together, enabling much better relevance scoring. The latency trade-off is worth it for the final ranking stage.

Pattern 5: Metadata Filtering

Don't just rely on semantic similarity. Use structured metadata to pre-filter:

  • Recency – Prefer documents updated in the last N months
  • Source authority – Official docs > community forums
  • Document type – Tutorial vs. API reference vs. troubleshooting
  • User permissions – Only retrieve what the user can access
# Metadata-aware retrieval
results = vector_db.search(
    query_embedding,
    filter={
        "updated_at": {"$gt": "2024-01-01"},
        "source": {"$in": ["official_docs", "engineering_blog"]},
        "access_level": {"$lte": user.access_level}
    },
    k=20
)

Pattern 6: Evaluation Pipeline

You can't improve what you don't measure. Set up continuous evaluation:

Metrics to Track

Retrieval Quality
  • Context Precision @ k
  • Context Recall
  • Mean Reciprocal Rank

Generation Quality
  • Answer Relevance
  • Faithfulness (grounding)
  • Answer Completeness

Build a test set of 50-100 question-answer pairs with ground truth. Run evaluations after every change to the pipeline. Tools like RAGAS, LangSmith, and Arize Phoenix can help automate this.
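For the retrieval-side metrics, a hand-rolled evaluation loop is enough to get started. This sketch assumes a test set of (question, relevant_ids) pairs and a retrieve function that returns ranked document ids; both are placeholders, and tools like RAGAS cover the generation-side metrics.

# Context precision@k and MRR over a hand-labeled test set
# (retrieve() and the test-set format are assumptions for illustration)
def evaluate_retrieval(test_set, retrieve, k=5):
    precisions, reciprocal_ranks = [], []
    for question, relevant_ids in test_set:
        retrieved = retrieve(question, k=k)  # ranked list of doc ids
        hits = [doc_id for doc_id in retrieved if doc_id in relevant_ids]
        precisions.append(len(hits) / k)

        # Reciprocal rank of the first relevant document (0 if none retrieved)
        rr = 0.0
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant_ids:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)

    return {
        "context_precision@k": sum(precisions) / len(precisions),
        "mrr": sum(reciprocal_ranks) / len(reciprocal_ranks),
    }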

Production Checklist

1. Semantic chunking with overlap
2. Hybrid search (vector + keyword)
3. Cross-encoder reranking
4. Metadata filtering and access control
5. Query transformation strategy
6. Continuous evaluation pipeline
7. Caching for repeated queries (sketched below)
8. Fallback for retrieval failures (sketched below)
9. Observability and logging
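Items 7 and 8 are easy to overlook. Here is a minimal sketch of both, using an in-memory dict as the cache and a canned fallback answer; real deployments would typically use Redis or similar and a smarter fallback.

# Simple query cache plus fallback when retrieval comes back empty
# (in-memory dict cache and canned fallback are illustrative choices)
query_cache = {}

def answer_with_cache_and_fallback(question, retrieve, generate):
    key = question.strip().lower()
    if key in query_cache:
        return query_cache[key]

    docs = retrieve(question)
    if not docs:
        # Fallback: be explicit rather than letting the LLM answer ungrounded
        return "I couldn't find relevant documentation for that. Try rephrasing, or contact support."

    answer = generate(question, docs)
    query_cache[key] = answer
    return answer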

The Bottom Line

RAG isn't a single architecture—it's a family of patterns you compose based on your data and requirements. Start simple, measure everything, and add complexity only when the metrics justify it. The goal isn't the fanciest pipeline; it's the pipeline that reliably answers your users' questions.

Building a RAG system?

We've built production RAG systems for enterprises handling millions of documents. If you're struggling with retrieval quality or scaling, let's talk.
