RAG Architecture Patterns: What Works in Production
Retrieval-Augmented Generation sounds simple. Getting it to work reliably at scale requires careful attention to chunking, embeddings, reranking, and evaluation.
The Naive RAG Pipeline
Most tutorials show you this basic flow: chunk documents → embed → store in vector DB → retrieve top-k → concatenate with prompt → generate. This works for demos. In production, it falls apart quickly.
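For reference, here's roughly what that demo-level pipeline looks like as code. This is a minimal sketch, not a specific framework: embed_fn, generate_fn, and store are hypothetical stand-ins for your embedding model, LLM call, and vector-store client (assumed to have add/search methods).

# Naive RAG: chunk -> embed -> store -> retrieve top-k -> stuff prompt -> generate.
# embed_fn, store, and generate_fn are hypothetical stand-ins, not a real API.
def chunk_fixed(text, size=500):
    # Fixed character windows -- exactly the splitting that breaks sentences
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(documents, embed_fn, store):
    for doc in documents:
        for piece in chunk_fixed(doc):
            store.add(vector=embed_fn(piece), payload={"text": piece})

def answer(question, embed_fn, store, generate_fn, k=5):
    hits = store.search(vector=embed_fn(question), k=k)  # single retrieval pass
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return generate_fn(prompt)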
Common failure modes we've seen:
- Chunks split mid-sentence – Losing context and coherence
- Semantic similarity ≠ relevance – Related but unhelpful documents
- No recency weighting – Outdated information ranked equally
- Single retrieval pass – Missing documents that need query reformulation
Pattern 1: Semantic Chunking
Instead of splitting by fixed character count, chunk by semantic boundaries. This preserves meaning and context.
Chunking Strategies
- Sentence-based – Split on sentence boundaries, include a 2-3 sentence overlap
- Section-based – Natural semantic units; works well for structured docs
- Parent-document – Small chunks for retrieval, expand to the parent for context
- Sliding window – Overlapping windows; merge adjacent retrieved chunks
In practice, we often use hierarchical chunking: small chunks (200-400 tokens) for precise retrieval, but expand to the parent section (1000-2000 tokens) before passing to the LLM.
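A minimal sketch of that parent-child scheme, in plain Python with a deliberately crude sentence splitter (no specific framework assumed; the sections input would come from your own document parser):

# Hierarchical chunking: index small child chunks for precise retrieval,
# but hand the larger parent section to the LLM.
def build_parent_child_chunks(sections):
    # sections: list of (section_id, section_text) parent units (~1000-2000 tokens)
    children = []   # what gets embedded and searched
    parents = {}    # what gets sent to the LLM
    for section_id, section_text in sections:
        parents[section_id] = section_text
        sentences = section_text.split(". ")   # crude sentence splitter
        window, overlap = 4, 1                 # sentences per child chunk
        for i in range(0, len(sentences), window - overlap):
            child_text = ". ".join(sentences[i:i + window])
            if child_text:
                children.append({"text": child_text, "parent_id": section_id})
    return children, parents

def expand_to_parents(retrieved_children, parents):
    # Deduplicate parents of the retrieved child chunks, preserving rank order
    seen, expanded = set(), []
    for child in retrieved_children:
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            expanded.append(parents[pid])
    return expanded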
Pattern 2: Hybrid Search
Vector search alone misses keyword-specific queries. Hybrid search combines vector similarity with traditional BM25/keyword matching.
# Hybrid retrieval with RRF (Reciprocal Rank Fusion)
vector_results = vector_db.search(query_embedding, k=20)
keyword_results = elasticsearch.search(query_text, k=20)

# Combine with RRF scoring
final_results = reciprocal_rank_fusion(
    [vector_results, keyword_results],
    k=60,  # RRF constant
)

This is especially important for technical documentation where exact terms matter. "error code E2451" needs keyword match, not just semantic similarity to "error."
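The reciprocal_rank_fusion call above isn't from any particular library; a minimal implementation, assuming each result object exposes a stable id attribute, looks like this:

# Reciprocal Rank Fusion: score each document by the sum of 1 / (k + rank)
# across all result lists, then sort by fused score.
def reciprocal_rank_fusion(result_lists, k=60):
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]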
Pattern 3: Query Transformation
User queries are often vague or incomplete. Transform them before retrieval:
- Query Expansion – Generate multiple phrasings of the same question ("pricing" → "pricing", "cost", "subscription", "plans")
- HyDE (Hypothetical Document Embeddings) – Generate a hypothetical answer and use it for retrieval; better matches for Q&A over documentation
- Step-Back Prompting – Ask a more general question first ("Why is X slow?" → "How does X work?")
- Multi-Query – Break complex queries into sub-questions, retrieve for each, and merge the results
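As a concrete example of the multi-query approach, here's a hedged sketch that reuses the RRF helper from Pattern 2; generate_subquestions and retrieve are hypothetical stand-ins for your LLM decomposition prompt and your retrieval call.

# Multi-query retrieval: decompose the question, retrieve per sub-question,
# and merge the ranked lists with RRF.
def multi_query_retrieve(question, generate_subquestions, retrieve, k=20):
    subquestions = generate_subquestions(question)   # e.g. 3-5 focused questions
    if question not in subquestions:
        subquestions.append(question)                # always keep the original
    result_lists = [retrieve(q, k=k) for q in subquestions]
    return reciprocal_rank_fusion(result_lists)      # defined in Pattern 2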
Pattern 4: Reranking
Vector similarity is a rough filter. Reranking with a cross-encoder dramatically improves precision. The two-stage approach:
# Two-stage retrieval: fast bi-encoder recall, then cross-encoder reranking
from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (bi-encoder)
candidates = vector_db.search(query_embedding, k=100)  # Over-retrieve

# Stage 2: Precise reranking (cross-encoder)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
scores = reranker.predict([(query_text, doc.text) for doc in candidates])
top_k = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

Cross-encoders are 10-100x slower than bi-encoders, but they see the query and document together, enabling much better relevance scoring. The latency trade-off is worth it for the final ranking stage.
Pattern 5: Metadata Filtering
Don't just rely on semantic similarity. Use structured metadata to pre-filter:
- Recency – Prefer documents updated in the last N months
- Source authority – Official docs > community forums
- Document type – Tutorial vs. API reference vs. troubleshooting
- User permissions – Only retrieve what the user can access
# Metadata-aware retrieval
results = vector_db.search(
    query_embedding,
    filter={
        "updated_at": {"$gt": "2024-01-01"},
        "source": {"$in": ["official_docs", "engineering_blog"]},
        "access_level": {"$lte": user.access_level}
    },
    k=20
)

Pattern 6: Evaluation Pipeline
You can't improve what you don't measure. Set up continuous evaluation:
Metrics to Track
- Context Precision @ k
- Context Recall
- Mean Reciprocal Rank
- Answer Relevance
- Faithfulness (grounding)
- Answer Completeness
Build a test set of 50-100 question-answer pairs with ground truth. Run evaluations after every change to the pipeline. Tools like RAGAS, LangSmith, and Arize Phoenix can help automate this.
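If you want to start lighter than a full framework, the retrieval-side metrics are easy to compute yourself. A sketch, assuming each test case records the ids of the chunks a correct answer needs:

# Retrieval metrics over a hand-labelled test set. relevant_ids is the set of
# chunk ids a correct answer needs; retrieved_ids is the ranked retrieval output.
def context_precision_at_k(retrieved_ids, relevant_ids, k=5):
    top = retrieved_ids[:k]
    return sum(1 for rid in top if rid in relevant_ids) / max(len(top), 1)

def context_recall(retrieved_ids, relevant_ids):
    if not relevant_ids:
        return 1.0
    return sum(1 for rid in relevant_ids if rid in retrieved_ids) / len(relevant_ids)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    total = 0.0
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        for rank, rid in enumerate(retrieved_ids, start=1):
            if rid in relevant_ids:
                total += 1.0 / rank
                break
    return total / max(len(all_retrieved), 1)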
The Bottom Line
RAG isn't a single architecture—it's a family of patterns you compose based on your data and requirements. Start simple, measure everything, and add complexity only when the metrics justify it. The goal isn't the fanciest pipeline; it's the pipeline that reliably answers your users' questions.
Building a RAG system?
We've built production RAG systems for enterprises handling millions of documents. If you're struggling with retrieval quality or scaling, let's talk.