RAG Architecture Patterns: What Works in Production
Retrieval-Augmented Generation sounds simple. Getting it to work reliably at scale requires careful attention to chunking, embeddings, reranking, and evaluation.
The Naive RAG Pipeline
Most tutorials show you this basic flow: chunk documents → embed → store in vector DB → retrieve top-k → concatenate with prompt → generate. This works for demos. In production, it falls apart quickly.
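For reference, here's roughly what that demo-level pipeline looks like as code. This is a minimal sketch, not a specific framework: embed_fn, generate_fn, and store are hypothetical stand-ins for your embedding model, LLM call, and vector-store client (assumed to have add/search methods).

# Naive RAG: chunk -> embed -> store -> retrieve top-k -> stuff prompt -> generate.
# embed_fn, store, and generate_fn are hypothetical stand-ins, not a real API.
def chunk_fixed(text, size=500):
    # Fixed character windows -- exactly the splitting that breaks sentences
    return [text[i:i + size] for i in range(0, len(text), size)]

def ingest(documents, embed_fn, store):
    for doc in documents:
        for piece in chunk_fixed(doc):
            store.add(vector=embed_fn(piece), payload={"text": piece})

def answer(question, embed_fn, store, generate_fn, k=5):
    hits = store.search(vector=embed_fn(question), k=k)  # single retrieval pass
    context = "\n\n".join(hit.payload["text"] for hit in hits)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
    return generate_fn(prompt)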
Common failure modes we've seen:
- Chunks split mid-sentence – Losing context and coherence
- Semantic similarity ≠ relevance – Related but unhelpful documents
- No recency weighting – Outdated information ranked equally
- Single retrieval pass – Missing documents that need query reformulation
Pattern 1: Semantic Chunking
Instead of splitting by fixed character count, chunk by semantic boundaries. This preserves meaning and context.
Chunking Strategies
- Sentence-based – Split on sentence boundaries, include a 2-3 sentence overlap
- Section-based – Natural semantic units; works well for structured docs
- Parent-document – Small chunks for retrieval, expand to the parent for context
- Sliding window – Overlapping windows; merge adjacent retrieved chunks
In practice, we often use hierarchical chunking: small chunks (200-400 tokens) for precise retrieval, but expand to the parent section (1000-2000 tokens) before passing to the LLM.
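A minimal sketch of that parent-child scheme, in plain Python with a deliberately crude sentence splitter (no specific framework assumed; the sections input would come from your own document parser):

# Hierarchical chunking: index small child chunks for precise retrieval,
# but hand the larger parent section to the LLM.
def build_parent_child_chunks(sections):
    # sections: list of (section_id, section_text) parent units (~1000-2000 tokens)
    children = []   # what gets embedded and searched
    parents = {}    # what gets sent to the LLM
    for section_id, section_text in sections:
        parents[section_id] = section_text
        sentences = section_text.split(". ")   # crude sentence splitter
        window, overlap = 4, 1                 # sentences per child chunk
        for i in range(0, len(sentences), window - overlap):
            child_text = ". ".join(sentences[i:i + window])
            if child_text:
                children.append({"text": child_text, "parent_id": section_id})
    return children, parents

def expand_to_parents(retrieved_children, parents):
    # Deduplicate parents of the retrieved child chunks, preserving rank order
    seen, expanded = set(), []
    for child in retrieved_children:
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            expanded.append(parents[pid])
    return expanded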
Pattern 2: Hybrid Search
Vector search alone misses keyword-specific queries. Hybrid search combines vector similarity with traditional BM25/keyword matching.
# Hybrid retrieval with RRF (Reciprocal Rank Fusion)
vector_results = vector_db.search(query_embedding, k=20)
keyword_results = elasticsearch.search(query_text, k=20)

# Combine with RRF scoring
final_results = reciprocal_rank_fusion(
    [vector_results, keyword_results],
    k=60,  # RRF constant
)

This is especially important for technical documentation where exact terms matter. "error code E2451" needs keyword match, not just semantic similarity to "error."
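The reciprocal_rank_fusion call above isn't from any particular library; a minimal implementation, assuming each result object exposes a stable id attribute, looks like this:

# Reciprocal Rank Fusion: score each document by the sum of 1 / (k + rank)
# across all result lists, then sort by fused score.
def reciprocal_rank_fusion(result_lists, k=60):
    scores, docs = {}, {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]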
Pattern 3: Query Transformation
User queries are often vague or incomplete. Transform them before retrieval:
- Query Expansion – Generate multiple phrasings of the same question ("pricing" → "pricing", "cost", "subscription", "plans")
- HyDE (Hypothetical Document Embeddings) – Generate a hypothetical answer and use it for retrieval; better matches for Q&A over documentation
- Step-Back Prompting – Ask a more general question first ("Why is X slow?" → "How does X work?")
- Multi-Query – Break complex queries into sub-questions, retrieve for each, and merge the results
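As a concrete example of the multi-query approach, here's a hedged sketch that reuses the RRF helper from Pattern 2; generate_subquestions and retrieve are hypothetical stand-ins for your LLM decomposition prompt and your retrieval call.

# Multi-query retrieval: decompose the question, retrieve per sub-question,
# and merge the ranked lists with RRF.
def multi_query_retrieve(question, generate_subquestions, retrieve, k=20):
    subquestions = generate_subquestions(question)   # e.g. 3-5 focused questions
    if question not in subquestions:
        subquestions.append(question)                # always keep the original
    result_lists = [retrieve(q, k=k) for q in subquestions]
    return reciprocal_rank_fusion(result_lists)      # defined in Pattern 2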
Pattern 4: Reranking
Vector similarity is a rough filter. Reranking with a cross-encoder dramatically improves precision. The two-stage approach:
# Two-stage retrieval: fast bi-encoder recall, then cross-encoder reranking
from sentence_transformers import CrossEncoder

# Stage 1: Fast retrieval (bi-encoder)
candidates = vector_db.search(query_embedding, k=100)  # Over-retrieve

# Stage 2: Precise reranking (cross-encoder)
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
scores = reranker.predict([(query_text, doc.text) for doc in candidates])
top_k = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5]

Cross-encoders are 10-100x slower than bi-encoders, but they see the query and document together, enabling much better relevance scoring. The latency trade-off is worth it for the final ranking stage.
Pattern 5: Metadata Filtering
Don't just rely on semantic similarity. Use structured metadata to pre-filter:
- Recency – Prefer documents updated in the last N months
- Source authority – Official docs > community forums
- Document type – Tutorial vs. API reference vs. troubleshooting
- User permissions – Only retrieve what the user can access
# Metadata-aware retrieval
results = vector_db.search(
    query_embedding,
    filter={
        "updated_at": {"$gt": "2024-01-01"},
        "source": {"$in": ["official_docs", "engineering_blog"]},
        "access_level": {"$lte": user.access_level}
    },
    k=20
)

Pattern 6: Evaluation Pipeline
You can't improve what you don't measure. Set up continuous evaluation:
Metrics to Track
- Context Precision @ k
- Context Recall
- Mean Reciprocal Rank
- Answer Relevance
- Faithfulness (grounding)
- Answer Completeness
Build a test set of 50-100 question-answer pairs with ground truth. Run evaluations after every change to the pipeline. Tools like RAGAS, LangSmith, and Arize Phoenix can help automate this.
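If you want to start lighter than a full framework, the retrieval-side metrics are easy to compute yourself. A sketch, assuming each test case records the ids of the chunks a correct answer needs:

# Retrieval metrics over a hand-labelled test set. relevant_ids is the set of
# chunk ids a correct answer needs; retrieved_ids is the ranked retrieval output.
def context_precision_at_k(retrieved_ids, relevant_ids, k=5):
    top = retrieved_ids[:k]
    return sum(1 for rid in top if rid in relevant_ids) / max(len(top), 1)

def context_recall(retrieved_ids, relevant_ids):
    if not relevant_ids:
        return 1.0
    return sum(1 for rid in relevant_ids if rid in retrieved_ids) / len(relevant_ids)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    total = 0.0
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        for rank, rid in enumerate(retrieved_ids, start=1):
            if rid in relevant_ids:
                total += 1.0 / rank
                break
    return total / max(len(all_retrieved), 1)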
The Bottom Line
RAG isn't a single architecture—it's a family of patterns you compose based on your data and requirements. Start simple, measure everything, and add complexity only when the metrics justify it. The goal isn't the fanciest pipeline; it's the pipeline that reliably answers your users' questions.
Building a RAG system?
We've built production RAG systems for enterprises handling millions of documents. If you're struggling with retrieval quality or scaling, let's talk.