
Retrieval-Augmented Generation (RAG) has become the standard architecture for LLM applications that need accurate, up-to-date information. However, naive RAG implementations often fail in production. Here’s how to build systems that actually work.
Why Basic RAG Fails
Simple vector similarity search has critical limitations:
- Poor performance on exact matches (product codes, names)
- Struggles with low-frequency terms and acronyms
- No understanding of document structure or metadata
- Sensitive to query phrasing variations
Hybrid search solves these issues by combining dense (semantic) and sparse (keyword) retrieval.
Architecture Overview
User Query
↓
Query Enhancement (expansion, rewriting)
↓
Parallel Retrieval
├→ Dense Search (embeddings)
└→ Sparse Search (BM25)
↓
Result Fusion (RRF)
↓
Reranking (cross-encoder)
↓
Context Construction
↓
LLM Generation
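Qdrant can perform the fusion step natively (shown later), but for intuition, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of point IDs; the constant k=60 is just the commonly used default, not something mandated by the pipeline.

def reciprocal_rank_fusion(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Fuse ranked lists of point IDs; higher combined score ranks first."""
    scores: dict[int, float] = {}
    for ranking in ranked_lists:
        for rank, point_id in enumerate(ranking, start=1):
            scores[point_id] = scores.get(point_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([dense_ids, sparse_ids])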
Vector Database Selection
| Database | Best For | Hybrid Search |
|---|---|---|
| Qdrant | High performance, Rust-based | Native |
| Weaviate | Rich features, GraphQL API | Native |
| Typesense | Typo tolerance, faceting | Excellent |
| Milvus | Massive scale (>1B vectors) | Via plugin |
| pgvector | PostgreSQL integration | Manual |
For most applications, Qdrant or Weaviate provide the best balance of features and performance.
Implementing Dense Search
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams
client = QdrantClient(url="http://localhost:6333")
# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(
            size=1024,  # text-embedding-3-large, truncated to 1024 dims (see embed_text below)
            distance=Distance.COSINE
        )
    }
)
# Index documents
from openai import OpenAI
openai = OpenAI()
def embed_text(text: str) -> list[float]:
    response = openai.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1024  # match the collection's vector size (the model's default is 3072)
    )
    return response.data[0].embedding
# Batch insert
from qdrant_client.models import PointStruct

points = []
for doc_id, doc in enumerate(documents):
    points.append(PointStruct(
        id=doc_id,
        vector={"dense": embed_text(doc["content"])},
        payload=doc
    ))
client.upsert(collection_name="documents", points=points)
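A quick sanity check of dense-only retrieval against the named vector (the query text here is just an illustrative placeholder):

hits = client.query_points(
    collection_name="documents",
    query=embed_text("how do I reset my device?"),  # placeholder query
    using="dense",
    limit=5,
).points
for hit in hits:
    print(hit.score, hit.payload["content"][:80])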
Adding Sparse Search
from qdrant_client.models import SparseVector, SparseVectorParams

# Recreate the collection with both dense and sparse named vectors
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()
    }
)
# Use SPLADE or BM25 for sparse encoding
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")
model = AutoModelForMaskedLM.from_pretrained("naver/splade-cocondenser-ensembledistil")

def sparse_encode(text: str) -> SparseVector:
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
    # SPLADE: max-pool log(1 + ReLU(logits)) over the sequence dimension
    vec = torch.max(torch.log(1 + torch.relu(output.logits)), dim=1).values.squeeze(0)
    return SparseVector(indices=vec.nonzero().flatten().tolist(),
                        values=vec[vec > 0].tolist())
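With both encoders available, each document can now be indexed under the two named vectors at once; a short sketch reusing the documents list, embed_text, and PointStruct from above:

points = [
    PointStruct(
        id=doc_id,
        vector={
            "dense": embed_text(doc["content"]),
            "sparse": sparse_encode(doc["content"]),
        },
        payload=doc,
    )
    for doc_id, doc in enumerate(documents)
]
client.upsert(collection_name="documents", points=points)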
Hybrid Search Query
from qdrant_client.models import Prefetch, FusionQuery, Fusion

def hybrid_search(query: str, limit: int = 10):
    dense_vec = embed_text(query)
    sparse_vec = sparse_encode(query)
    results = client.query_points(
        collection_name="documents",
        prefetch=[
            # Dense search
            Prefetch(
                query=dense_vec,
                using="dense",
                limit=20  # over-fetch
            ),
            # Sparse search
            Prefetch(
                query=sparse_vec,
                using="sparse",
                limit=20
            )
        ],
        query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion
        limit=limit
    )
    return results.points  # list of ScoredPoint (id, score, payload)
Reranking for Precision
After retrieval, rerank with a cross-encoder for maximum relevance:
from sentence_transformers import CrossEncoder
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def rerank(query: str, documents: list, top_k: int = 5):
    pairs = [[query, doc["content"]] for doc in documents]
    scores = reranker.predict(pairs)
    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Full pipeline: over-fetch with hybrid search, then rerank the payloads
results = hybrid_search(query, limit=20)
final_docs = rerank(query, [point.payload for point in results], top_k=5)
Query Enhancement Techniques
1. Query Expansion:
def expand_query(query: str) -> str:
    prompt = f"""Generate 2 alternative phrasings of this query:
Query: {query}
Alternative phrasings:
1."""
    # `llm` is a placeholder for your text-generation client;
    # parse_expansions splits the numbered list it returns
    response = llm.generate(prompt)
    expansions = parse_expansions(response)
    return query + " " + " ".join(expansions)
2. Query Decomposition:
def decompose_query(complex_query: str) -> list[str]:
    """Break complex queries into sub-queries."""
    prompt = f"""Break this complex question into 2-3 simpler sub-questions:
Question: {complex_query}
Sub-questions:
1."""
    # `llm` is the same placeholder client; one sub-question per line
    response = llm.generate(prompt)
    sub_queries = [line.lstrip("123. ").strip() for line in response.strip().split("\n") if line.strip()]
    return sub_queries
# Retrieve for each sub-query and combine
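A minimal sketch of that retrieve-and-combine step, reusing hybrid_search and rerank from above; deduplicating by point ID assumes each chunk keeps a stable ID.

def multi_query_search(complex_query: str, top_k: int = 5):
    seen, candidates = set(), []
    for sub_query in decompose_query(complex_query):
        for point in hybrid_search(sub_query, limit=10):
            if point.id not in seen:  # drop duplicates retrieved by several sub-queries
                seen.add(point.id)
                candidates.append(point.payload)
    # Rerank the combined pool against the original question
    return rerank(complex_query, candidates, top_k=top_k)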
Chunking Strategies
Chunk size dramatically impacts retrieval quality:
| Strategy | Chunk Size | Use Case |
|---|---|---|
| Fixed | 512 tokens | Simple, fast |
| Sentence | Variable | Preserves meaning |
| Semantic | Variable | Topic coherence |
| Recursive | Hierarchical | Long documents |
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,  # maintain context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document)
Metadata Filtering
Combine vector search with metadata filters for precise results:
from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange

results = client.query_points(
    collection_name="documents",
    query=embed_text(query),
    using="dense",
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="technical")
            ),
            FieldCondition(
                key="date",
                range=DatetimeRange(gte="2025-01-01T00:00:00Z")
            )
        ]
    ),
    limit=10
)
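For these filters to stay fast as the collection grows, create payload indexes on the filtered fields; a sketch assuming category is a keyword field and date is stored in an indexable datetime format:

from qdrant_client.models import PayloadSchemaType

client.create_payload_index(
    collection_name="documents",
    field_name="category",
    field_schema=PayloadSchemaType.KEYWORD,
)
client.create_payload_index(
    collection_name="documents",
    field_name="date",
    field_schema=PayloadSchemaType.DATETIME,
)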
Context Construction
Optimize how you present retrieved chunks to the LLM:
def build_context(query: str, docs: list) -> str:
    context_parts = []
    for i, doc in enumerate(docs, 1):
        # Include metadata for provenance
        context_parts.append(f"""
Document {i} (Source: {doc['source']}, Date: {doc['date']}):
{doc['content']}
---
""")
    return "\n".join(context_parts)

context = build_context(query, final_docs)
prompt = f"""Use the following documents to answer the question.
If the answer is not in the documents, say so.

{context}

Question: {query}
Answer:"""
Caching for Performance
import json
import hashlib

import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def cached_search(query: str, ttl: int = 3600):
    # Cache key from query hash
    cache_key = f"rag:{hashlib.sha256(query.encode()).hexdigest()}"
    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)
    # Perform search; cache only the JSON-serializable payloads
    payloads = [point.payload for point in hybrid_search(query)]
    redis_client.setex(cache_key, ttl, json.dumps(payloads))
    return payloads
Monitoring and Evaluation
Track key metrics in production:
- Retrieval metrics: Recall@k, MRR, nDCG (see the sketch after this list)
- Generation metrics: Faithfulness, answer relevance
- System metrics: Latency p95, cache hit rate
- User metrics: Thumbs up/down, follow-up questions
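To make the retrieval metrics concrete, here is a minimal sketch of Recall@k and MRR; labeled_queries, a mapping from query text to the set of relevant point IDs, is an assumed evaluation fixture you maintain yourself.

def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int = 10) -> float:
    return len(set(retrieved_ids[:k]) & relevant_ids) / max(len(relevant_ids), 1)

def mrr(retrieved_ids: list, relevant_ids: set) -> float:
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# labeled_queries = {"query text": {relevant_point_id, ...}, ...}  (assumed fixture)
recalls = [
    recall_at_k([p.id for p in hybrid_search(q, limit=10)], relevant, k=10)
    for q, relevant in labeled_queries.items()
]
print(f"Recall@10: {sum(recalls) / len(recalls):.3f}")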
Cost Optimization
# Costs for 1M queries/month

# Embeddings (text-embedding-3-large, assuming ~20-token queries)
1M queries × ~20 tokens × $0.13/1M tokens ≈ $2.60

# Vector DB (Qdrant Cloud)
Standard tier: $99/month

# LLM (Claude 3.5 Sonnet, input tokens only)
1M queries × 1K tokens × $3/1M = $3,000

Total: ~$3,100/month
The LLM is by far the largest cost. Optimize by caching, using smaller models where they suffice, and keeping prompts lean.
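A tiny cost model makes it easy to see how the cache hit rate and prompt size move the total; the defaults simply restate the assumptions above, and the ~20-token average query length is an assumption.

def monthly_cost(queries: int = 1_000_000,
                 prompt_tokens: int = 1_000,
                 cache_hit_rate: float = 0.0,
                 llm_price_per_1m: float = 3.00,    # Claude 3.5 Sonnet input tokens
                 embed_price_per_1m: float = 0.13,  # text-embedding-3-large
                 query_tokens: int = 20,            # assumed average query length
                 vector_db: float = 99.0) -> float:
    llm = queries * (1 - cache_hit_rate) * prompt_tokens / 1e6 * llm_price_per_1m
    embeddings = queries * query_tokens / 1e6 * embed_price_per_1m
    return llm + embeddings + vector_db

print(f"${monthly_cost():,.0f}/month")                    # ≈ $3,100
print(f"${monthly_cost(cache_hit_rate=0.3):,.0f}/month")  # with a 30% cache hit rate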