RAG, GRAPH RAG AND VECTOR DATABASES

Production RAG Systems with Hybrid Search

[Figure: RAG system architecture with hybrid search and vector databases]

Retrieval-Augmented Generation (RAG) has become the standard architecture for LLM applications that need accurate, up-to-date information. However, naive RAG implementations often fail in production. Here’s how to build systems that actually work.

Why Basic RAG Fails

Simple vector similarity search has critical limitations:

  • Poor performance on exact matches (product codes, names)
  • Struggles with low-frequency terms and acronyms
  • No understanding of document structure or metadata
  • Sensitive to query phrasing variations

Hybrid search solves these issues by combining dense (semantic) and sparse (keyword) retrieval.
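
As a quick illustration of the exact-match problem, a plain keyword scorer such as BM25 (here via the rank_bm25 package, an assumption about your toolchain; the documents are made up) reliably surfaces a product code that embedding similarity often blurs together:

from rank_bm25 import BM25Okapi

corpus = [
    "Order the XR-750 pump with the standard flange.",
    "Pumps should be serviced every six months.",
    "The XR-750 requires a 230V supply.",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

# The exact code "XR-750" gets strong lexical scores; a dense model may
# place all three pump sentences at similar distances.
print(bm25.get_scores("xr-750 voltage".split()))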

Architecture Overview

User Query
    ↓
Query Enhancement (expansion, rewriting)
    ↓
Parallel Retrieval
    ├→ Dense Search (embeddings)
    └→ Sparse Search (BM25)
    ↓
Result Fusion (RRF)
    ↓
Reranking (cross-encoder)
    ↓
Context Construction
    ↓
LLM Generation
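
Each stage is built out in the sections below; as a roadmap, here is a minimal orchestration sketch wiring together the functions defined later (llm_generate stands in for whatever generation client you use):

def answer(query: str) -> str:
    # Query Enhancement
    expanded = expand_query(query)
    # Parallel dense + sparse retrieval fused with RRF
    results = hybrid_search(expanded, limit=20)
    docs = [point.payload for point in results.points]
    # Cross-encoder reranking
    top_docs = rerank(query, docs, top_k=5)
    # Context Construction + LLM Generation
    context = build_context(query, top_docs)
    return llm_generate(context, query)  # placeholder for your LLM call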

Vector Database Selection

Database    Best For                        Hybrid Search
Qdrant      High performance, Rust-based    Native
Weaviate    Rich features, GraphQL API      Native
Typesense   Typo tolerance, faceting        Excellent
Milvus      Massive scale (>1B vectors)     Via plugin
pgvector    PostgreSQL integration          Manual

For most applications, Qdrant or Weaviate provide the best balance of features and performance.

Implementing Dense Search

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

# Create collection
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(
            size=1024,  # e.g., OpenAI text-embedding-3-large
            distance=Distance.COSINE
        )
    }
)

# Index documents
from openai import OpenAI
openai = OpenAI()

def embed_text(text: str) -> list[float]:
    response = openai.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=1024  # truncate to match the collection's vector size
    )
    return response.data[0].embedding

# Batch insert (documents: list of dicts with "content" plus metadata)
from qdrant_client.models import PointStruct

points = []
for doc_id, doc in enumerate(documents):
    points.append(PointStruct(
        id=doc_id,
        vector={"dense": embed_text(doc["content"])},
        payload=doc
    ))

client.upsert(collection_name="documents", points=points)
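
Before layering on sparse vectors, a dense-only query against the named vector looks like this (the query text is just an example):

hits = client.query_points(
    collection_name="documents",
    query=embed_text("How do I rotate API keys?"),  # dense query embedding
    using="dense",
    with_payload=True,
    limit=5,
)
for point in hits.points:
    print(point.score, point.payload.get("content", "")[:80])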

Adding Sparse Search

import torch
from qdrant_client.models import SparseVector, SparseVectorParams

# Recreate the collection with both dense and sparse (SPLADE/BM25-style) vector configs
client.create_collection(
    collection_name="documents",
    vectors_config={
        "dense": VectorParams(size=1024, distance=Distance.COSINE)
    },
    sparse_vectors_config={
        "sparse": SparseVectorParams()
    }
)

# Use SPLADE or BM25 for sparse encoding
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("naver/splade-cocondenser-ensembledistil")
model = AutoModelForMaskedLM.from_pretrained("naver/splade-cocondenser-ensembledistil")

def sparse_encode(text: str) -> SparseVector:
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**tokens)
    # SPLADE activation: log-saturated ReLU over token logits, max-pooled across the sequence
    vec = torch.max(
        torch.log1p(torch.relu(output.logits)) * tokens["attention_mask"].unsqueeze(-1),
        dim=1
    ).values.squeeze(0)
    indices = vec.nonzero().flatten().tolist()
    return SparseVector(indices=indices, values=vec[indices].tolist())
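
With both encoders in place, documents carry their dense and sparse representations side by side; a minimal indexing sketch reusing embed_text and sparse_encode from above:

from qdrant_client.models import PointStruct

points = [
    PointStruct(
        id=doc_id,
        vector={
            "dense": embed_text(doc["content"]),
            "sparse": sparse_encode(doc["content"]),
        },
        payload=doc,
    )
    for doc_id, doc in enumerate(documents)
]
client.upsert(collection_name="documents", points=points)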

Hybrid Search Query

from qdrant_client.models import Prefetch, FusionQuery, Fusion

def hybrid_search(query: str, limit: int = 10):
    dense_vec = embed_text(query)
    sparse_vec = sparse_encode(query)

    results = client.query_points(
        collection_name="documents",
        prefetch=[
            # Dense search
            Prefetch(
                query=dense_vec,
                using="dense",
                limit=20  # over-fetch
            ),
            # Sparse search
            Prefetch(
                query=sparse_vec,
                using="sparse",
                limit=20
            )
        ],
        query=FusionQuery(fusion=Fusion.RRF),  # Reciprocal Rank Fusion across both result lists
        with_payload=True,  # return stored documents alongside scores
        limit=limit
    )

    return results

Reranking for Precision

After retrieval, rerank with a cross-encoder for maximum relevance:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list, top_k: int = 5):
    pairs = [[query, doc["content"]] for doc in documents]
    scores = reranker.predict(pairs)

    # Sort by score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]

# Full pipeline
results = hybrid_search(query, limit=20)
docs = [point.payload for point in results.points]  # payloads hold the original documents
final_docs = rerank(query, docs, top_k=5)

Query Enhancement Techniques

1. Query Expansion:

def expand_query(query: str) -> str:
    prompt = f"""Generate 2 alternative phrasings of this query:

Query: {query}

Alternative phrasings:
1."""

    # llm.generate and parse_expansions are placeholders for your LLM client
    response = llm.generate(prompt)
    expansions = parse_expansions(response)
    return query + " " + " ".join(expansions)

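A concrete version of the same expansion step, reusing the OpenAI client instantiated earlier (the model id is an assumption; any capable chat model works):

def expand_query_openai(query: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in whatever you use
        messages=[{
            "role": "user",
            "content": f"Generate 2 alternative phrasings of this query, "
                       f"one per line, no numbering:\n\n{query}",
        }],
    )
    expansions = response.choices[0].message.content.strip().splitlines()
    # Append the expansions so both original and alternative terms are searchable
    return query + " " + " ".join(e.strip() for e in expansions if e.strip())
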
2. Query Decomposition:

def decompose_query(complex_query: str) -> list[str]:
    """Break complex queries into sub-queries"""
    prompt = f"""Break this complex question into 2-3 simpler sub-questions:

Question: {complex_query}

Sub-questions:
1."""

    # Placeholder LLM call; drop blank lines and stray numbering from the response
    raw = llm.generate(prompt)
    sub_queries = [q.strip() for q in raw.split("\n") if q.strip()]
    return sub_queries

# Retrieve for each sub-query and combine
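
A sketch of that retrieve-and-combine step, pooling per-sub-query results with a hand-rolled reciprocal rank fusion and deduplicating by point id:

def multi_query_search(complex_query: str, limit: int = 10):
    fused: dict[int, float] = {}
    payloads: dict[int, dict] = {}
    for sub_query in decompose_query(complex_query):
        results = hybrid_search(sub_query, limit=limit)
        for rank, point in enumerate(results.points):
            # Reciprocal rank fusion across sub-queries
            fused[point.id] = fused.get(point.id, 0.0) + 1.0 / (60 + rank)
            payloads[point.id] = point.payload
    ranked_ids = sorted(fused, key=fused.get, reverse=True)[:limit]
    return [payloads[pid] for pid in ranked_ids]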

Chunking Strategies

Chunk size dramatically impacts retrieval quality:

Strategy    Chunk Size     Use Case
Fixed       512 tokens     Simple, fast
Sentence    Variable       Preserves meaning
Semantic    Variable       Topic coherence
Recursive   Hierarchical   Long documents

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,     # measured in characters unless you pass a token-based length_function
    chunk_overlap=50,   # maintain context across chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document)

Metadata Filtering

Combine vector search with metadata filters for precise results:

from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange

results = client.query_points(
    collection_name="documents",
    query=query_vector,  # dense query embedding, e.g. embed_text(query)
    using="dense",
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="technical")
            ),
            FieldCondition(
                key="date",
                range=DatetimeRange(gte="2025-01-01T00:00:00Z")
            )
        ]
    ),
    limit=10
)
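
Filterable fields also need payload indexes to stay fast at scale; a minimal sketch matching the fields used in the filter above (schema names follow Qdrant's keyword and datetime index types):

client.create_payload_index(
    collection_name="documents",
    field_name="category",
    field_schema="keyword",   # exact-match filterable string
)
client.create_payload_index(
    collection_name="documents",
    field_name="date",
    field_schema="datetime",  # enables range filters on dates
)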

Context Construction

Optimize how you present retrieved chunks to the LLM:

def build_context(query: str, docs: list) -> str:
    context_parts = []

    for i, doc in enumerate(docs, 1):
        # Include metadata for provenance
        context_parts.append(f"""
Document {i} (Source: {doc["source"]}, Date: {doc["date"]}):
{doc["content"]}
---
""")

    return "n".join(context_parts)

prompt = f"""Use the following documents to answer the question.
If the answer is not in the documents, say so.

{context}

Question: {query}
Answer:"""

Caching for Performance

import hashlib
import json

import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)

def cached_search(query: str, ttl: int = 3600):
    # Cache key from query hash
    cache_key = f"rag:{hashlib.sha256(query.encode()).hexdigest()}"

    # Check cache
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Perform search and keep only JSON-serializable payloads
    results = [point.payload for point in hybrid_search(query).points]

    # Cache results
    redis_client.setex(cache_key, ttl, json.dumps(results))

    return results

Monitoring and Evaluation

Track key metrics in production:

  • Retrieval metrics: Recall@k, MRR, nDCG (see the sketch after this list)
  • Generation metrics: Faithfulness, answer relevance
  • System metrics: Latency p95, cache hit rate
  • User metrics: Thumbs up/down, follow-up questions
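
For the retrieval metrics, a minimal offline evaluation sketch, assuming you have a labeled set of queries with known relevant document ids:

def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    # Fraction of known-relevant documents that appear in the top k results
    return len(set(retrieved_ids[:k]) & relevant_ids) / max(len(relevant_ids), 1)

def mrr(retrieved_ids: list, relevant_ids: set) -> float:
    # Reciprocal rank of the first relevant hit
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: average over a labeled evaluation set
# eval_set = [("how do I rotate keys?", {3, 17}), ...]
# scores = [recall_at_k([p.id for p in hybrid_search(q).points], rel, k=10)
#           for q, rel in eval_set]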

Cost Optimization

# Rough costs for 1M queries/month

# Embeddings (text-embedding-3-large, $0.13/1M tokens, assuming ~100 tokens/query)
1M queries × 100 tokens × $0.13/1M tokens ≈ $13

# Vector DB (Qdrant Cloud)
Standard tier: $99/month

# LLM (Claude 3.5 Sonnet, input tokens at $3/1M; output tokens add more at $15/1M)
1M queries × 1K input tokens × $3/1M tokens = $3,000

Total: ~$3,100/month
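
The same arithmetic as a small calculator, with the token counts and prices above as the (adjustable) assumptions:

def monthly_cost(queries: int = 1_000_000,
                 query_tokens: int = 100,
                 context_tokens: int = 1_000,
                 embed_price: float = 0.13e-6,   # $ per embedding token
                 llm_input_price: float = 3e-6,  # $ per LLM input token
                 vector_db: float = 99.0) -> float:
    embeddings = queries * query_tokens * embed_price
    generation = queries * context_tokens * llm_input_price
    return embeddings + generation + vector_db

print(f"${monthly_cost():,.0f}/month")  # ≈ $3,112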

The LLM is by far the largest cost. Optimize by caching responses, routing to smaller models where quality allows, and keeping prompts and retrieved context as tight as possible.