Retrieval-augmented generation (RAG) is the de facto pattern for building AI products on real documents in 2026. Instead of asking a model to answer from memory and hoping it doesn’t hallucinate a pricing table or a legal clause, you retrieve relevant source material at query time and ground the answer in that context.

Most RAG applications still struggle in production not because the LLM is weak, but because document quality, chunking, and retrieval are broken upstream. The answer quality is usually decided before the model sees the query.

This guide walks through a practical, production-minded RAG pipeline in Python. No magic chunk sizes, no one perfect database.

What A RAG Pipeline Actually Does

A RAG system has two phases: indexing (offline) and answering (real-time, per query).

Indexing phase:

  1. Load documents from PDFs, wikis, databases, markdown.
  2. Parse and clean raw text: strip navigation, footers, broken OCR, boilerplate.
  3. Split into semantically meaningful chunks.
  4. Attach metadata: source URL, title, section heading, last-updated date, access control group.
  5. Generate vector embeddings for every chunk.
  6. Store chunks and embeddings in a searchable index.

Answering phase (per query):

  1. Optionally rewrite or decompose the query.
  2. Embed the query and find nearest candidate chunks.
  3. Rerank or filter candidates for precision.
  4. Build a strict prompt with the best context.
  5. Generate a source-grounded answer.
  6. Return inspectable citations.
  7. Log the query, retrieved chunks, scores, and latency for evaluation.

If you only optimize step 5, you are optimizing the wrong thing.

Step 1: Define Questions Before Touching a Vector

Before indexing a single document, write 30 to 100 realistic user questions. Include easy factoids, vague multi-intent queries, multi-hop questions spanning two documents, and questions where the answer is “the source material doesn’t cover this.”

For each, record the ideal answer, the exact source document and section, whether the answer is time-sensitive (pricing, policy), and whether a wrong answer has real consequences.

This is your evaluation set. Without it, every pipeline change is a vibes-based debate. Measure retrieval hit rate, answer faithfulness, citation accuracy, and refusal quality against it.
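One way to keep this evaluation set machine-readable is a small record type serialized as JSON Lines. The sketch below is only an illustration: the `EvalCase` name, its fields, and the file name `return_policy.md` are assumptions chosen to mirror the attributes listed above, not a required schema.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    """One evaluation question (hypothetical schema for illustration)."""
    question: str
    ideal_answer: str           # empty when the right behaviour is a refusal
    source_doc: str             # document expected to hold the answer
    source_section: str
    time_sensitive: bool = False
    high_stakes: bool = False
    expect_refusal: bool = False

def to_jsonl(cases):
    """Serialize cases as one JSON object per line."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)

cases = [
    EvalCase("What is the return window for electronics?",
             "30 days with the original receipt.",
             "return_policy.md", "Electronics", time_sensitive=True),
    # A question the corpus does not cover: the ideal behaviour is a refusal.
    EvalCase("Do you ship to Mars?", "", "", "", expect_refusal=True),
]
jsonl = to_jsonl(cases)
```

Keeping the set in a plain file under version control makes it easy to rerun after every pipeline change.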

Step 2: Load and Clean Documents

Start with the documents users actually need. In the r/Rag community’s 2026 production survey, successful teams report one pattern: a small set of high-quality primary sources (10-20% of total docs) answers 80% of user questions. Start there.

Clean text before embedding. Remove navigation menus, duplicated headers, cookie banners, footers, and broken OCR. For PDFs, verify reading order: jumbled columns destroy retrieval.
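A minimal sketch of one cleaning heuristic, assuming you already have extracted text per page: lines that repeat on most pages are treated as headers or footers and dropped. The function name `clean_pages`, the `repeat_threshold` parameter, and the sample pages are all hypothetical, and real documents will usually need format-specific rules on top of this.

```python
import re
from collections import Counter

def clean_pages(pages, repeat_threshold=0.6):
    """Drop lines repeated on most pages and tidy whitespace."""
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    min_pages = max(2, int(len(pages) * repeat_threshold))
    boilerplate = {ln for ln, n in line_counts.items() if n >= min_pages}

    cleaned = []
    for page in pages:
        # Blank lines are kept so paragraph boundaries survive for chunking.
        kept = [ln.strip() for ln in page.splitlines() if ln.strip() not in boilerplate]
        text = "\n".join(kept)
        # Heuristic: re-join words hyphenated across a line break.
        # This also merges genuine compounds split at the hyphen.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
        text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces
        text = re.sub(r"\n{3,}", "\n\n", text)   # collapse runs of blank lines
        cleaned.append(text.strip())
    return cleaned

pages = [
    "ACME Manual\nInstall the bat-\ntery first.\nPage footer",
    "ACME Manual\nCharge for   two hours.\nPage footer",
    "ACME Manual\nDo not submerge.\nPage footer",
]
cleaned = clean_pages(pages)
```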

Loading documents with LangChain:

from langchain_community.document_loaders import TextLoader, PyPDFLoader, UnstructuredMarkdownLoader

# Each loader returns a list of Document objects (page_content plus metadata).
pdf_docs = PyPDFLoader("product_manual.pdf").load()  # one Document per page
md_docs = UnstructuredMarkdownLoader("internal_wiki.md").load()
txt_docs = TextLoader("changelog.txt").load()
all_documents = pdf_docs + md_docs + txt_docs

For every chunk, attach metadata: source URL, title, section heading, page number, last-updated timestamp, document type, and access control group. Metadata powers filtering, citation linking, freshness checks, and permission enforcement.

Step 3: Chunk by Meaning, Not Just Token Count

Chunking is where most RAG systems quietly break. Chunks under 200 tokens tend to lose context. Chunks over 1500 tokens tend to bury the answer in noise. A 2026 Firecrawl study comparing seven chunking strategies found that semantic and structure-based approaches consistently outperform naive fixed-size splitting.

Start with semantic boundaries (heading breaks, section breaks, paragraph boundaries), then apply size limits. A starting range of 500 to 1000 tokens per chunk works for most content. Add 10-15% overlap only where context genuinely spans a boundary. For code: chunk by function or class. For legal/policy: preserve clause numbers and section titles.

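from langchain_text_splitters import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap are measured in characters by default,
# not tokens. Pass a token-based length_function (or use
# RecursiveCharacterTextSplitter.from_tiktoken_encoder) to size by tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=768, chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_documents(all_documents)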
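As an alternative illustration of "semantic boundaries first, size limits second", the sketch below splits on markdown headings and only then caps section size at paragraph boundaries. It assumes markdown-style `#` headings and uses word count as a rough stand-in for tokens; `split_by_headings` and `max_words` are hypothetical names, not part of any library.

```python
import re

def split_by_headings(text, max_words=200):
    """Split on markdown headings, then cap each section by paragraph."""
    sections, heading, body = [], "", []
    for line in text.splitlines():
        if re.match(r"#{1,6} ", line):
            if body:
                sections.append((heading, "\n".join(body)))
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:
        sections.append((heading, "\n".join(body)))

    chunks = []
    for heading, section in sections:
        current, count = [], 0
        for para in (p.strip() for p in section.split("\n\n")):
            if not para:
                continue
            n = len(para.split())
            # Start a new chunk when adding this paragraph would exceed the cap.
            # A single oversized paragraph is kept whole rather than cut mid-thought.
            if current and count + n > max_words:
                chunks.append({"heading": heading, "text": "\n\n".join(current)})
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            chunks.append({"heading": heading, "text": "\n\n".join(current)})
    return chunks

doc = "# Returns\n\nElectronics: 30 days.\n\nClothing: 60 days.\n\n# Shipping\n\nShips in 2 days."
chunks = split_by_headings(doc, max_words=4)
```

Carrying the heading along with each chunk gives you the section-title metadata that later powers citations.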

For complex documents, late chunking is gaining traction: run the full document through a long-context embedding model first so every token embedding carries document-wide context, then split into chunks afterward and pool the token embeddings within each chunk. For critical docs, LLM-based chunking produces the cleanest boundaries but costs more.
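The pooling half of late chunking can be illustrated without a model. In the sketch below, `token_vectors` is a hand-written stand-in for the per-token output of a long-context embedding model run once over the whole document; the values, the `pool_chunks` helper, and the use of mean pooling are assumptions for illustration.

```python
def pool_chunks(token_vectors, spans):
    """Mean-pool contextual token vectors over each (start, end) token span."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        dim = len(window[0])
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled

# Hypothetical per-token embeddings for a four-token document.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
# Chunk boundaries are chosen after encoding, as (start, end) token offsets.
spans = [(0, 2), (2, 4)]
chunk_vectors = pool_chunks(token_vectors, spans)
```

The key difference from ordinary chunking is the order of operations: each token vector was computed with the whole document in view, so the pooled chunk vector inherits that context.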

Step 4: Choose Embeddings and a Vector Store

The embedding model needs to match your domain, language, and budget. In 2026, text-embedding-3-large (OpenAI) and bge-large-en-v1.5 (BAAI) are widely used for English RAG; check the current MTEB leaderboard before committing, since rankings shift often. For local-first or cost-sensitive deployments, all-MiniLM-L6-v2 (Sentence Transformers) delivers solid quality at 384 dimensions with zero API costs. Racine AI’s January 2026 production benchmarks show this model achieving 87% recall@10 on 50,000 technical documents.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
chunk_texts = [chunk.page_content for chunk in chunks]
embeddings = model.encode(chunk_texts, batch_size=32, normalize_embeddings=True)

Setting normalize_embeddings=True scales each vector to unit length, so the dot product of two embeddings equals their cosine similarity: same ranking, cheaper computation.
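The equivalence is easy to check by hand. This small sketch uses plain Python lists in place of real embeddings; the two vectors are arbitrary illustrative values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
cos_raw = cosine(a, b)                      # cosine on the raw vectors
dot_unit = dot(normalize(a), normalize(b))  # plain dot product on unit vectors
```

Both values come out the same, which is why normalizing once at indexing time lets the search step skip the norm computation.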

Vector database selection in 2026:

Database | Best fit
pgvector (PostgreSQL) | Teams on Postgres; < 5M vectors
Qdrant | Fastest self-hosted filtering; ~6ms p50 at 1M vectors
Chroma | Local prototypes, small internal tools
Pinecone | Fully managed, auto-scaling
Weaviate | Rich schema, built-in hybrid search

pgvector is particularly popular for corpora under 5 million vectors, where it can match Pinecone’s performance without managed-service costs.

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(host="localhost", database="rag_db", user="rag_user", password="***")
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    # vector(384) matches the output dimension of all-MiniLM-L6-v2.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY, content TEXT NOT NULL,
            embedding vector(384), metadata JSONB,
            source_file VARCHAR(500), created_at TIMESTAMP DEFAULT NOW()
        )
    """)
    # HNSW index for approximate nearest-neighbour search on cosine distance.
    # IF NOT EXISTS keeps the script safe to re-run.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS documents_embedding_idx ON documents
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)
conn.commit()
# Register the vector type so numpy arrays can be passed as query parameters.
register_vector(conn)

Step 5: Retrieve More Than One Way

Basic semantic search (embed the query, find nearest neighbors, return top-k) is a fine starting point. But it misses exact terms, product names, error codes, SKU numbers, and freshly coined acronyms that the embedding model has never seen.

In 2026, production teams overwhelmingly adopt hybrid retrieval: dense vector search combined with BM25 keyword search, fused via reciprocal rank fusion or weighted scoring. Redis’s February 2026 optimization guide reports hybrid search improves recall by 5-8% on technical corpora. Kapa.ai found that combining hybrid retrieval with contextual chunk enrichment reduces error rates by roughly 69%.

-- Parameters, in order: query embedding, query text, query text, query embedding, limit.
SELECT id, content,
    (0.7 * (1 - (embedding <=> %s))) +
    (0.3 * ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s)))
    AS hybrid_score
FROM documents
WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
   OR (embedding <=> %s) < 0.5
ORDER BY hybrid_score DESC LIMIT %s;

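This is weighted scoring: it weights semantic similarity at 70% and keyword match at 30%. Tune these against your evaluation set, and keep in mind that ts_rank scores are not on the same scale as cosine similarity, so the effective balance may differ from the nominal weights.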
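Reciprocal rank fusion, the other fusion option mentioned above, sidesteps score scaling by combining ranks instead of raw scores. A minimal sketch follows; the chunk IDs and the two input rankings are made up, and k=60 is the commonly used default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense = ["c1", "c2", "c3"]   # hypothetical vector-search ranking
bm25 = ["c3", "c1", "c4"]    # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([dense, bm25])
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever still survives further down.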

Reranking is the high-leverage next step. Retrieve 20-50 candidates cheaply, then pass them through a cross-encoder reranker (Cohere Rerank 3.5, Voyage AI rerank-2.5, or BGE reranker). Cross-encoders jointly encode query and chunk through a transformer, producing a fine-grained relevance score. Kapa.ai’s data shows reranking yields a 20-30% improvement in top-k quality, and API-based reranking costs just $0.025-0.050 per million tokens, which is trivial when applied to a small candidate set.
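The retrieve-then-rerank shape can be sketched independently of any particular reranker. Below, `score_fn` is the slot where a cross-encoder call would go; `toy_overlap_score` is a deliberately crude placeholder (word overlap) so the example runs without a model, and it is not how a cross-encoder scores relevance.

```python
def rerank(query, candidates, score_fn, top_k=5):
    """Score every (id, text) candidate against the query and keep the best."""
    scored = [(score_fn(query, text), cid) for cid, text in candidates]
    # Python's sort is stable, so ties keep their first-stage retrieval order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(cid, score) for score, cid in scored[:top_k]]

def toy_overlap_score(query, text):
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q)

# Hypothetical first-stage candidates, in retrieval order.
candidates = [
    ("c1", "shipping takes two days"),
    ("c2", "error E42 means the battery is low"),
    ("c3", "battery care tips"),
]
top = rerank("what does error e42 mean", candidates, toy_overlap_score, top_k=2)
```

Because the expensive scorer only sees the candidate set, its cost scales with the number of candidates, not with the size of the index.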

Step 6: Build a Source-Grounded Prompt

Your prompt is a contract. If it’s vague, the model fabricates. Be explicit:

You are answering questions using only the retrieved context below.
Rules:
- Answer ONLY from the provided context.
- If the context does not contain the answer, say: "The source material does not contain enough information."
- Cite source IDs that support each claim.
- Do not invent pricing, policy, medical advice, legal interpretations, or technical specs.

Context:
[S1] ...
[S2] ...

Question: ...

Answer (with citations):

This doesn’t magically eliminate hallucinations, but it makes expected behavior testable. If the model cites [S2] but S2 doesn’t contain the claimed fact, you have a measurable failure.

Code for prompt assembly:

from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

template = """Answer only from the context below. If the answer is not in the context, say so. Cite source IDs.

Context:
{context}

Question: {question}

Answer (with citations):"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

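def generate_answer(query, retrieved_chunks):
    # Assumes each chunk's metadata carries a "source_id" such as "S1".
    context = "\n\n".join(f"[{c.metadata['source_id']}] {c.page_content}" for c in retrieved_chunks)
    # invoke() returns a message object; .content holds the answer text.
    return llm.invoke(prompt.format(context=context, question=query)).content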

Step 7: Citations Users Can Actually Inspect

A good citation takes users as close as possible to the supporting evidence: page number, section, paragraph, URL anchor. Don’t settle for a document title.

More importantly: verify that the cited source actually says what the answer claims. A model can cite a relevant source while fabricating a claim that source never made. This is “citation hallucination,” and it’s one of the most dangerous failure modes because citations create false trust. Test citation accuracy by having a human reviewer (or a second LLM-as-judge) verify that each cited sentence is supported by the corresponding source.
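Before the human or LLM-as-judge pass, a cheap structural check can catch the most obvious citation failures. The sketch below assumes citations appear as `[S1]`-style tags, as in the prompt template earlier; it only verifies that cited IDs were actually retrieved and flags sentences with no citation. It does not check whether the source supports the claim.

```python
import re

def check_citations(answer, retrieved_ids):
    """Flag citations to unknown sources and sentences without any citation."""
    cited = set(re.findall(r"\[(S\d+)\]", answer))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return {
        "unknown_ids": sorted(cited - set(retrieved_ids)),
        "uncited_sentences": [s for s in sentences if not re.search(r"\[S\d+\]", s)],
    }

# Hypothetical model output: S7 was never retrieved, and the last claim has no citation.
answer = ("Electronics can be returned within 30 days [S1]. "
          "Refunds take a week [S7]. Gift cards are final.")
report = check_citations(answer, retrieved_ids=["S1", "S2"])
```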

Step 8: Evaluate Systematically

Evaluation is where most teams get stuck. The demo works on five hand-picked queries and everyone celebrates. Then real users arrive with vague, misspelled, multi-part questions and the system fails silently.

Evaluate retrieval first: for each question in your test set, did the correct source appear in the top 3? The top 10? Were irrelevant chunks crowding out the right ones? Did metadata filters accidentally remove a relevant chunk?

Then evaluate generation: Is the answer faithful to context? Complete? Does it refuse cleanly when context is insufficient? Are citations accurate? Is latency acceptable?
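The retrieval questions above reduce to a couple of simple metrics. This is a minimal sketch, assuming each test case is recorded as the expected source plus the ranked list of sources the retriever returned; the file names are invented, and mean reciprocal rank is included as one common companion metric.

```python
def hit_rate_at_k(results, k):
    """Share of questions whose expected source appears in the top k."""
    hits = sum(1 for expected, retrieved in results if expected in retrieved[:k])
    return hits / len(results)

def mean_reciprocal_rank(results):
    """Average of 1/rank of the expected source (0 when it is missing)."""
    total = 0.0
    for expected, retrieved in results:
        if expected in retrieved:
            total += 1.0 / (retrieved.index(expected) + 1)
    return total / len(results)

# (expected source, ranked retrieved sources) for three hypothetical questions.
results = [
    ("policy.md", ["policy.md", "faq.md", "blog.md"]),
    ("manual.pdf", ["faq.md", "blog.md", "notes.md", "manual.pdf"]),
    ("pricing.md", ["faq.md", "blog.md"]),
]
hit3 = hit_rate_at_k(results, 3)
hit10 = hit_rate_at_k(results, 10)
mrr = mean_reciprocal_rank(results)
```

A large gap between the top-3 and top-10 hit rates suggests the right chunks are being found but ranked too low, which points toward reranking rather than re-chunking.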

The open-source RAGAS library (https://docs.ragas.io) provides standardized metrics (faithfulness, answer relevancy, context precision, context recall) powered by LLM-as-judge. DeepEval is a comparable alternative.

# Imports and column names below follow the RAGAS 0.1-style API; newer
# releases may differ, so check the docs for your installed version.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": ["What is the return policy for electronics?"],
    "answer": ["Electronics can be returned within 30 days with original receipt."],
    "contexts": [["Return Policy: Electronics may be returned within 30 days..."]],
    "ground_truth": ["Electronics may be returned within 30 days of purchase with receipt."]
})

results = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy, context_precision, context_recall])
print(results)

However, a 2026 Future AGI analysis notes that off-the-shelf RAGAS metrics can be inconsistent between runs. The gold standard: create your own ground-truth evaluation set with manually verified Q&A pairs, combine it with automated metrics for scale, and include periodic human review for high-stakes domains.

Step 9: Freshness, Access Control, and Streaming

Document freshness. Nightly full re-indexing works for small bases. For living systems, implement incremental updates: monitor file timestamps or version control commits and only re-embed what changed. Delete removed documents immediately. Keep versions when users may ask about past policy. Most enterprises combine frequent incremental updates with occasional full re-indexing.
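A minimal sketch of the change-detection step follows. It compares content hashes rather than file timestamps or commits, which is a substitution made here so the example stays self-contained; `plan_reindex` and the document IDs are hypothetical.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(current_docs, indexed_hashes):
    """Decide which documents to (re-)embed and which to delete from the index."""
    plan = {"embed": [], "delete": [], "hashes": {}}
    for doc_id, text in current_docs.items():
        digest = content_hash(text)
        plan["hashes"][doc_id] = digest
        if indexed_hashes.get(doc_id) != digest:   # new or changed document
            plan["embed"].append(doc_id)
    # Anything indexed earlier but no longer present should be removed.
    plan["delete"] = [d for d in indexed_hashes if d not in current_docs]
    return plan

indexed = {"a": content_hash("old"), "b": content_hash("same"), "c": content_hash("gone")}
current = {"a": "new", "b": "same", "d": "fresh"}
plan = plan_reindex(current, indexed)
```

Persisting `plan["hashes"]` after a successful run gives the next run its baseline.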

Access control. If a user should not see a document, it must not be retrieved for that user. Filter at retrieval time using PostgreSQL row-level security or vector store metadata filters, not after the model has seen sensitive text. Post-retrieval permission checking is a security anti-pattern.
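The ordering matters more than the mechanism, and an in-memory sketch makes it visible: the permission filter runs before any ranking. In a real deployment the filter would live in the database query or vector store filter; the `acl_group` field, chunk IDs, and vectors below are illustrative assumptions.

```python
def retrieve_for_user(query_vec, chunks, user_groups, top_k=3):
    """Filter by access group first, then rank only what the user may see."""
    allowed = [c for c in chunks if c["acl_group"] in user_groups]
    ranked = sorted(
        allowed,
        key=lambda c: sum(q * x for q, x in zip(query_vec, c["vec"])),
        reverse=True,
    )
    return [c["id"] for c in ranked[:top_k]]

chunks = [
    {"id": "salary-bands", "acl_group": "hr", "vec": [0.9, 0.1]},
    {"id": "holiday-policy", "acl_group": "all-staff", "vec": [0.7, 0.3]},
    {"id": "office-map", "acl_group": "all-staff", "vec": [0.1, 0.9]},
]
# The best-matching chunk is HR-only, so it must never reach this user's prompt.
visible = retrieve_for_user([1.0, 0.0], chunks, user_groups={"all-staff"})
```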

Streaming RAG. In 2026, real-time RAG architectures are replacing batch pipelines for latency-sensitive apps. Instead of nightly indexing, document changes flow through change data capture (CDC) or event streams (Kafka, Confluent), get re-embedded on the fly, and keep the vector index continuously fresh. This is essential for customer-facing bots where stale pricing or availability answers directly lose revenue.

Step 10: Monitor Production

Log enough to debug failures: user query, retrieved chunk IDs with scores, final answer, cited sources, latency (embedding + retrieval + generation), model/embedding version fingerprints, and user feedback signals.
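One possible shape for such a log record is sketched below as a JSON-serializable dict. The field names, the `build_trace` helper, and the version strings are hypothetical; adapt them to whatever your observability tool expects.

```python
import json
import time
import uuid

def build_trace(query, retrieved, answer, cited, timings_ms, versions):
    """Assemble one structured log record for a single RAG request."""
    return {
        "trace_id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "retrieved": [{"chunk_id": cid, "score": round(score, 4)} for cid, score in retrieved],
        "answer": answer,
        "cited_sources": cited,
        "latency_ms": {**timings_ms, "total": sum(timings_ms.values())},
        "versions": versions,
        "feedback": None,   # filled in later if the user rates the answer
    }

trace = build_trace(
    query="What is the return policy for electronics?",
    retrieved=[("S1", 0.91234), ("S2", 0.57)],
    answer="Electronics can be returned within 30 days [S1].",
    cited=["S1"],
    timings_ms={"embedding": 10, "retrieval": 20, "generation": 300},
    versions={"llm": "gpt-4o-mini", "embedding": "all-MiniLM-L6-v2"},
)
line = json.dumps(trace)   # one JSON object per request
```

Logging the retrieved chunk IDs alongside the answer is what lets you later tell a retrieval failure from a generation failure.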

Watch for: stale documents surfacing, repeated unanswered questions (revealing content gaps), slow retrieval after re-indexing, citation drift after model changes, and permission filter bypasses. Observability platforms like Langfuse, Arize Phoenix, and Maxim AI provide tracing, evaluation dashboards, and production regression detection.

A Simple Architecture That Actually Works

  1. Store source docs in a version-controlled repository.
  2. Run incremental indexing on document changes.
  3. Split by heading boundaries with 500-1000 token limits.
  4. Store chunks with rich metadata in pgvector or Qdrant.
  5. Use hybrid retrieval (dense + BM25) with reciprocal rank fusion.
  6. Rerank top-20 candidates with a cross-encoder if quality needs it.
  7. Generate answers with strict source-grounding and citations.
  8. Return citations linked directly to source sections.
  9. Maintain a small, human-reviewed evaluation set and run it on every change.
  10. Log everything and monitor for regressions.

That architecture will outperform a flashier system built on messy documents and zero evaluation.

Common Mistakes

Indexing dirty text. Bad extraction creates bad retrieval. No embedding model can save a PDF parser that jumbles columns.

Optimizing for the demo query. Real users type fragments, typos, and multi-language queries. If your evaluation set only has clean single-intent English questions, you’re testing a fantasy.

Skipping refusals. A good RAG system says “I don’t have enough source material” instead of guessing. Refusal quality is a first-class metric.

Ignoring access control. Retrieval must obey the same permissions as the rest of your app. Post-retrieval filtering is a security anti-pattern.

Choosing the model over the pipeline. In 2026, retrieval quality matters more than whether you use GPT-4o or Claude. Swap the LLM last, not first.

Frequently Asked Questions

What chunk size should I start with?

500 to 1000 tokens with semantic boundaries first, token limits second. Test against your evaluation set. Code, legal docs, tables, and FAQs need different rules.

Do I need a vector database?

Not for a prototype: Chroma or in-memory FAISS works for initial testing. At production scale, with hundreds of thousands of chunks and concurrent users, pgvector, Qdrant, or Pinecone becomes essential.

Do I need hybrid search?

Test it. Hybrid search (dense + BM25) consistently improves results for exact product names, error codes, SKUs, legal clause references, and fresh acronyms that embedding models don’t know. The cost is minimal; the benefit is often substantial.

Is a larger context window a replacement for RAG?

Not usually. Long-context models (128K+ tokens) help, but RAG gives you retrieval precision, permission enforcement, freshness, citation grounding, and dramatically lower per-query token costs. Dumping everything into context works for small static sets; RAG is cleaner for living knowledge bases.

How do I know my RAG system is actually good?

Use a fixed evaluation set. Measure retrieval hit rate (top-3, top-10), answer faithfulness, citation accuracy, refusal quality, latency percentiles (p50, p99), and user feedback trends. If you can’t measure it, you can’t improve it.
