Retrieval-Augmented Generation: Building LLMs That Know What They Don't Know
The fundamental tension in deploying large language models for knowledge-intensive tasks is this: LLMs know an enormous amount, but what they know is frozen at training time, difficult to update, and unreliably recalled. Retrieval-Augmented Generation is the practical resolution of this tension — a pattern that gives models access to external, authoritative, up-to-date knowledge without the cost of retraining.
The idea is elegant and the tutorial version is simple. The production version is significantly more complex. This post covers both — with honest attention to the gaps between them.
The Architecture, Simply
In a basic RAG system:
- A user asks a question
- The question is used to retrieve relevant documents from a knowledge base
- The retrieved documents and the original question are combined into a prompt
- An LLM generates an answer based on the prompt, grounded in the retrieved documents
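The loop above can be sketched in a few lines. Here `embed`, `vector_search`, and `llm_generate` are hypothetical stand-ins for an embedding model, a vector store, and an LLM API; the point is the shape of the pipeline, not the internals:

```python
# Minimal RAG loop. The three helpers are stand-ins for real services.

def embed(text: str) -> list[float]:
    # Stand-in: a real system calls an embedding model here.
    return [float(sum(map(ord, text)) % 97)]

def vector_search(query_vec: list[float], corpus: list[dict], k: int = 3) -> list[dict]:
    # Stand-in: a real system queries a vector database here.
    return sorted(corpus, key=lambda d: abs(d["vec"][0] - query_vec[0]))[:k]

def llm_generate(prompt: str) -> str:
    # Stand-in: a real system calls a chat/completion API here.
    return f"[answer grounded in {prompt.count('[DOC')} documents]"

def answer(question: str, corpus: list[dict], k: int = 3) -> str:
    hits = vector_search(embed(question), corpus, k)
    context = "\n".join(f"[DOC-{i + 1}] {d['text']}" for i, d in enumerate(hits))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm_generate(prompt)
```

Everything that follows in this post is a refinement of one of these four calls.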
This is not conceptually novel — information retrieval has existed for decades. What RAG adds is the language model’s ability to synthesize, interpret, and respond in natural language based on retrieved material, rather than simply returning matched documents.
The knowledge base can be anything: a corpus of company documentation, a legal code, medical literature, a support ticket history, a product catalog. The key is that it is external to the model weights — verifiable, updatable, and owned by the deployer.
Step 1: Document Ingestion and Chunking
Before any queries can be answered, documents must be processed into a form the retrieval system can use. This involves chunking — splitting documents into smaller pieces — and embedding — converting text into dense vector representations.
Chunking Strategy Matters More Than You Think
The default chunking approach — split on a fixed number of tokens, perhaps with some overlap — works adequately and fails predictably. The failures are specific:
Semantic splitting is better. Fixed-size splits cut in the middle of a table, mid-paragraph, or between a header and its associated content, producing chunks that are either incomplete or useless for answering questions about that content. Chunk on semantic boundaries instead: paragraphs, sections, table rows.
Chunk size affects retrieval quality. Small chunks (100 tokens) contain precise information but may lack context. Large chunks (1000 tokens) preserve context but reduce retrieval precision — you might retrieve a chunk that contains the answer buried in largely irrelevant material. The optimal size depends on your content type and query patterns. Experiment rather than assume.
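A minimal sketch of boundary-aware chunking: split on paragraphs, then pack paragraphs into chunks up to a budget. The word count is a rough stand-in for tokens, and both the budget and the blank-line delimiter are assumptions to tune for your content:

```python
# Pack paragraphs into chunks of at most `max_words` words.
# A paragraph longer than the budget becomes its own chunk rather
# than being cut mid-paragraph.

def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```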
Hierarchical chunking. An increasingly effective strategy: create chunks at multiple granularities (paragraph-level and section-level) and embed both. Retrieve at the fine-grained level for precision, but return the parent section for context. This “small-to-big” retrieval pattern significantly improves answer quality when the precise information needs surrounding context to be intelligible.
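A sketch of the small-to-big pattern: rank fine-grained child chunks, then return their deduplicated parent sections. `query_score` is a stand-in for real vector similarity against the query:

```python
# Retrieve at child-chunk granularity, return parent sections for context.
# `children` carry a parent_id; `parents` maps id -> full section text.

def small_to_big(query_score, children: list[dict], parents: dict, k: int = 2) -> list[str]:
    ranked = sorted(children, key=lambda c: query_score(c["text"]), reverse=True)
    seen: set = set()
    results: list[str] = []
    for child in ranked[:k * 2]:        # over-fetch, then dedupe parents
        pid = child["parent_id"]
        if pid not in seen:
            seen.add(pid)
            results.append(parents[pid])
        if len(results) == k:
            break
    return results
```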
Preserve metadata. Every chunk should carry metadata: source document, section title, date, author, page number. This metadata enables filtering, attribution, and debugging — all of which you will need in production.
Step 2: Embedding and Indexing
Embedding converts text into vectors in a high-dimensional space where semantically similar text maps to nearby points. A question about “aircraft engine maintenance” should retrieve text about turbofan service intervals even if none of those exact words appear in the question.
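That "nearby points" notion is usually measured with cosine similarity. A toy illustration with hand-made three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, but the arithmetic is the same):

```python
import math

# Cosine similarity: the standard closeness measure for embedding vectors.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" standing in for real model output:
maintenance_q = [0.9, 0.1, 0.2]   # "aircraft engine maintenance"
turbofan_doc  = [0.8, 0.2, 0.3]   # "turbofan service intervals"
recipe_doc    = [0.1, 0.9, 0.1]   # unrelated content

# The related document scores closer despite sharing no words with the query.
assert cosine(maintenance_q, turbofan_doc) > cosine(maintenance_q, recipe_doc)
```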
Choosing an Embedding Model
Several considerations:
Domain alignment. General-purpose embedding models (OpenAI text-embedding-3-large, nomic-embed-text, bge-large) perform well across domains. For highly specialized domains (legal, medical, technical), embedding models fine-tuned on domain-specific text can improve retrieval precision significantly.
Dimensionality vs. cost. Higher-dimensional embeddings capture more semantic nuance at the cost of storage and similarity search time. OpenAI’s text-embedding-3-large is 3072 dimensions; text-embedding-3-small is 1536. The small model is often sufficient and substantially cheaper.
Multilingual requirements. If your knowledge base or queries involve multiple languages, verify your embedding model supports them. Many models are English-dominant and perform poorly on other languages.
Query-document asymmetry. Good retrieval embedding models are trained to capture the relationship between short queries and longer documents, not just document-to-document similarity. Models such as BGE and E5 that are explicitly trained on retrieval tasks outperform symmetric semantic-similarity models for RAG.
Vector Databases
The embedding vectors are stored in a vector database that supports approximate nearest-neighbor (ANN) search — finding vectors close to a query vector in high-dimensional space efficiently.
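For intuition, exact brute-force nearest-neighbor search is the baseline that ANN indexes (HNSW, IVF, and similar) approximate for speed at scale. It fits in a few lines and is perfectly adequate for small corpora:

```python
import math

# Exact (brute-force) k-nearest-neighbor search by cosine similarity.
# ANN indexes trade a little recall for much better latency at scale.
def knn(query: list[float], vectors: list[tuple], k: int = 3) -> list[str]:
    """vectors: list of (id, vector). Returns the k closest ids."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    scored = sorted(((vid, cos(query, v)) for vid, v in vectors),
                    key=lambda t: t[1], reverse=True)
    return [vid for vid, _ in scored[:k]]
```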
Hosted solutions (Pinecone, Weaviate Cloud, Qdrant Cloud): Managed infrastructure, simple API, no operational burden. Appropriate for most production deployments where operational simplicity is a priority.
Self-hosted (Chroma, Weaviate, Qdrant, pgvector): More control, no vendor lock-in, can be co-located with your data for compliance. pgvector as a PostgreSQL extension is an excellent choice for organizations already running PostgreSQL and wanting to minimize infrastructure components.
Metadata filtering. Production RAG systems almost always need to filter by metadata — retrieve only documents from the last 6 months, only documents tagged for a specific department, only documents in a specific language. Verify that your vector database supports metadata filtering efficiently (not all do — some filter post-retrieval, which defeats the purpose).
Step 3: The Retrieval Process
Given a user query, the retrieval process finds the most relevant chunks. The naive version — embed the query, find the k nearest vectors — works and is a reasonable starting point. The production version adds several important refinements.
Query Expansion and Rewriting
Raw user queries are often bad retrieval queries. They are short, ambiguous, potentially misspelled, and may use different terminology than the documents. Query rewriting improves this:
HyDE (Hypothetical Document Embeddings). Generate a hypothetical ideal answer to the question using the LLM, then embed that hypothetical answer rather than the query itself. The hypothetical answer uses the vocabulary and style of the knowledge base, dramatically improving semantic matching.
Query expansion. Ask the LLM to produce multiple alternative phrasings of the query and retrieve against all of them, then merge and deduplicate results.
Sub-query decomposition. Complex multi-part questions should be decomposed into simpler sub-queries, each retrieving independently.
Hybrid Search
Semantic search (vector similarity) and keyword search (BM25/TF-IDF) capture different aspects of relevance. Semantic search finds contextually related content even when exact words differ. Keyword search excels at precise term matching — particularly for proper nouns, technical terms, model numbers, and other tokens that carry meaning in their exact form.
Hybrid systems that add BM25-based keyword search and combine the two result sets with Reciprocal Rank Fusion (RRF) or a learned ranker consistently outperform pure semantic retrieval in production.
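RRF itself is simple: each document's fused score is the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as the constant conventionally used since the original RRF paper:

```python
# Reciprocal Rank Fusion over any number of ranked lists (e.g. one from
# vector search, one from BM25). Ranks are 1-based.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists ("b" below) rise above documents that rank first in only one:

```python
rrf([["a", "b", "c"], ["b", "d"]])   # -> ["b", "a", "d", "c"]
```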
Reranking
After retrieving the top-k candidates from initial search, re-ranking with a cross-encoder model dramatically improves the precision of what actually gets passed to the LLM. Cross-encoders (like cross-encoder/ms-marco-MiniLM or Cohere Rerank) evaluate a query-document pair jointly, capturing interactions that embedding-based similarity cannot.
A typical production pipeline:
- Initial retrieval: top 20-50 candidates via hybrid search (fast, approximate)
- Reranking: top 20-50 → top 5 using cross-encoder (slower, precise)
- Pass top 5 to LLM for synthesis
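The two-stage shape reduces to a sketch like this, where `fast_retrieve` and `cross_encoder_score` are stand-ins for hybrid search and a reranking model such as a MiniLM cross-encoder or a hosted rerank API:

```python
# Retrieve-then-rerank: cheap approximate recall first, expensive
# precise scoring second, and only the final few reach the LLM.

def retrieve_then_rerank(query: str, corpus: list[str], fast_retrieve,
                         cross_encoder_score,
                         n_candidates: int = 20, n_final: int = 5) -> list[str]:
    candidates = fast_retrieve(query, corpus, n_candidates)      # fast, approximate
    rescored = sorted(candidates,
                      key=lambda doc: cross_encoder_score(query, doc),
                      reverse=True)                              # slow, precise
    return rescored[:n_final]
```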
Step 4: Generation with Grounding
The retrieved chunks, the user query, and the system prompt are combined into a context and passed to the LLM. The system prompt instructs the model to answer based only on the provided context, to cite its sources, and to indicate when the provided context does not contain sufficient information to answer the question confidently. A representative template:
You are a helpful assistant. Answer the user's question based solely on the provided context documents.
If the documents do not contain sufficient information to answer the question, say so explicitly rather than guessing.
For every factual claim in your answer, cite the source document(s) using the document IDs provided.
Context:
[DOC-1 | Source: maintenance-manual-v3.pdf | Section: 7.3]: ...
[DOC-2 | Source: service-bulletin-2024-112.pdf]: ...
User question: ...
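Assembling that prompt from retrieved chunks might look like the sketch below, reusing the tag format above and the metadata preserved at ingestion time (the `meta` field names are illustrative):

```python
# Build the grounded prompt: system instructions, tagged context
# documents with source metadata, then the user question.

def build_prompt(question: str, chunks: list[dict], system: str) -> str:
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk["meta"]["source"]
        section = chunk["meta"].get("section")
        tag = f"[DOC-{i} | Source: {source}"
        if section:
            tag += f" | Section: {section}"
        tag += "]"
        context_lines.append(f"{tag}: {chunk['text']}")
    return (f"{system}\n\nContext:\n" + "\n".join(context_lines)
            + f"\n\nUser question: {question}")
```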
Context window management. LLMs have finite context windows. As retrieval improves (retrieving more documents) and documents grow longer, context overflow becomes an issue. Strategies include: increasing chunk precision to retrieve smaller, more relevant chunks; using models with larger context windows (Gemini 1.5 supports 1M tokens); and selective inclusion based on reranking scores.
Long context pitfalls. Research suggests LLMs have a “lost in the middle” problem — they attend better to content at the beginning and end of long contexts, underweighting content in the middle. Keep critical information outside the middle of very long context prompts, or use models specifically fine-tuned for long-context faithfulness.
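One cheap mitigation, given a best-first ranking from the reranker, is to reorder documents so the strongest land at the edges of the context and the weakest in the middle:

```python
# Place documents alternately at the front and back of the context,
# pushing the lowest-ranked documents toward the "lost" middle.

def edge_order(docs_best_first: list) -> list:
    front, back = [], []
    for i, doc in enumerate(docs_best_first):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]
```

For five documents ranked 1 (best) to 5 (worst), this yields the order 1, 3, 5, 4, 2: the two strongest sit at the start and end, the weakest in the middle.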
Evaluation: The Hardest Part
RAG systems are difficult to evaluate rigorously. The failure modes are subtle: the retrieval step might return relevant-but-not-quite-right documents, the LLM might accurately synthesize those documents into a plausible-but-wrong answer, or the LLM might correctly answer despite poor retrieval due to its own parametric knowledge.
RAGAS (RAG Assessment) is a framework providing automated metrics across several dimensions:
- Faithfulness: Are the claims in the answer supported by the retrieved context?
- Answer relevance: Does the answer actually address the question?
- Context precision: How relevant are the retrieved chunks?
- Context recall: Were all relevant chunks retrieved?
These metrics require a golden dataset of question-answer-context triples — expensive to create, but essential for confident deployment. A RAG system that performs well on anecdotal tests can fail systematically on specific query types and the only way to know is to measure.
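As a toy illustration, the two context metrics reduce to set arithmetic once you have a golden set of relevant chunk IDs for a question. (Real frameworks such as RAGAS use an LLM judge rather than exact ID matching, but the structure is the same.)

```python
# Retrieval metrics against a golden set of relevant chunk IDs
# for a single question.

def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    return sum(1 for c in relevant if c in retrieved) / len(relevant)
```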
What Production Actually Looks Like
A production-grade RAG system significantly exceeds the tutorial architecture:
Ingestion pipeline: Document monitoring for new/updated content, chunking and embedding with error handling and retry, metadata extraction, incremental indexing.
Query pipeline: Query rewriting, hybrid search, reranking.
Generation pipeline: Prompt construction with context management, citation extraction from generated text, answer post-processing.
Monitoring: Query logging (with appropriate privacy handling), retrieval quality metrics, answer quality sampling, latency tracking, cost tracking (API calls to embedding/generation models are real money at scale).
Feedback loops: Mechanism to capture user signals (thumbs up/down, citations checked) and use them to improve retrieval and generation over time.
None of these components are optional in a production system. The tutorial RAG works for demos. The full architecture above works for products.
When RAG Is Not the Answer
RAG is effective when queries require factual recall from a known knowledge base. It is less applicable when:
The query requires reasoning, not retrieval. Complex multi-step logical derivations or mathematical reasoning require the LLM’s parametric capabilities, not retrieved context. Injecting irrelevant retrieved chunks can actually hurt performance on pure reasoning tasks.
The knowledge base changes faster than retrieval can track. RAG latency is real — retrieval, reranking, and generation add up. For truly real-time data (financial tick data, live sensor feeds), streaming inference with inline context may be more appropriate than retrieval.
The domain is well-represented in the model’s training data and accuracy requirements are not extreme. For many general-knowledge queries, a capable base model without RAG produces acceptable answers more cheaply. The cost and complexity of RAG infrastructure must be justified by the accuracy and reliability requirements of the application.
RAG is a powerful architectural pattern for making LLMs reliable in production. It is also work — significant engineering work to implement well and ongoing operational work to maintain. The gap between the tutorial and production versions is wide. Understanding that gap before you commit to the architecture is the first step toward closing it successfully.