The Hallucination Problem: Why LLMs Confabulate and What We Can Do About It


IQLAS

In early 2023, a New York lawyer submitted a legal brief citing six cases provided by ChatGPT. None of the cases existed. The citations were plausible — correct court format, realistic docket numbers, credible-sounding summaries — and entirely fabricated. The lawyer had not verified them. The judge was not amused.

This incident became a touchstone example of what the AI community calls hallucination — the tendency of large language models to generate confident, fluent, plausible-sounding text that is factually false. It is one of the most significant barriers to the reliable deployment of these systems in high-stakes domains. It is also widely misunderstood, which makes it harder to address.

What Hallucination Actually Is

The term “hallucination” is borrowed from neuroscience, where it refers to perception without a corresponding external stimulus. Applied to language models, it has become a catch-all for various failure modes that are actually quite distinct:

Factual errors: The model states something false about the world — wrong dates, nonexistent events, incorrect numerical values, fictional citations.

Intrinsic hallucination: The model generates content that contradicts the source material it was asked to summarize or analyze.

Extrinsic hallucination: The model adds information not supported by the source material — speculating beyond what was provided.

Knowledge cutoff errors: The model asserts things that were true during training but are no longer true.

Confidence miscalibration: The model expresses high certainty about things it shouldn’t be certain about, or low certainty about things it should know well.

Understanding which type of hallucination is occurring matters for choosing the right mitigation strategy.

Why Models Hallucinate: A Mechanistic Perspective

The question “why do language models hallucinate?” has a technically nuanced answer that is worth working through.

Training Objective Misalignment

Language models are trained with a next-token prediction objective: given a sequence of tokens, predict the next one. This objective is extraordinarily well-specified for generating fluent, coherent text — text where each token is statistically consistent with what came before.

But fluency and factual accuracy are not the same objective. A fluent continuation of “The treaty was signed in” might be “Vienna in 1815” (true) or “Paris in 1823” (false); both can look nearly equally plausible to the next-token predictor if training never established the correct association strongly. The training signal penalizes implausible completions, not false ones, so a plausible-but-false completion and a true one are treated much the same.
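The near-equivalence of plausible completions can be made concrete with a toy softmax over hypothetical logits for the treaty example. The numbers are invented for illustration, not real model outputs:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits a model might assign to continuations of
# "The treaty was signed in ..." (illustrative values only).
continuations = ["Vienna in 1815", "Paris in 1823", "London in 1820"]
logits = [2.1, 2.0, 1.2]  # the true and false completions score almost identically

probs = softmax(logits)
for text, p in zip(continuations, probs):
    print(f"{text}: {p:.2f}")
```

Nothing in the next-token objective separates the first two options: the true completion edges out the false one by only a few points of probability, so sampling will regularly produce the falsehood.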

This is a fundamental misalignment. The model learns to produce tokens that look like factual text more than it learns to produce factually accurate text.

Knowledge Representation

Knowledge in a large language model is distributed across billions of parameters. There is no lookup table, no database, no explicit knowledge store. Facts are encoded implicitly in weights, entangled with each other, the contexts in which they appeared, and the statistical patterns of language itself.

When a model “knows” that the capital of France is Paris, it is not looking that fact up in a table. It is generating a response in which “Paris” receives high probability given the context of the question — because that association was reinforced many times during training. When a model “recalls” a fictional legal case, it is generating something that has high probability given a context asking for legal citations, matching the statistical patterns of real citations in the training data, without any reliable mechanism to check whether the specific case actually exists.

The absence of a separation between “things the model has memorized reliably” and “things the model is generating plausibly” is a core architectural contributor to hallucination.

The Training Data Problem

Models are trained on internet-scale corpora that contain enormous amounts of false, misleading, and inconsistent information alongside accurate information. False claims appear in training data. Contradictory statements appear, training the model simultaneously on both. The model learns the statistical pattern of saying things that sound like facts, whether or not those things are facts.

Deliberate attempts to address this through RLHF (Reinforcement Learning from Human Feedback) and similar alignment techniques improve calibration significantly — models become better at expressing uncertainty, better at refusing to answer when they don’t know, and better at flagging potential unreliability. But RLHF introduces its own distortions, and the fundamental issue — that the model has no ground truth to consult — remains.

What Actually Helps

The good news is that hallucination rates are not fixed. They vary substantially by domain, by prompting strategy, by architecture, and by the supplementary systems wrapped around the model.

Retrieval-Augmented Generation

RAG (Retrieval-Augmented Generation) is currently the most effective practical mitigation for factual hallucination in deployed systems. The insight is simple: don’t ask the model to retrieve facts from its weights; give it the facts in its context and ask it to reason over them.

In a RAG system, a query is used to retrieve relevant documents from an external knowledge base (a vector database, a document store, a search index containing verified information). Those documents are injected into the model’s context, and the model is instructed to answer based only on the provided documents.
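A minimal sketch of that retrieve-then-ground loop, with a keyword-overlap scorer standing in for a real vector-similarity retriever. The documents and prompt wording are invented for illustration:

```python
def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query (a stand-in for
    real vector-similarity search) and return the top k."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Inject retrieved documents into the context and constrain the
    model to answer only from them."""
    context = "\n".join(f"- {d}" for d in documents)
    return (
        "Answer using ONLY the documents below. "
        "If they do not contain the answer, say so.\n"
        f"Documents:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "The Treaty of Vienna was signed in 1815.",
    "The Eiffel Tower opened in 1889.",
    "Photosynthesis occurs in chloroplasts.",
]
query = "When was the Treaty of Vienna signed?"
top = retrieve(query, docs)
prompt = build_prompt(query, top)
```

The explicit instruction to refuse when the documents are silent is doing real work here: it gives the model a sanctioned alternative to confabulating an answer.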

This converts the problem from “can the model accurately recall this?” (which it may not be able to do reliably) to “can the model accurately extract and synthesize this?” (which it does much better). The model’s role shifts from rememberer to reasoner.

RAG is not a complete solution. The model can still misread or mischaracterize the provided documents. The retrieval step can fail to surface the relevant document. The model can still add information not in the retrieved context if not carefully constrained. But RAG dramatically reduces factual hallucination for queries within the knowledge base’s coverage.

Prompting Strategies

How you prompt a model meaningfully affects hallucination rate.

Encourage uncertainty expression. Explicitly instruct the model to say “I don’t know” or “I’m not sure” rather than guessing. Default model behavior tends toward confident completion; explicit instruction shifts the prior.

Ask for sources. Requesting that the model cite sources or explain its reasoning forces it into a mode where it must generate not just a claim but a justification. This doesn’t guarantee accuracy — the model can hallucinate sources — but it surfaces reasoning that can be verified.

Chain-of-thought prompting. Asking the model to “think step by step” before answering reduces hallucination on tasks requiring multi-step reasoning. The intermediate steps create a constraint: each step must be consistent with the steps before it, which reduces the probability of high-confidence wrong final answers.

Temperature reduction. Lower temperature in text generation reduces randomness in token selection, leading to more conservative completions that hew closer to high-probability training patterns. This reduces creative hallucination at the cost of reduced diversity.
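Temperature works by scaling logits before the softmax. A short sketch with invented token scores shows how lowering it concentrates probability mass on the top candidate:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Scale logits by 1/temperature before normalizing. Lower
    temperature concentrates probability on the highest-scoring
    tokens; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 2.5, 1.0]  # illustrative token scores, not real model output
cold = softmax_with_temperature(logits, temperature=0.2)
hot = softmax_with_temperature(logits, temperature=1.5)
```

At temperature 0.2 nearly all the probability sits on the top token; at 1.5 the tail tokens become live sampling options, which is where more speculative completions come from.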

Model Selection and Fine-Tuning

Hallucination rates vary significantly across models and model families. Larger models in a given family hallucinate less than smaller ones, on average. Models specifically trained for factual accuracy (often involving deliberate RLHF rewards for uncertainty expression and factual verification) perform better than general-purpose models.

Domain-specific fine-tuning on high-quality curated data can significantly reduce hallucination for knowledge within the fine-tuning domain, at the risk of increasing it for out-of-domain queries.

Output Verification and Double-Checking

Perhaps the most robust approach for high-stakes domains is treating model outputs as drafts requiring verification rather than ground truth. This is a process solution rather than a technical one.

Downstream automated verification — checking generated citations against databases, running generated code in a sandbox, comparing generated numerical outputs against calculated values — can catch a large fraction of factual errors before they reach users.
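Citation checking, the failure mode from the opening anecdote, is the simplest of these to sketch. Here a plain Python set stands in for an authoritative legal index; the second citation below is widely reported as one of the fabricated cases from the ChatGPT incident, but treat the specifics as illustrative:

```python
def verify_citations(citations, known_cases):
    """Split model-generated citations into verified and unverified
    lists by checking each against an authoritative index (here a
    plain set standing in for a real citation database)."""
    verified = [c for c in citations if c in known_cases]
    unverified = [c for c in citations if c not in known_cases]
    return verified, unverified

# Hypothetical index; a production system would query a real service.
known_cases = {"Brown v. Board of Education, 347 U.S. 483 (1954)"}

generated = [
    "Brown v. Board of Education, 347 U.S. 483 (1954)",
    "Varghese v. China Southern Airlines, 925 F.3d 1339 (11th Cir. 2019)",
]
verified, unverified = verify_citations(generated, known_cases)
# Anything in `unverified` is flagged for human review before use.
```

The design principle generalizes: the model proposes, a cheaper deterministic system disposes, and only claims that survive the check reach the user unflagged.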

The medical AI literature is beginning to establish structured approaches to this: models that produce diagnoses or treatment suggestions must link each claim to evidence, and a separate verification system checks those links. This mirrors how medical evidence-based practice works — claims must be grounded in cited sources.

The Deeper Problem: Calibration

Perhaps more dangerous than the hallucinations themselves is the confidence with which they are delivered. A model that says “I think the capital of Australia might be Sydney, but I’m not certain” is far safer than one that confidently states “The capital of Australia is Sydney” without hedging.

Calibration refers to the alignment between a model’s expressed confidence and its actual accuracy. A well-calibrated model that claims 80% confidence in a statement should be right about 80% of the time across all such statements.
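That definition suggests a direct measurement: bucket a model's statements by stated confidence and compare each bucket's stated confidence to its observed accuracy. A sketch with invented data:

```python
def calibration_by_bucket(predictions):
    """Group (stated_confidence, was_correct) pairs by confidence
    level and return observed accuracy per level, for comparison
    against the stated confidence."""
    buckets = {}
    for confidence, correct in predictions:
        buckets.setdefault(confidence, []).append(correct)
    return {
        conf: sum(outcomes) / len(outcomes)
        for conf, outcomes in buckets.items()
    }

# Illustrative data: a model claims 80% confidence on ten statements.
# Well-calibrated behavior means roughly eight should be correct.
preds = [(0.8, True)] * 8 + [(0.8, False)] * 2
accuracy = calibration_by_bucket(preds)
```

On this toy sample the 0.8 bucket's observed accuracy is exactly 0.8, so the model is well calibrated there; a poorly calibrated model would show, say, 0.4 accuracy in the same bucket.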

Current language models are often poorly calibrated, particularly in domains where they have sparse training data. They express similar confidence in things they reliably know and things they are essentially guessing. This is particularly problematic because human users tend to update heavily on apparent confidence; a confidently stated falsehood is more persuasive than a hesitantly stated truth.

Improving calibration is an active research area. Techniques include training specifically on calibration objectives, post-hoc probability calibration on model outputs, and architectures that maintain explicit uncertainty estimates rather than collapsing uncertainty into the generation distribution.

Accepting the Limitation

There is a temptation in AI deployments to treat hallucination as a solvable bug — something that will be fixed in the next model version, making the entire problem go away. This is probably not the right frame.

Language models are, at a fundamental level, probability machines over language. Their reliability depends on the statistical structure of their training data and the alignment between that structure and factual reality. The more a truthful answer looks statistically like truthful training text, the better the model performs. The more the model is asked to reason about domains with thin or contradictory coverage in training data, the more it will confabulate plausible-sounding responses.

This does not mean hallucination cannot be reduced — it can, substantially, through the mechanisms described here. It means the expectation that it can be eliminated is probably wrong, and system design should reflect that.

The most reliable systems will be those that treat the language model as one component in a layered architecture — responsible for language understanding and generation, but not for ground truth — and build verification, retrieval, and human oversight into the overall system design.

The lawyer who submitted the fictional cases was not wrong to use an AI assistant for legal research. He was wrong to submit unverified output as legal fact. That distinction — between a useful tool and an authoritative source — is the essential one.