Short version
Hallucination testing checks whether an AI system produces fabricated, unsupported, or misleading content. That could mean invented facts, fake citations, wrong names, bad code suggestions, or claims that sound plausible but are not supported by the source material.
OWASP treats misinformation as a core LLM risk (LLM09 in its 2025 Top 10 for LLM Applications) and names hallucination as one of the major causes. That framing is useful: hallucinations are not a quirky model behavior to shrug off. They are a quality and risk problem that needs deliberate testing.
OWASP puts it plainly: misinformation happens when LLMs produce false or misleading information that appears credible. Hallucinations are a major source of that problem, especially when users over-trust fluent output.
If you want the wider testing context, this page pairs well with "What Is AI Testing?", "How to Test LLM Applications", "LLM Testing for QA Engineers", and "What Is Prompt Injection?". On the standards side, the OWASP AI Testing Guide and NIST's Generative AI Profile (NIST AI 600-1) are the two most useful references to keep nearby.
Why hallucinations are hard to catch
If a regular application throws an error, you notice. If an LLM invents a package name, cites a case that does not exist, or summarizes a policy that was never in the document, the response may still read smoothly. That is what makes the problem slippery. The system fails while sounding helpful.
This gets worse when the user is in a hurry or already expects the system to know the answer. Overreliance is part of the risk. A believable wrong answer is more dangerous than a clumsy one because it travels farther before someone questions it.
What good hallucination tests actually check
A decent hallucination test does more than ask whether the answer looks smart. It checks whether the answer is supportable.
- Is the answer grounded in the source material the system was supposed to use?
- Did the model invent names, figures, or citations?
- Did it overstate confidence when the context was incomplete?
- Did it refuse or hedge when the right answer was not available?
- Did it pull the right evidence and use it accurately?
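Of the checks above, invented citations are among the easiest to automate. The following is a minimal Python sketch, assuming the system cites sources with bracketed IDs such as [doc-1]. That citation convention, the function name, and the sample IDs are all hypothetical and would need adapting to a real system.

```python
import re

# Assumed citation format: a bracketed ID like [doc-1]. Adjust to your system.
CITATION_PATTERN = re.compile(r"\[([A-Za-z0-9_-]+)\]")


def find_invented_citations(answer, allowed_ids):
    """Return cited IDs that are not in the set of sources the model was given."""
    allowed = set(allowed_ids)
    invented = []
    for cited in CITATION_PATTERN.findall(answer):
        # Keep first-seen order and skip duplicates.
        if cited not in allowed and cited not in invented:
            invented.append(cited)
    return invented


# Made-up example: the model was only given doc-1 and doc-2.
answer = "Leave is 20 days per year [doc-1]. Bonuses are paid quarterly [doc-9]."
print(find_invented_citations(answer, ["doc-1", "doc-2"]))
```

A check like this only confirms that a cited source exists. It says nothing about whether the source actually supports the claim, which still needs a faithfulness check.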
In retrieval-heavy systems, a model can retrieve the right paragraph and still answer the question badly. Or it can retrieve the wrong material and then speak very confidently about it. Both cases need their own tests.
How teams measure hallucinations
In practice, teams mix a few approaches.
Known-answer cases
Use prompts where the answer should be traceable to a trusted source set. Then check whether the response stays faithful to that material.
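One way to sketch a known-answer suite in Python is shown below. It assumes each case lists the facts that must appear and any known-wrong variants that must not. The `ask_model` callable is a hypothetical stand-in for however your application is invoked, and the refund-policy case is invented for illustration.

```python
def check_known_answer(answer, must_contain, must_not_contain=()):
    """Return a list of failure messages; an empty list means the case passed."""
    text = answer.lower()
    failures = []
    for fact in must_contain:
        if fact.lower() not in text:
            failures.append(f"missing expected fact: {fact}")
    for wrong in must_not_contain:
        if wrong.lower() in text:
            failures.append(f"contains known-wrong claim: {wrong}")
    return failures


def run_known_answer_suite(cases, ask_model):
    """Run each case through ask_model (hypothetical: prompt -> answer text)."""
    report = {}
    for case in cases:
        answer = ask_model(case["prompt"])
        report[case["prompt"]] = check_known_answer(
            answer, case["must_contain"], case.get("must_not_contain", ())
        )
    return report


# Invented case: the trusted source says the refund window is 30 days.
cases = [
    {
        "prompt": "How long is the refund window?",
        "must_contain": ["30 days"],
        "must_not_contain": ["90 days"],
    }
]
```

Substring matching is deliberately crude. It works best for short, unambiguous facts such as numbers, dates, and names, and it will miss correct answers that are phrased differently.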
Unsupported claim review
Read the answer line by line and flag claims that do not have source support. This is slower, but it catches things automated checks miss.
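A small script can help a reviewer decide where to look first. The sketch below assumes a simple word-overlap heuristic: it splits the answer into sentences and flags those that share few content words with the source. The heuristic, the 0.5 threshold, and the sample text are illustrative assumptions. It is a triage aid, not a replacement for the line-by-line read.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased words longer than three characters, plus any numbers."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 3 or w.isdigit()}


def flag_unsupported(answer, source, min_overlap=0.5):
    """Return (sentence, overlap) pairs whose word overlap with the source is low."""
    source_words = content_words(source)
    flagged = []
    for sentence in split_sentences(answer):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append((sentence, round(overlap, 2)))
    return flagged


# Made-up source and answer for illustration.
source = "Employees accrue 20 days of paid leave per year. Unused leave expires in March."
answer = "Employees accrue 20 days of paid leave per year. Contractors receive a bonus of 500 dollars."
for sentence, overlap in flag_unsupported(answer, source):
    print(f"review: {sentence} (overlap {overlap})")
```

Lexical overlap misses paraphrases and cannot detect a sentence that reuses source words to say something false, so unflagged sentences still deserve a read on high-stakes content.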
Groundedness scoring
Score how well the final response is supported by the retrieved context. Some teams use automated evaluators for this. Most still keep humans involved for ambiguous cases.
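The split between automated scoring and human review can be sketched as a triage step. The Python below assumes an `evaluator(claim, context)` callable that returns a score between 0 and 1. That callable is hypothetical (it could be an LLM judge, an entailment model, or a heuristic), and the 0.4 and 0.8 thresholds are placeholders to tune, not recommended values.

```python
from statistics import mean


def triage_groundedness(claims, context, evaluator, low=0.4, high=0.8):
    """Score each claim, then route the answer by its weakest claim.

    evaluator is a hypothetical callable: (claim, context) -> float in [0, 1].
    """
    if not claims:
        return {"verdict": "needs_human", "mean_score": None, "weakest_claim": None}
    scores = [evaluator(claim, context) for claim in claims]
    worst = min(scores)
    weakest_claim = claims[scores.index(worst)]
    if worst >= high:
        verdict = "grounded"
    elif worst < low:
        verdict = "ungrounded"
    else:
        # Ambiguous band: neither clearly supported nor clearly unsupported.
        verdict = "needs_human"
    return {
        "verdict": verdict,
        "mean_score": mean(scores),
        "weakest_claim": weakest_claim,
    }
```

The verdict follows the weakest claim rather than the average because a single fabricated claim can make an otherwise well-supported answer untrustworthy.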
Refusal testing
A strong system should know when it does not know. Good hallucination testing checks whether the model declines to answer when the context is missing or thin, instead of bluffing.
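A refusal test needs prompts labeled by whether the supplied context can answer them. The Python sketch below assumes refusals can be spotted with a short list of marker phrases. That list is illustrative and incomplete, and the result labels are made up for this example.

```python
# Illustrative marker phrases; a real suite would tune these to the system's wording.
REFUSAL_MARKERS = (
    "i don't know",
    "i do not know",
    "not enough information",
    "could not find",
    "cannot find",
    "can't find",
    "not mentioned in",
)


def looks_like_refusal(answer):
    """True if the answer contains any known refusal or hedging phrase."""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def check_refusal(answer, answerable):
    """Compare refusal behavior with whether the context could answer the prompt."""
    refused = looks_like_refusal(answer)
    if answerable and refused:
        return "over_refusal"      # declined even though the answer was available
    if not answerable and not refused:
        return "possible_bluff"    # answered despite missing or thin context
    return "ok"


print(check_refusal("The warranty lasts five years.", answerable=False))
```

Tracking over-refusals alongside bluffs matters because a system that declines everything would pass a bluff-only test while being useless.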
What RAG helps with, and what it does not
Retrieval-augmented generation can help reduce hallucinations because it gives the model fresh context from a trusted source. OWASP lists RAG as one mitigation for misinformation. That helps, but it does not solve the whole problem.
RAG helps with:
- Outdated model knowledge
- Missing domain context
- Answers that need source grounding
- Traceability for reviews
RAG does not guarantee:
- Correct retrieval every time
- Faithful use of the retrieved text
- Safe answers under prompt injection
- Accurate output without review
You still have to test retrieval quality, answer faithfulness, and whether the model drifts away from the evidence. This is why hallucination testing and RAG testing overlap but are not the same job.
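To keep retrieval problems and faithfulness problems from blurring together, each test result can be tagged by the stage that failed. The Python sketch below assumes every test case names the source document that should have been retrieved, and that a faithfulness judgment (from a human reviewer or an evaluator) is already available. The field names and labels are hypothetical.

```python
from collections import Counter


def classify_rag_result(expected_source_id, retrieved_ids, answer_is_faithful):
    """Attribute a result to the first stage that failed."""
    if expected_source_id not in retrieved_ids:
        # The right evidence never reached the model, whatever the answer said.
        return "retrieval_failure"
    if not answer_is_faithful:
        # The right evidence was retrieved, but the answer drifted from it.
        return "faithfulness_failure"
    return "pass"


def summarize(results):
    """Count outcomes across a list of result dicts."""
    return Counter(classify_rag_result(**r) for r in results)


# Made-up results for illustration.
results = [
    {"expected_source_id": "policy-7", "retrieved_ids": ["policy-7", "faq-2"], "answer_is_faithful": True},
    {"expected_source_id": "policy-7", "retrieved_ids": ["faq-2"], "answer_is_faithful": False},
    {"expected_source_id": "policy-3", "retrieved_ids": ["policy-3"], "answer_is_faithful": False},
]
print(summarize(results))
```

Splitting the counts this way points the fix at the right component: retrieval failures suggest work on indexing or search, while faithfulness failures suggest work on prompting or the model.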
Where human review still matters
There are cases where the answer is too important to leave to automated scoring alone. Legal summaries. Medical guidance. Financial analysis. Code suggestions that could affect security. Anything customer-facing where a fake citation or wrong claim becomes a trust problem fast.
In those cases, the right move is often boring and sensible. Keep a human in the loop. Not for every trivial interaction, but at the approval points where a polished wrong answer could do real damage.
This is also one reason AI testing is becoming a bigger part of software testing work. As more products depend on generated output, quality means more than whether the system responded. It means whether the response deserved to be trusted.
That is also why this topic shows up in broader testing and career conversations. If you want that angle, read "AI Assurance Pro for Testers", "AI Assurance Pro for Managers", "How AI Is Changing Software Testing", and "Will AI Replace Software Testers?".