Hallucination in Legal and Regulatory Queries - The Four Mitigation Patterns

Hallucination - the AI producing plausible-sounding statements that are not supported by any source - has become the headline risk in deploying language models for regulated work. The honest position in 2026 is that hallucination has fallen sharply in well-engineered systems but has not gone away, and the failure mode is unusually dangerous in legal and regulatory queries because a wrong answer often looks exactly like a right one.

This piece sets out why hallucination happens specifically in legal and regulatory contexts and the four mitigation patterns that actually work in an accounting firm. None of the four is sufficient on its own; the firms that have got the deployment right run all four in combination.

Why legal and regulatory queries are unusually hallucination-prone

There are three reasons legal and regulatory questions provoke hallucination more reliably than most other queries. First, the surface form of a legal answer is highly stylised - section numbers, case names, paragraph references - and that form is easy for a language model to imitate without the underlying substance. Second, the training data is uneven; the model has seen a great deal of plausible legal writing but not necessarily the current text of every provision it cites. Third, users tend to phrase legal questions in a way that asks for an authoritative answer, and the model is trained to be helpful, which biases it toward providing an answer even when the right answer is "I do not know".

The result is that a model asked "what does section X of the Corporation Tax Act say about Y?" will frequently produce a confident answer that looks correct, includes a section number, and is wrong. This is the hardest hallucination to catch because the consumer of the answer often does not have time to verify the citation.

Mitigation 1 - retrieval-augmented generation with strict grounding

The first and most important mitigation is RAG - retrieval-augmented generation - configured to answer only from retrieved documents. Instead of asking the model what it knows about the Corporation Tax Act, the system retrieves the actual passages of the Corporation Tax Act, passes them to the model alongside the question, and instructs the model to answer using only those passages with citations.

The mitigation works because the model is no longer relying on its statistical memory; it is reading a passage in front of it. The remaining failure modes - misreading the passage, picking the wrong passage, summarising away an important qualifier - are real but materially smaller than the risks of free-form generation. The discipline is to insist on strict grounding (no answer beyond the retrieved passages) and to make the citation visible to the reader so verification is fast.
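As a rough illustration of strict grounding, the sketch below assembles a prompt from passages that have already been retrieved. The Passage type, the build_grounded_prompt function, and the placeholder document identifiers are all hypothetical names chosen for this example; retrieval itself and the call to the model are outside its scope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str  # hypothetical document identifier
    ref: str     # hypothetical passage reference, such as a section number
    text: str    # the passage text exactly as retrieved


def build_grounded_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt that restricts the model to the retrieved passages."""
    if not passages:
        # With nothing retrieved there is nothing to ground the answer in.
        raise ValueError("strict grounding needs at least one retrieved passage")
    # Number the passages so the model can cite them and the reader can check them.
    sources = "\n".join(
        f"[{i}] {p.doc_id} {p.ref}: {p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using ONLY the numbered passages below. "
        "Cite the passage number after every statement. "
        "If the passages do not answer the question, reply exactly: "
        "'I cannot answer this from the documents provided.'\n\n"
        f"Passages:\n{sources}\n\nQuestion: {question}\n"
    )


# Placeholder content only; no real statutory text is quoted here.
example = build_grounded_prompt(
    "What does s.1 say?",
    [Passage("EXAMPLE-ACT", "s.1", "Placeholder passage text.")],
)
```

Numbering the passages in the prompt is a design choice that keeps the citation short enough to repeat after every statement while still mapping back to a specific document and reference.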

Mitigation 2 - forced citation per statement, not per answer

A weaker form of RAG cites the documents at the end of the answer, leaving the reader to work out which sentence came from which source. The stronger form cites each statement individually, so that the reader can verify the chain at the sentence level. This is the standard that an accounting working paper file expects, and it is the standard the AI tool should be configured to meet.

In practice this means the prompt or the system configuration includes an explicit instruction along the lines of "every statement must be supported by a citation to a specific passage in a specific document; if a statement cannot be supported, do not present it as part of the answer and flag it as [UNSUPPORTED] instead". A model instructed this way produces far more verifiable output than one that is not.
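The per-statement rule can also be checked mechanically before a human ever sees the output. The sketch below assumes a convention that is not prescribed anywhere in this piece: the model writes one statement per line and ends each with numbered markers such as [1] that refer to the retrieved passages. The check_citations name is hypothetical.

```python
import re

# Assumed convention: citations are numbered markers such as [1] or [2].
CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer: str, passage_count: int) -> list[tuple[str, str]]:
    """Return (statement, problem) pairs for statements that break the citation rule."""
    problems = []
    for line in answer.splitlines():
        statement = line.strip()
        if not statement:
            continue  # skip blank lines between statements
        cited = [int(n) for n in CITATION.findall(statement)]
        if not cited:
            # Covers uncited statements and anything flagged [UNSUPPORTED].
            problems.append((statement, "no citation"))
        elif any(n < 1 or n > passage_count for n in cited):
            problems.append((statement, "cites a passage that was not retrieved"))
    return problems


issues = check_citations("Statement A [1]\nStatement B\nStatement C [5]", 2)
# issues flags "Statement B" (no citation) and "Statement C [5]" (passage 5 does not exist)
```

A check like this only confirms that a citation is present and points at a real passage. Whether the passage actually supports the statement is still a question for the reviewer.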

Mitigation 3 - refusal pathway with no penalty

The third mitigation is to build a refusal pathway and to make sure the model uses it. A well-configured legal or regulatory AI tool should refuse cleanly when the answer is not in the retrieved documents, rather than producing a partial or speculative answer. The user experience should be that "I cannot answer this from the documents provided" is a normal, expected output, not a failure.

This is partly a system prompt question and partly a user training question. The user has to be comfortable with the refusal - the temptation otherwise is to keep rephrasing the question until the model produces something, which is exactly the behaviour that produces hallucinated answers. A firm that treats refusals as failures will end up with a tool that does not refuse, which is to say a tool that hallucinates.
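One way to support the refusal pathway on the system side, sketched here under stated assumptions, is to refuse before generation whenever retrieval finds nothing relevant enough. The relevance scores, the min_score threshold, and the generate callable are all hypothetical stand-ins for whatever the retrieval layer and model client actually provide.

```python
from typing import Callable, Sequence

REFUSAL = "I cannot answer this from the documents provided."


def answer_or_refuse(
    scored_passages: Sequence[tuple[str, float]],
    min_score: float,
    generate: Callable[[list[str]], str],
) -> str:
    """Refuse up front when no retrieved passage clears the relevance threshold."""
    relevant = [text for text, score in scored_passages if score >= min_score]
    if not relevant:
        # The refusal is an ordinary return value, not an exception.
        return REFUSAL
    return generate(relevant)


def is_refusal(output: str) -> bool:
    """Let downstream code recognise a refusal as a normal, expected outcome."""
    return output.strip() == REFUSAL


# A stub stands in for the real model call in this example.
def stub_generate(passages: list[str]) -> str:
    return f"answered from {len(passages)} passage(s)"


result = answer_or_refuse([("weakly related text", 0.2)], 0.5, stub_generate)
```

Returning the refusal as a plain value rather than raising an error mirrors the point above: the refusal is a normal output that the rest of the workflow handles without treating it as a failure.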

Mitigation 4 - structured human review with citation spot-check

The fourth mitigation is procedural. Every AI output that enters a working paper file is reviewed by a qualified team member, and the review includes a spot-check of the citations. The standard is not "does the answer sound right?" - it is "does the cited passage actually say what the answer says it says?".

In practice the reviewer picks three or four citations at random and checks them. If all are correct, the rest are accepted with reasonable confidence. If one is wrong, the whole output is treated as suspect and re-done. This is the discipline that catches the residual hallucination that the technical mitigations have not eliminated.
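The spot-check can be made reproducible, and its strength estimated, with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the figures (20 citations, 2 of them wrong, 4 checked) are invented for the arithmetic rather than drawn from any firm's data.

```python
import random
from math import comb
from typing import Optional


def pick_spot_checks(citations: list[str], k: int, seed: Optional[int] = None) -> list[str]:
    """Choose up to k citations uniformly at random for the reviewer to verify."""
    rng = random.Random(seed)  # a recorded seed makes the selection reproducible
    return rng.sample(citations, min(k, len(citations)))


def chance_of_catching(total: int, bad: int, checked: int) -> float:
    """Probability that a random sample of `checked` citations includes a bad one."""
    if checked >= total:
        return 1.0 if bad > 0 else 0.0
    # 1 minus the probability that every sampled citation is a good one.
    return 1.0 - comb(total - bad, checked) / comb(total, checked)


# Illustrative figures: 20 citations, 2 wrong, 4 checked.
p = chance_of_catching(20, 2, 4)  # 1 - 12/19 = 7/19, roughly 0.37
```

Under these assumed figures, a four-citation sample finds at least one of the two errors a little over a third of the time, which is one reason the spot-check works as a backstop to the technical mitigations rather than a replacement for them. The sample size is the firm's choice, and checking more citations raises the catch rate.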

The four together - defence in depth

No one of the four mitigations is sufficient. RAG without forced citation produces answers the reader cannot verify statement by statement. Forced citation without a refusal pathway pushes the model toward inventing citations that merely look supportive. A refusal pathway without human review still lets through the plausible-looking answer that the model should have refused to give. Human review without the technical mitigations is unaffordably slow.

The combination - RAG, per-statement citation, clean refusal, structured human review - produces a system in which hallucination is rare, visible when it does occur, and catchable before the output enters a client file. That is the standard a regulated firm should be insisting on from any AI tool it uses for legal or regulatory work.

What does not work

It is worth being explicit about the mitigations that do not work, despite their popularity. "Asking the model to be careful" in the prompt does not work - the model cannot reliably assess its own uncertainty. "Using a more powerful model" does not work - larger models hallucinate less on average but still hallucinate on legal queries, and the user is now even more inclined to trust the output. "Putting a disclaimer at the bottom of the output" does not work - disclaimers do not prevent the output from being relied on in practice.

The only mitigations that work are structural - change what the model has access to, force how it answers, give it permission to refuse, and verify what comes out. Everything else is decoration.

What this means for tool selection

The four mitigations together produce a clear specification for selecting an AI tool for accounting work. The tool must operate in a docs-only or strictly-grounded mode over documents the firm controls. It must cite each statement to a specific passage. It must have a clear refusal pathway and use it. It must produce output in a format that supports structured human review. A tool that meets these four requirements is fit for purpose. A tool that does not meet them is not.

Most consumer chatbots fail at least three of the four. Most purpose-built professional services AI tools meet all four. The difference in price reflects the difference in risk.

How Accupe helps

Accupe's AI document analysis is built around these four mitigations. The docs-only mode operates strictly over the documents you have uploaded; each statement is cited back to a specific passage; the tool refuses cleanly when the answer is not in the documents; the three modes (Fast, Planning, Ultra-Detailed) and the integration with the working paper workflow make structured human review fast. The result is a tool that the firm can put on its approved list with a defensible answer to the insurer's question about hallucination control. Per-firm pricing from £20/month.