Two papers matter here — Xu et al. (arXiv 2401.11817) and Banerjee et al. (arXiv 2409.05746). Together they establish the same thing from different angles.
The core proof (Xu et al.):
Under their formal definition, hallucination is inevitable for any computable LLM, regardless of model architecture, learning algorithms, prompting techniques, or training data.
The proof has three steps:
Theorem 1: All currently proposed LLMs are polynomial time bounded, meaning they are contained in a computably enumerable set of LLMs.
Every LLM has to produce an answer in a reasonable amount of time. When you send a prompt, you don't wait a year for a response. That time constraint, the fact that the model must finish within a practical window, is what "polynomial time bounded" means. The model can only perform a certain number of computational steps before it has to output something. Some problems require more steps than that ceiling allows, and when a problem exceeds the ceiling, the LLM either times out or guesses. That guess is a hallucination. Because every LLM must operate within a time budget, and because the set of LLMs that respect that budget is computably enumerable, you can prove mathematically that none of them, no matter which one you build, can cover all the ground truth. The hallucination gap isn't an engineering problem. It's a property of the list itself.
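The engine of the proof is classical diagonalization, and the core idea fits in a few lines of code. Here is a toy sketch, with lambda functions standing in for an enumeration of LLMs; real models and ground truths are vastly richer, but the logic is the same. Given any enumerated list of answer functions, you can define a ground truth that disagrees with the i-th function on input i, so no function in the list is ever fully correct:

```python
# Toy illustration of the diagonalization idea behind the theorems.
# Each "model" maps an input index to an answer; the list stands in
# for a computably enumerable set of LLMs.
models = [
    lambda n: n,        # model 0 answers n
    lambda n: n + 1,    # model 1 answers n + 1
    lambda n: 0,        # model 2 always answers 0
]

def ground_truth(n: int) -> int:
    """Defined to disagree with model n on input n (the diagonal)."""
    return models[n](n) + 1

# Every model in the enumeration is wrong somewhere:
for i, model in enumerate(models):
    assert model(i) != ground_truth(i)
```

Extending the list doesn't help: whatever model you append becomes model k, and the diagonal ground truth disagrees with it at input k. That is why the result covers every model in the enumerable set at once.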
Theorem 2: Even when removing the P-provability constraint, all LLMs in a computably enumerable set will hallucinate on infinitely many inputs.
The first version of the proof started with a restricted assumption: LLMs that can prove their own outputs are correct before returning them. Think of it as a model that checks its work internally and only outputs answers it can mathematically verify. That's the P-provability constraint. It's an idealized, more capable version of an LLM than actually exists — one with a built-in self-verification step.
Theorem 2 says: fine, forget the self-verification requirement. Drop it entirely. Consider any LLM — verified or not, self-checking or not, simple or state-of-the-art. The widest possible definition of what an LLM can be.
The result is the same. Every LLM in that broader set still hallucinates on infinitely many inputs.
Theorem 1 had already shown that even that idealized, self-verifying model hallucinates.
Someone could read Theorem 1 and argue: "okay, but what if we build a better model — one that verifies its outputs before returning them?" Theorem 2 answers that directly. The verification capability doesn't matter. The hallucination isn't coming from a failure to check the work. It's coming from a fundamental mismatch between what any LLM can compute and what ground truth requires.
Theorem 1 proved the weakest LLMs hallucinate. Theorem 2 proved all LLMs hallucinate — no matter how capable, no matter how much self-checking you add. The problem isn't fixable by making the model smarter at verifying itself. The gap between what an LLM can produce and what is actually true is permanent and infinite, regardless of architecture.
Theorem 3: Every computable LLM will hallucinate on infinitely many inputs, regardless of its architecture, training, or any other implementation detail. Even with advanced techniques like prompt-based methods, LLMs cannot completely eliminate hallucinations.
What is a "computable LLM"?
"Computable" means the model runs on a computer, takes an input, and produces an output through a defined mathematical process. If it runs on hardware, it's computable, which means every LLM that has ever been built or ever will be built is computable. This isn't a special category; it's just the researchers being precise. They're saying: any LLM you could actually deploy in the real world.
"Infinitely many inputs"
This is the part that stings. The proof doesn't say an LLM will hallucinate occasionally, or frequently, or even most of the time. It says for every LLM ever built, there exists an infinite collection of inputs on which it will be wrong. Not a finite list of edge cases you could patch. An infinite one. Fix the hallucination on input A, the gap still exists on inputs B, C, D, and infinitely more you haven't found yet.
It gets worse. Beyond computation limits, AI systems face a second problem, the one Banerjee et al. develop, rooted in Gödel's Incompleteness Theorems: no sufficiently expressive formal system can prove its own consistency, so no such system can fully verify itself. An LLM cannot predict whether its own output will be correct before it generates it.
It has no internal alarm that fires before it invents a wrong answer.
And fact-checking doesn't close the gap — no fact-checking process is complete in a finite number of steps.
You can always add more checks. You can never add enough.
Every LLM you can build — on any hardware, with any data, using any technique — will be wrong on an infinite number of possible inputs. There is no architecture that escapes this. There is no training run that closes the gap. There is no prompting technique that eliminates it. This isn't a flaw in today's models. It's a property of all models, forever. The only honest response is to stop trying to fix the LLM and start engineering the system around it.
That's exactly what multi-agent architecture, grounding, and constrained output structures do. They don't fix the LLM. They make the system reliable despite the LLM.
This works because the theorems prove hallucination is inevitable over an infinite input space, and real deployments don't have infinite input spaces. They have bounded ones.
An SDR AI that researches prospects and writes relevant sales sequences isn't being asked to solve every possible problem. It's being asked to handle a finite, predictable category of research and customization. The math still applies, but the hallucination surface shrinks to the point of practical irrelevance when you engineer the task correctly.
This is the insight most AI vendors won't say out loud: the problem was never whether the model hallucinates. It's whether the model is ever asked to do something it shouldn't be doing in the first place.
Context, guardrails, and agent architecture are the three engineering mechanisms that solve this. Here's how each one works in business terms.
Context (RAG, grounding, retrieval): Narrows the output space by anchoring generation to a verified document or data source. Instead of the LLM sampling from the full probability distribution over everything it has ever seen, it samples conditioned on retrieved ground truth. This directly attacks the root cause: the gap between the model's compressed parametric knowledge and the actual ground truth function. It doesn't prevent hallucination on tasks outside the retrieved context, but within-domain accuracy goes up dramatically.
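A minimal sketch of the grounding pattern, assuming a hypothetical document list and a deliberately naive keyword retriever (production systems use vector search and a real LLM call; the point here is only that generation is conditioned on a retrieved source or refused):

```python
def retrieve(query, documents):
    """Naive keyword-overlap retrieval; real systems use vector search."""
    scored = [(sum(w in doc.lower() for w in query.lower().split()), doc)
              for doc in documents]
    best = max(scored)
    return best[1] if best[0] > 0 else None

def grounded_prompt(query, documents):
    """Anchor generation to a retrieved source instead of parametric memory."""
    source = retrieve(query, documents)
    if source is None:
        return None  # refuse rather than let the model free-generate
    return (f"Answer using ONLY the source below. "
            f"If the source does not contain the answer, say so.\n\n"
            f"Source: {source}\n\nQuestion: {query}")

docs = ["Our enterprise plan includes SSO and a 99.9% uptime SLA.",
        "Support hours are 9am-6pm ET, Monday through Friday."]
prompt = grounded_prompt("What uptime does the enterprise plan include?", docs)
```

The refusal branch matters as much as the retrieval: when nothing relevant is found, the system returns nothing instead of letting the model sample from parametric memory.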
Guardrails and fences: These work at the output layer. Structural guardrails (output schemas, format enforcement, constrained decoding) eliminate an entire class of hallucinations by removing the model's ability to generate outside a defined response space. Semantic guardrails (classifiers, NLI checks, consistency validators) catch outputs that contradict grounded facts. The limitation: guardrails can only block what they're designed to check. They don't fix the generative process — they filter its output.
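A structural guardrail can be as simple as a schema check at the output layer. A sketch assuming a hypothetical intent-classification worker whose entire legal response space is three intents plus a confidence score; anything else, including a fabricated intent, is rejected before it reaches downstream systems:

```python
import json

# Hypothetical response space: the only intents the worker may emit.
ALLOWED_INTENTS = {"schedule_demo", "send_pricing", "escalate_to_human"}

def structural_guardrail(raw_output):
    """Reject any model output that falls outside the defined response space."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None  # not even valid JSON: blocked
    if set(data) != {"intent", "confidence"}:
        return None  # extra or missing fields: blocked
    if data["intent"] not in ALLOWED_INTENTS:
        return None  # fabricated intent: blocked
    conf = data["confidence"]
    if not (isinstance(conf, (int, float)) and 0 <= conf <= 1):
        return None
    return data

# A hallucinated action never reaches downstream systems:
assert structural_guardrail('{"intent": "delete_all_records", "confidence": 0.9}') is None
assert structural_guardrail('{"intent": "send_pricing", "confidence": 0.8}') is not None
```

This is the filtering limitation from the paragraph above made concrete: the check can only block what it was designed to check (shape, intent, range), not judge whether "send_pricing" was the semantically right choice.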
Agent architecture (tool use, multi-agent verification): This is the most powerful mitigation class. Decomposing complex tasks into sub-agents with specialized scopes reduces each agent's exposure to hallucination-prone problem classes. Tool calls externalize computation — instead of the LLM computing an answer, it calls a calculator, database, or API where the ground truth exists outside the model's parameters. Multi-agent debate and verification architectures catch inconsistencies by having multiple agents cross-check outputs. None of this eliminates hallucination per the theorems, but it systematically removes the high-hallucination problem categories from the model's responsibility.
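Tool externalization in miniature: instead of asking the model to do arithmetic (a notoriously hallucination-prone task), the system routes it to exact computation. `call_llm` is a hypothetical placeholder for a grounded model call, and the `calc:` routing convention is illustrative, not a real API:

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expression: str):
    """Exact arithmetic via a safe AST walk: a tool call, not model sampling."""
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression.strip(), mode="eval").body)

def call_llm(question: str) -> str:
    raise NotImplementedError("placeholder for a grounded model call")

def answer(question: str):
    """Route arithmetic to the tool; everything else to the model."""
    if question.startswith("calc:"):
        return calculator(question[len("calc:"):])
    return call_llm(question)
```

The ground truth for `17 * 23` now lives in the interpreter's arithmetic, not in the model's parameters, so this entire problem class is removed from the model's responsibility.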
The core insight: these mechanisms work by shrinking the problem space each model call is responsible for. The theorems prove hallucination is inevitable over an infinite input space. Practical deployment is about engineering a finite, bounded task space where the model operates, and routing everything outside that space to external systems.
The paper's practical recommendation is exactly what EverWorker builds: external checks, bounded scopes, and constrained problem spaces. Here's how EverWorker's three layers map to the theory.
Instruction Sets — eliminating behavioral hallucination
Most enterprise AI failures aren't factual hallucinations — they're behavioral ones. The model does something outside its intended scope, takes an action it wasn't supposed to take, or interprets ambiguous instructions wrong. EverWorker instruction sets establish explicit behavioral contracts: what the worker does, what it doesn't do, how it handles edge cases, and what failure mode it defaults to. This is fence-and-guardrail logic applied at the worker definition layer rather than the output filter layer. The constraint happens before generation, not after.
The key leverage: instruction sets define the bounded problem space the theorems require. A worker told "score inbound leads against these 5 criteria using this rubric, return JSON matching this schema" is operating in a fundamentally different (and safer) problem space than a general assistant asked to evaluate leads. The hallucination surface shrinks to near-zero because the task is no longer general-purpose.
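That behavioral contract can be enforced in code. A sketch assuming a hypothetical five-criterion rubric (the criterion names are illustrative, not EverWorker's actual schema) and a worker that must return a JSON object scoring exactly those criteria:

```python
import json

# Hypothetical rubric: the only criteria the worker is allowed to score.
RUBRIC = {"budget", "authority", "need", "timeline", "fit"}

def validate_lead_score(raw_output: str) -> int:
    """Enforce the contract: exactly the rubric criteria, each an integer 0-10."""
    data = json.loads(raw_output)
    if set(data) != RUBRIC:
        raise ValueError("output must score exactly the rubric criteria")
    for criterion, score in data.items():
        if not (isinstance(score, int) and 0 <= score <= 10):
            raise ValueError(f"{criterion}: score must be an integer from 0 to 10")
    return sum(data.values())
```

A general assistant can say anything about a lead; this worker can only emit five bounded integers, so the space in which it can hallucinate is five numbers, not open-ended prose.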
Agent Architecture Templates — eliminating structural hallucination
This is where EverWorker addresses the paper's central argument directly. Rather than using a single monolithic LLM call to handle a complex task, EverWorker's multi-agent templates decompose tasks into specialist workers with narrow scopes. A researcher worker retrieves and cites. A writer worker formats and structures. A validator worker cross-checks. An executor worker takes action.
Each worker in the pipeline only handles the problem class it's been optimized for. The researcher doesn't generate — it retrieves. The validator doesn't create — it checks. This maps directly to the paper's own hallucination-prone taxonomy: tasks that exceed a single polynomial-time model call get broken into subtasks that each fall inside the hallucination-safe zone. Tool use (web search, CRM API, database query) externalizes the ground truth computation entirely, removing the model from the loop for factual claims.
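The decomposition above can be sketched as a pipeline of narrow workers. The `writer` is passed in as a hypothetical callable (in a real system it would be its own grounded model call), and the validator is a deliberately crude substring check standing in for NLI-style verification:

```python
def researcher(query, knowledge_base):
    """Retrieves and cites; never generates facts."""
    return [doc for doc in knowledge_base if query.lower() in doc.lower()]

def validator(draft, sources):
    """Checks that every claim in the draft appears in a retrieved source."""
    claims = [s.strip() for s in draft.split(".") if s.strip()]
    return all(any(claim in src for src in sources) for claim in claims)

def pipeline(query, knowledge_base, writer):
    """Researcher -> writer -> validator, with refusal as the default."""
    sources = researcher(query, knowledge_base)
    if not sources:
        return "insufficient sources"   # refuse rather than invent
    draft = writer(sources)             # hypothetical writer worker
    return draft if validator(draft, sources) else "validation failed"
```

Each stage can only fail in its own narrow way: the researcher can miss a document, the validator can reject a draft, but no single stage is positioned to invent a fact and pass it downstream unchecked.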
Knowledge Engine — eliminating factual hallucination
The knowledge engine is the RAG layer, but operationalized at scale. By grounding every worker call in indexed, versioned, company-specific knowledge, EverWorker removes the reliance on the model's parametric memory — the exact failure mode the theorems describe. The model doesn't need to "know" the answer from training; it retrieves it from a verified source and generates around that grounded anchor.
This directly addresses Xu et al.'s framing: the gap between the LLM's computable function and the ground truth function closes when the ground truth is injected into context. The model isn't being asked to reproduce a computable function it never saw — it's being asked to format and reason over a document that contains the answer.
The theorems are correct. No amount of instruction tuning, RLHF, or architecture improvement eliminates hallucination over an infinite input space. That's not a bug in EverWorker's approach — it's the premise of it.
EverWorker's design philosophy accepts the theorem and builds accordingly: workers are never deployed as general problem solvers. They operate inside engineered task boundaries, grounded knowledge bases, structured output schemas, and multi-agent verification loops. The infinite hallucination space the theorems describe never gets triggered because no EverWorker is ever asked to be a general-purpose oracle.
We architect around the mathematical guarantees so hallucination-prone problem classes never reach the model.