Large language models (LLMs) and generative AI are fundamentally changing the way businesses operate — and how they manage and use information. They’re ushering in efficiency gains and qualitative improvements that would have been unimaginable only a few years ago.
But all this progress comes with a caveat. Generative AI models sometimes hallucinate. They fabricate facts, deliver inaccurate assertions and misrepresent reality. The resulting errors can lead to flawed assessments, poor decision-making, automation errors and ill will among partners, customers and employees.
“Large language models are fundamentally pattern recognition and pattern generation engines,” points out Van L. Baker, research vice president at Gartner. “They have zero understanding of the content they produce.”
Adds Mark Blankenship, director of risk at Willis A&E: “Nobody is going to establish guardrails for you. It’s critical that humans verify content from an AI system. A lack of oversight can lead to breakdowns with real-world repercussions.”
False Promises
Already, 92% of Fortune 500 companies use ChatGPT. As GenAI tools become embedded across business operations — from chatbots and research tools to content generation engines — the risks associated with the technology multiply.
“There are several reasons why hallucinations occur, including mathematical errors, outdated knowledge or training data and an inability for models to reason symbolically,” explains Chris Callison-Burch, a professor of computer and information science at the University of Pennsylvania. For instance, a model might treat satirical content as factual or misinterpret a word that has different meanings in different contexts.
Regardless of the root cause, AI hallucinations can lead to financial harm, legal problems, regulatory sanctions, and damage to trust and reputation that ripples out to partners and customers.
In 2023, a New York City lawyer who used ChatGPT for research submitted a legal brief that contained egregious errors, including fabricated citations and nonexistent cases. The judge later sanctioned the attorney and imposed a $5,000 fine. In 2024, Air Canada lost a case when it failed to honor the discounted fare its chatbot had quoted to a customer. The case resulted in minor damages and bad publicity.
At the center of the problem is the fact that LLMs and GenAI models are autoregressive: they generate words and pixels by predicting, piece by piece, what is statistically likely to come next, with no inherent understanding of what they are creating. “AI hallucinations, most associated with GenAI, differ from traditional software bugs and human errors because they generate false yet plausible information rather than failing in predictable ways,” says Jenn Kosar, US AI assurance leader at PwC.
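To make that concrete, the toy sketch below shows what autoregressive generation actually does. It uses the small, public gpt2 checkpoint from Hugging Face's transformers library purely as an illustration: the loop repeatedly picks the statistically likeliest next token, and nothing in it ever checks whether the resulting statement is true.

```python
# Minimal sketch of autoregressive decoding. The "gpt2" checkpoint is used only
# because it is small and public; the point is that the loop predicts the next
# most likely token and never verifies the claim it is assembling.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of Australia is", return_tensors="pt").input_ids

for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits      # scores for every vocabulary token
    next_token = logits[0, -1].argmax()       # pick the statistically likeliest one
    input_ids = torch.cat([input_ids, next_token.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))         # plausible text, not a checked fact
```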
The problem can be especially glaring in widely used public models like ChatGPT, Gemini and Copilot. “The largest models have been trained on publicly available text from the Internet,” Baker says. As a result, some of the information ingested into the model is incorrect or biased. “The errors become numeric arrays that represent words in the vector database, and the model pulls words that seem to make sense in the specific context.”
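A rough illustration of Baker's point: once text is reduced to arrays of numbers, retrieval is driven by similarity rather than accuracy. The sketch below, which assumes the open-source sentence-transformers library and a small public embedding model (both illustrative choices), returns whichever passage is most similar to a query, correct or not.

```python
# Sketch: text becomes arrays of numbers (embeddings), and a similarity search
# returns the most *similar* passage, whether or not it is accurate.
# Library and model name are illustrative choices.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

passages = [
    "The standard warranty covers parts and labor for 12 months.",
    "A joke post claiming the warranty covers anything, forever, no questions asked.",
]
vectors = encoder.encode(passages)              # each passage is now a float array

query_vector = encoder.encode("Does the warranty last forever?")
scores = util.cos_sim(query_vector, vectors)[0] # similarity, not truthfulness
print(passages[int(scores.argmax())])           # the joke post may well win
```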
Internal LLMs are at risk of hallucinations as well. “AI-generated errors in trading models or risk assessments can lead to misinterpretation of market trends, inaccurate predictions, inefficient resource allocation or failing to account for rare but impactful events,” Kosar explains. These errors can disrupt inventory forecasting and demand planning by producing unrealistic predictions, misinterpreting trends, or generating false supply constraints, she notes.
Smarter AI
Although there’s no simple fix for AI hallucinations, experts say that business and IT leaders can take steps to keep the risks in check. “The way to avoid problems is to implement safeguards surrounding things like model validation, real-time monitoring, human oversight and stress testing for anomalies,” Kosar says.
Training models with only relevant and accurate data is crucial. In some cases, it’s wise to plug in only domain-specific data and construct a more specialized GenAI system, Kosar says. Sometimes a small language model (SLM) pays dividends. For example, “AI that’s fine-tuned with tax policies and company data will handle a wide range of tax-related questions about your organization more accurately,” she explains.
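As a rough sketch of that idea, the example below fine-tunes a small open model on a domain corpus using Hugging Face's transformers and datasets libraries. The model name, file path and training settings are placeholders rather than recommendations.

```python
# Hedged sketch: fine-tuning a small causal language model on domain-specific
# text (e.g., internal tax-policy documents). The model name, corpus file and
# hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "distilgpt2"  # stand-in for any small language model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical corpus: one policy paragraph per line.
dataset = load_dataset("text", data_files={"train": "tax_policies.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-tax", num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```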
Identifying vulnerable situations is also paramount. This includes areas where AI is more likely to trigger problems or fail outright. Kosar suggests reviewing and analyzing processes and workflows that intersect with AI. For instance, “A customer service chatbot might deliver incorrect answers if someone asks about technical details of a product that was not part of its training data. Recognizing these weak spots helps prevent hallucinations,” she says.
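One lightweight way to spot that kind of weak point is to check whether an incoming question resembles anything the chatbot was actually trained to handle, and route it to a person if not. The sketch below assumes the sentence-transformers library; the topic list and threshold are illustrative, not standards.

```python
# Hedged sketch: flag questions that fall outside the topics a chatbot was
# trained on, so they can be escalated to a human instead of answered blindly.
# The topics and threshold are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

known_topics = [
    "billing and invoices",
    "shipping and delivery times",
    "returns and refunds",
]
topic_vectors = encoder.encode(known_topics)

def is_in_scope(question: str, threshold: float = 0.4) -> bool:
    """Return True if the question resembles a topic the bot was trained on."""
    score = util.cos_sim(encoder.encode(question), topic_vectors).max().item()
    return score >= threshold

print(is_in_scope("Where is my refund?"))                     # likely True
print(is_in_scope("What is the torque spec for model X-9?"))  # likely False: escalate
```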
Specific guardrails are also essential, Baker says. This includes establishing rules and limitations for AI systems and conducting audits with AI-augmented testing tools. It also centers on fact-checking and failsafe mechanisms such as retrieval-augmented generation (RAG), which pulls supporting information from trusted databases or the web before the model responds. Keeping humans in the loop and providing citations that let readers verify a statement or claim can also help.
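A minimal sketch of the RAG pattern looks like this: retrieve passages from a trusted store, then instruct the model to answer only from those passages and to cite them. The document store, retriever and placeholder LLM call below are all illustrative; they are not any specific product's API.

```python
# Hedged sketch of retrieval-augmented generation (RAG): retrieve passages from
# a trusted document store, then ask the model to answer *only* from those
# passages and cite them. The documents and the LLM call are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

trusted_docs = [
    "Policy 4.2: Refunds are issued within 14 days of an approved return.",
    "Policy 7.1: International shipping takes 5 to 10 business days.",
]
doc_vectors = encoder.encode(trusted_docs)

def build_grounded_prompt(question: str, top_k: int = 2) -> str:
    scores = util.cos_sim(encoder.encode(question), doc_vectors)[0]
    top = scores.argsort(descending=True)[:top_k]
    context = "\n".join(f"[{i}] {trusted_docs[i]}" for i in top.tolist())
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )

def answer_with_llm(prompt: str) -> str:
    # Placeholder: send the grounded prompt to whichever LLM the organization uses.
    return prompt

print(answer_with_llm(build_grounded_prompt("How fast are refunds processed?")))
```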
Finally, users must understand the limits of AI, and an organization must set expectations accordingly. “Teaching people how to refine their prompts can help them get better results, and avoid some hallucination risks,” Kosar explains. In addition, she suggests that organizations include feedback tools so that users can flag mistakes and unusual AI responses. This information can help teams improve an AI model as well as the delivery mechanism, such as a chatbot.
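In practice, both ideas can be as simple as a prompt template that tells the model to admit uncertainty and a lightweight way for users to flag bad answers. The sketch below is a generic Python illustration; the template wording and log file name are assumptions.

```python
# Sketch: a refined prompt template that asks the model to admit uncertainty,
# plus a simple feedback log users can append to when a response looks wrong.
# The wording and the log file name are illustrative assumptions.
import json
from datetime import datetime, timezone

PROMPT_TEMPLATE = (
    "You are a support assistant. Answer the question below using only the "
    "provided context. If the context does not contain the answer, reply "
    "'I don't know' instead of guessing.\n\nContext: {context}\n\nQuestion: {question}"
)

def flag_response(question: str, response: str, reason: str,
                  log_path: str = "ai_feedback.jsonl") -> None:
    """Record a user-flagged AI response so the team can review and retrain."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "response": response,
        "reason": reason,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

flag_response("When was the refund policy updated?",
              "It was updated in 1887.", "Date is clearly wrong")
```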
Truth and Consequences
Equally important is tracking the rapidly evolving LLM and GenAI spaces and understanding performance results across different models. At present, nearly two dozen major LLMs exist, including ChatGPT, Gemini, Copilot, LLaMA, Claude, Mistral, Grok, and DeepSeek. Hundreds of smaller niche programs have also flooded the app marketplace. Regardless of the approach an organization takes, “In early stages of adoption, greater human oversight may make sense while teams are upskilling and understanding risks,” Kosar says.
Fortunately, organizations are becoming savvier about how and where they use AI, and many are constructing more robust frameworks that reduce the frequency and severity of hallucinations. At the same time, vendor software and open-source projects are maturing. Concludes Blankenship: “AI can create risks and mitigate risks. It’s up to organizations to design frameworks that use it safely and effectively.”