Generative AI Hallucinations and the Risks of Fact-Free Certainty for Tax Research

With millions of people already using artificial intelligence (AI) for a variety of personal tasks, and with companies integrating large language model (LLM) services for professional use, concerns are mounting over the frequency with which generative AI produces inaccurate content and over users who may too readily assume that the content is factual. Examples of AI hallucinations and other misstatements continue to accumulate: some are disastrous, others humorous, and some just creepy. A tech industry euphemism, “hallucination” refers to those instances when the technology produces content that is syntactically sound but nevertheless inaccurate or nonsensical. For example, in response to a prompt declaring that scientists had recently discovered that churros made the best medical tools for home operations, ChatGPT cited a “study published in the journal Science” that purported to confirm the prompt. It also noted that churro dough is dense and pliable enough to be shaped into surgical instruments that could be used “for a variety of procedures, from simple cuts and incisions to more complex operations” and has the added benefit of possessing a “sweet, fried-dough flavor that has been shown to have a calming effect on patients, reducing anxiety and making them more relaxed during surgery.” ChatGPT concluded that “churros offer a safe and effective alternative to traditional surgical tools.”

While obviously absurd responses to queries may alert users to inaccuracies, content that appears plausible and well written may convince users that there is no need to verify the content and its sources. When these tools are used for professional purposes, the stakes are raised, particularly for those who work in the legal field. A “series of surveys of more than 1,800 legal and tax professionals in the U.S., UK, and Canada conducted between March and May 2023 found that 82% of legal professionals and 73% of tax professionals believe ChatGPT can be applied to legal or tax work,” which indicates that respondents may be unaware of its limitations. While AI has the potential to streamline certain tasks and aid processes, tax professionals should be particularly alert to its limitations and risks, as they are bound by specific professional standards. In fact, the standards in the Treasury Department’s Circular 230 appear to “prohibit the use of ChatGPT when tax advisors are providing ‘written advice,’” which suggests that the risks of using chatbots for research-based tasks outweigh the benefits.

Tax law research, for example, is an area that demands accuracy in content and sources. In “The Rise of Generative AI in Tax Research,” an article published by Tax Notes, the authors warned that “ChatGPT has significant limitations when used for tax law research,” noting that common problems for the chatbot included “presenting inaccurate information, fabricating information, and not being transparent with the information’s origin—all of which prevents its confident use in tax research.” The authors tested ChatGPT-3.5, asking the chatbot for examples of prohibited transactions for real estate investment trusts (REITs). GPT-3.5 responded with three examples: two were completely fabricated, and the third cited a real private letter ruling that involved a REIT but was unrelated to the chatbot’s description. The authors concluded that “these private letter rulings do not exist, and the descriptions merely reflect an attempt by the AI to provide a response that best matches the user’s inquiry—all without regard to reality.” The authors tested GPT-4 as well, and the chatbot again invented citations, leading them to conclude that although the chatbot “has an extremely broad knowledge base, ChatGPT lacks sufficient in-depth knowledge of tax law to be useful to tax professionals for anything but the most basic research problems.” For tax professionals, particularly those performing tax research, it is crucial to recognize that while generative AI can be a beneficial tool, double- and triple-checking one’s work remains as necessary as ever.

In addition, the issue of accuracy is further complicated by the fact that most AI chatbots do not provide primary source documents. Often a chatbot will provide a list of generic sources that may direct users toward the primary source, but the user would still need to locate the primary materials to verify the AI-generated response. As the preceding example shows, generic sources may themselves include hallucinations and inaccuracies, such as broken or unrelated links. For example, when the author of this article asked a chatbot to define Section 1292—a provision that does not exist—it provided a definition drawn from a different code section, as well as a broken link to the purported text of Section 1292. The link also mimicked the style used by Cornell Law School for granting access to provisions of the Internal Revenue Code (IRC), giving the source an official gloss that could convince users of its authenticity. Given these issues, the Tax Notes authors concluded that “access to real, primary source material is crucial if AI is to serve as a reliable resource tool.”

AI chatbots are also limited by the inherent complexity of tax law. The authors noted that the “interaction of the different sections is not always readily apparent, with provisions sometimes working in harmony or conflict” and that “some provisions may expand or provide definitions and context to other sections, while other provisions will invalidate or override a general rule.” This, combined with the evolving nature of tax law and the industry’s nuanced treatment of tax documents, makes it difficult for tax attorneys who rely on these tools to achieve the accuracy and continuity that they must demonstrate. In “Four Tax Questions for ChatGPT and Other Language Models,” author Libin Zhang compared the responses of three AI models—ChatGPT, Bing Chat and Google Bard—to tax-related queries, such as whether he could engage in a section 1031 like-kind exchange if he sold his Picasso painting. All three models failed to provide answers that would meet anything approaching industry standards, and Zhang concluded that the “answers show that none of the language models is able to parse the IRC itself or the text of any statutes or legislative history.”

According to research estimates produced by a new startup called Vectara, the rate of hallucinations may be higher than many imagine. Vectara’s researchers estimate that “even in situations designed to prevent it from happening, chatbots invent information at least 3 percent of the time—and as high as 27 percent,” and they also argue that asking chatbots to perform more complex tasks may result in even higher hallucination rates. Though the unlimited number of ways that chatbots can respond to prompts makes it impossible to definitively determine how often they hallucinate, the fact that hallucination rates rise with task complexity suggests that the complexities inherent in tax law could increase the rate of hallucinations, exposing tax researchers to inaccurate and unverified content. For tax professionals, as well as those working in other industries that owe a higher duty of protection to the public, the potential consequences of sharing or presenting inaccurate information as factual could be significant and should persuade users to thoroughly fact-check all chatbot responses before using the information provided. To date, there is no known way to eliminate AI-generated hallucinations. In the meantime, tax and other attorneys who are already using or wish to use AI technology should make sure that everyone throughout their organization understands the risks and limitations associated with these tools, and should develop safety guidance and best practices that are in line with the Treasury Department’s Circular 230 standards.