I just got back from Digital CPA, a conference for the most forward-thinking accounting firms and technology solutions in the market. 2026 will be the year AI goes from interesting to implemented, because there are now reasoning capabilities, corporate budgets, and growing interest.
But accounting firms have been slower to adopt AI than some other industries. And this makes sense: an accountant’s livelihood and reputation depend on the accuracy of financial statements. It’s what they are trained in, and what they are paid to do. So how do we reconcile that against AI hallucinations?
On my flight home I was listening to an episode of the 20MinVC about Harvey, a leading AI legal startup that recently raised $160M (according to the podcast, with $150M ARR, growing 300% YoY, 98% GDR, 168% NDR). And what else do you do from the middle seat of a cross-country flight home? Exactly: open up deep research and dive down the rabbit hole of legal AI, looking for parallels between Harvey, Ironclad, and newer YC companies such as Legora, Spellbook, Crimson, Blueshoe, and more. I tried to cite sources when possible, but understand the market is changing fast, and studies themselves can carry bias from their funders, institutions, and writers. So take this as a starting point rather than an exhaustive study.
Artificial intelligence is unquestionably transforming how professional industries work. From automating mundane tasks to summarizing complex documents or scenarios, AI seems like a superpower at our fingertips. But this new capability is a double-edged sword: the very factors that make AI useful, such as its pattern recognition and generative language, also make it dangerously prone to hallucinations and bias. In high-stakes domains like law, these limitations aren’t just technical bugs; they can lead to incorrect legal conclusions with real consequences.
This is a challenge for builders in the space: balancing opinionated versus flexible solutions in complicated industries, where users may not be as experienced with hallucinations and prompting as, say, engineers using code generation or sales teams using AI for prospecting.
In this post, I wanted to explore two major challenges: AI hallucinations, and the bias we introduce through the way we prompt.
We’ll also draw on research to illustrate why these problems matter and how to mitigate them.
One of the most well-documented issues in generative AI is hallucination: the phenomenon where an AI system produces information that sounds plausible but is inaccurate, fabricated, or misleading.
In a study by Stanford’s Human-Centered Artificial Intelligence (HAI) initiative, researchers benchmarked several legal AI tools and found that they hallucinated in at least 1 out of every 6 queries — even when the queries weren’t ambiguous or opinion-based. In other words, roughly 17% (or more) of the time, the models produced incorrect legal information or cited incorrect or made-up sources.
The study specifically targeted AI tools designed for legal research from major providers. Despite being positioned as specialized tools to assist lawyers, their outputs still contained a significant rate of misinformation, from incorrect answers to faulty or fabricated citations.
This isn’t just an academic concern: there are reported instances in court proceedings where attorneys (or litigants) submitted AI-generated citations to cases that didn’t exist, and judges have sanctioned or criticized them for it.
Now imagine the non-experts who rely on AI for expert legal, accounting, or tax advice. It can provide the illusion of certainty, which is dangerous for the average person who does not understand concepts like precedent or jurisdiction in law, or reconciliation in accounting.
AI models are trained to find patterns in massive text datasets and then generate plausible continuations; they don’t verify truth. This means they can combine facts incorrectly, invent sources, or incorrectly “remember” information from their training data.
From a business perspective, this highlights a fundamental truth: AI is impressive at sounding confident, but it doesn’t “know” what’s true. Without human verification, the information produced can be misleading or outright wrong.
Another risk is less about AI errors and more about how those errors are encouraged by the way we ask questions.
Prompt bias happens when the assumptions embedded in how we frame a query end up nudging the AI toward a specific (potentially biased) conclusion. It’s a bit like asking a leading interview question: the answer reflects the question, not the reality.
This aligns with a broader understanding of bias in AI, where systems reflect not only patterns in the data but also the assumptions baked into that data and into the prompts themselves. According to Chapman University’s AI Hub, AI systems internalize implicit and explicit biases from both training data and human interaction, which can manifest as misleading or unfair outputs.
For example, a lawyer might ask an AI, “Explain why our interpretation of statute X is correct.” This kind of framing invites the AI to support a narrative rather than objectively analyze the statute. Because the model doesn’t reason like a human but instead predicts the most statistically likely continuation of your input, it will often lean into the assumptions in the prompt, reinforcing a potentially incorrect or one-sided interpretation.
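To make that contrast concrete, here is a minimal sketch in Python. The prompts are just plain strings, since the exact model or API you use doesn’t matter for the point:

```python
# A leading prompt bakes the desired conclusion into the question, so the
# model is nudged toward supporting it rather than analyzing it.
leading_prompt = "Explain why our interpretation of statute X is correct."

# A neutral prompt asks for analysis on both sides and for sources, which
# gives a human reviewer something concrete to verify afterwards.
neutral_prompt = (
    "What are the strongest arguments for and against our interpretation "
    "of statute X? Cite the specific provisions and cases you rely on."
)
```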
This is the danger: once someone thinks they’ve got the answer, they may stop critically examining the assumptions that led there.
Bias emerges from many places: the patterns in the training data, the assumptions baked into our prompts, and the human interaction the model has learned from.
All of these can lead to unsafe or incorrect legal reasoning if unchecked.
AI tools can be extraordinarily helpful for drafting, summarizing, and surfacing insights. But they should never be the final authority on high-stakes decisions like legal analysis.
Here are some best practices:
No AI output, whether it covers accounting, tax, or legal reasoning, should be accepted without expert review. Even tools designed for law can be confidently wrong.
Avoid framing questions in a way that assumes a conclusion. Instead of “Tell me how this supports our case,” use “What are the relevant precedents on this issue?”
For citations and case law, cross-check with authoritative databases. Consider retrieval tools that tie back to original sources and include traceability. A rough sketch of what such a check could look like follows after these practices.
Understanding prompt bias and data bias enables teams to interact with AI more critically rather than passively.
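To make the citation cross-check concrete, here is a minimal sketch in Python. Everything in it is illustrative: `lookup_case` stands in for a query to whatever authoritative legal database your firm licenses, and the regex is only a rough stand-in for real citation parsing. The point is simply that every citation the model produces gets checked, and anything unverified gets routed to a human.

```python
import re

# Stand-in for an authoritative source: in practice this would be a query to
# whatever legal research database your firm already licenses.
VERIFIED_CASES: set[str] = set()  # populate from your research provider


def lookup_case(citation: str) -> bool:
    """Return True only if the citation resolves in an authoritative source.

    This sketch checks an in-memory set; replace it with a real lookup
    against a licensed legal database.
    """
    return citation in VERIFIED_CASES


def flag_unverified_citations(model_output: str) -> list[str]:
    """Return any citations in the model's output that could not be verified.

    The regex is a rough pattern for reporter-style citations (e.g.
    '410 U.S. 113'); real citation formats vary widely, so treat it as
    illustrative rather than exhaustive.
    """
    pattern = r"\b\d{1,4}\s+[A-Z][A-Za-z0-9.\s]{1,20}?\s+\d{1,4}\b"
    citations = re.findall(pattern, model_output)
    return [c for c in citations if not lookup_case(c)]


# Anything flagged here goes to a human expert before it is relied on in a
# filing or client deliverable.
draft = "As held in Smith v. Jones, 123 F.3d 456, the provision applies."
print(flag_unverified_citations(draft))  # -> ['123 F.3d 456']
```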
The topic of how to build provably accurate AI for accountants highlights a critical point: we need AI systems that don’t just sound right, but can be shown to be right within a defined context. That’s especially true in high-integrity domains like law and finance.
At Puzzle we are encouraged by recent AI developments around reasoning models and orchestration layers, but we think the real magic comes from a user experience that is auditable, traceable, and controllable, not from one-shot prompts and answers. The risks are too high.
As we move forward, combining technical rigor, human oversight, and careful prompt design will be essential to harnessing AI safely and responsibly. I am encouraged, and I will be even more so as more accounting firms catch the AI wave.
Note: models and model providers are changing fast. We are seeing massive funding rounds into legal startups like Harvey, which are posting massive revenue growth and net dollar retention, so they clearly seem to be onto something. This is less a judgement on startups and more an attempt to highlight some of the challenges of building companies in serious industries, and of balancing opinionated versus flexible systems.
Want to learn more? puzzle.io/for-accounting-firms





