Can GenAI do math? A reality check
Don't get rid of your financial calculators just yet...
With all the excitement surrounding generative AI, no doubt many professionals are wondering: Can large language models (LLMs) handle finance? Can they crunch numbers reliably? Can they model earnings or calculate growth rates?
The answer, as we discovered in our own tests inspired by Bradford Levy’s recent paper “Caution Ahead: Numerical Reasoning and Look-ahead Bias in AI Models,” is not yet, and not without human oversight.
But first...
5 things to know
📰 Quote: “Something between 5% and 15% of people are finding a use for [AI Chatbots like ChatGPT] every day…it’s important to remember that if you use five different LLMs every day, and haven’t done a Google search this year, and all your friends are the same… then you’re in a bubble, for now.” [Benedict Evans]
▶️ Video: A day with Claude | Anthropic
🎤 GenAI Prompt: Compile a literature review of the academic research conducted so far that answers these questions: …
🎓 Learning: Generative AI for Executives - AWS Skill Builder. This free course explains the foundational concepts and terminology of generative AI and how to drive business value with generative AI.
📅 Event: AI and the Modern Product Manager: AI-Accelerated PM – Tools & Workflows to Stand Out | University of Maryland - Date: Tuesday, June 10, 2025 at 9:00 AM PT (12:00 PM ET) | Free
Number talk
Generative AI can draft memos, interpret contracts, and even decode PDFs. But when it comes to actual finance math - percentages, ratios, or cash flow modeling - can you trust it?
We ran a diagnostic test across five core financial tasks using Claude, inspired by a recent paper from Bradford Levy. The verdict? Impressive talk, shaky math.
Levy, of the University of Chicago Booth School of Business, offers a firm reality check in “Caution Ahead: Numerical Reasoning and Look-ahead Bias in AI Models.” His core finding: LLMs often appear smart in finance because they memorize patterns, not because they understand principles or can reason with numbers.
To further test whether GenAI can do real finance work, we evaluated Claude on:
Basic arithmetic
Simple interest
Compound interest
P/E ratio calculation
Break-even analysis
Claude failed all five. Results were inconsistent - sometimes wildly off - and even simple scenarios produced wrong answers.
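For reference, none of these five tasks needs an LLM at all; each boils down to a few lines of deterministic code. Here is a minimal Python sketch of each calculation (the input values are illustrative placeholders, not the exact figures from our tests):

```python
# Deterministic versions of the five test calculations.
# Input values below are illustrative, not our actual test figures.

def simple_interest(principal, annual_rate, years):
    """Interest earned at a flat rate: P * r * t."""
    return principal * annual_rate * years

def compound_interest(principal, annual_rate, years, periods_per_year=1):
    """Future value minus principal, compounded n times per year."""
    future_value = principal * (1 + annual_rate / periods_per_year) ** (periods_per_year * years)
    return future_value - principal

def pe_ratio(price_per_share, earnings_per_share):
    """Price-to-earnings ratio."""
    return price_per_share / earnings_per_share

def break_even_units(fixed_costs, price_per_unit, variable_cost_per_unit):
    """Units at which revenue covers fixed plus variable costs."""
    return fixed_costs / (price_per_unit - variable_cost_per_unit)

print(simple_interest(10_000, 0.05, 3))          # ≈ 1500.0
print(compound_interest(10_000, 0.05, 3))        # ≈ 1576.25
print(pe_ratio(150.00, 6.25))                    # 24.0
print(break_even_units(50_000, 25.00, 15.00))    # 5000.0
```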
Lessons learned:
1. LLMs don’t actually possess computational capabilities
LLMs often return the correct answer for simple finance problems, but that doesn’t mean they actually did the math. A key reason: LLMs are not deterministic calculation engines - they predict likely tokens, and their outputs may not be reproducible across reruns. In other words, hallucination and math don’t mix.
Some of our test answers ranged from “close enough” to an order of magnitude (10x) off. Problems involving larger values tended to be more error-prone.
Or, as Levy puts it, “LLMs exhibit extremely poor numerical reasoning…”
He continues: “In my first set of tests, I explore this question by directing LLMs to perform foundational accounting operations: adding and subtracting numbers sampled from annual balance sheets. I find that while the LLMs can add and subtract two numbers, this performance degrades significantly when asked to tally increasingly many numbers, i.e., to near zero accuracy…”
While summarizing other test scenarios, Levy concludes:
“Collectively, my results suggest that LLMs should be applied judiciously within the realm of numerical reasoning involving typical financial statements and that performing actual calculations should be offloaded to traditional tools.”
2. Make sure your test cases are complex
If you, like us, want to evaluate LLMs for your own financial use cases, try a range of test problems, including complex permutations.
Simply asking an LLM to calculate 10% of $100 may mislead you. In an LLM’s training set, the tokens 10%, $100, and $10 (the answer, obviously) very likely appear together, logically connected. So it shouldn’t be a surprise that an LLM returns $10. But don’t assume it actually calculated anything.
Levy notes: “These results suggest that naively handing an LLM a sequence of numbers and expecting the model to carry out basic accounting operations will lead to poor results–as is common in prior accounting literature.”
While LLMs may accurately answer problems containing clean, round numbers (e.g., a 5% return on $400), swap in decimals - like 4.026% or $397.25 - and the training set dries up.
For example, when we prompted it, Claude correctly ‘computed’ an 8% return on $50.55, but incorrectly ‘computed’ an 8.099% return on $8,033.72.
Real-world data isn't neat. AI often can’t cope. So don’t just run a simple test and conclude your LLM can handle real math.
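For the record, the calculation that tripped Claude up reduces to a single multiplication. A quick sanity check in Python:

```python
# The decimal-heavy prompt Claude got wrong, computed directly:
principal = 8_033.72
rate = 0.08099                     # 8.099%
print(round(principal * rate, 2))  # 650.65
```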
3. Good at framing, bad at solving
In our testing, LLMs were strong at translating word problems into formulas and financial frameworks.
Levy suggests a reason: “Prior literature in computer science has generally found that even examples occurring only once in the training data used for LLMs are likely to be memorized by the model.”
So, use the model to draft your problem-solving approach. Then do the math yourself.
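To illustrate the division of labor (our own toy example, not one from Levy’s paper): given “a project costs $12,000 up front and returns $4,500 per year - when does it break even?”, an LLM will reliably produce the framing, payback period = upfront cost ÷ annual cash flow. Plugging in the numbers yourself is the safe part:

```python
# Framing courtesy of the model; arithmetic done ourselves.
payback_years = 12_000 / 4_500   # upfront cost / annual cash flow
print(round(payback_years, 2))   # 2.67
```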
4. Multimodal extraction doesn’t mean reasoning
For one test, we fed an LLM an image of Walmart’s balance sheet and asked it to extract values and perform calculations. Impressively, Claude correctly ‘calculated’ the gross margin % using the gross margin dollars and revenue entries from the image. But when we asked it to recalculate gross margin % after increasing revenue and gross margin $ by a few thousand dollars each, the LLM miscalculated.
LLMs can see the numbers - but they don’t always know what to do with them.
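The underlying calculation is trivial, which makes the failure notable. A sketch of the recalculation (with placeholder figures, not Walmart’s actual line items):

```python
# Gross margin % before and after a small adjustment.
# Figures are placeholders, not Walmart's actual balance sheet values.

def gross_margin_pct(gross_margin_dollars, revenue):
    return gross_margin_dollars / revenue * 100

revenue, margin = 600_000, 150_000
print(round(gross_margin_pct(margin, revenue), 2))                  # 25.0

# Increase both by a few thousand, as in our follow-up prompt:
print(round(gross_margin_pct(margin + 5_000, revenue + 5_000), 2))  # 25.62
```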
5. Leverage coding mode or other systems for more accurate results
Levy notes that using GenAI to generate code that processes data can result in more accurate responses compared to directly asking the LLM chatbot.
“First, while the ability of LLMs to perform calculations on their own is quite poor, they are capable of writing functioning Python code to perform the calculations. Second, researchers may be able to completely avoid the pitfalls associated with LLMs’ poor numerical reasoning abilities by offloading numerically intensive tasks to tools. This is relatively straight-forward to achieve for models trained to support tool use, e.g., Llama 3, as it requires a single modification to the system prompt directing the model to write and execute code and then handling such invocations of the interpreter.”
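As a concrete sketch of what that offloading can look like on the host side (a toy illustration: `llm_generated_code` is hardcoded here to stand in for a model response, and `exec` is not a safe sandbox for untrusted output):

```python
# Toy illustration of offloading arithmetic to the Python interpreter.
# In a real tool-use setup, the model itself would produce this string.
llm_generated_code = "result = 8_033.72 * 0.08099"

namespace = {}
exec(llm_generated_code, namespace)   # demo only - not a security sandbox
print(round(namespace["result"], 2))  # 650.65
```

The model does the part it is good at - translating the problem into code - and the interpreter does the arithmetic.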
We also experimented with ‘Deep Research’ mode on Google’s Gemini model. Interestingly, Gemini informed us it would seek out an online calculator to assist with the computations. This is a fascinating extension of LLMs - and it also confirms Levy’s conclusions, and ours, about their limited computational capabilities. Why seek out a calculator if an LLM could handle the calculations itself?
According to Gemini, it followed this Deep Research process to answer a simple financial-return question (what’s an 8.099% return on $8,033.72?):
Research Websites
(1) Identify the mathematical operation required to determine a percentage return on a principal amount.
(2) Search for a reliable online calculator that can compute percentages.
(3) Access and navigate to a suitable online percentage calculator webpage.
(4) Input the given principal amount ($8,033.72) and the percentage (8.099%) into the calculator fields.
(5) Extract the resulting value displayed by the calculator, which represents the return.
The result? It seemed promising, but Gemini was still ‘thinking’ when we cancelled the request 10 minutes in, with no answer in sight. By comparison, the question would have taken 30 seconds in Excel.
Conclusion
Based on our tests, Levy’s warning is spot-on. Today’s LLMs can support finance work - structuring research, analyzing trends, interpreting visuals - but they remain unreliable at actual math. The workaround: have the LLM draft Python code to perform the calculations indirectly. Or just use Excel.
Adventure on.


