A recent study from Washington State University in the United States shows that when faced with complex scientific claims, the large language model ChatGPT often "guesses" the answer, even though its responses sound very confident. Not only is its accuracy limited, but its answers to the same question are also inconsistent, and it has particular difficulty identifying false statements.

The research was led by Mesut Cicek, an associate professor in the Department of Marketing and International Business at the Washington State University College of Business. He and his team extracted a large number of hypotheses from scientific research papers and repeatedly submitted them to ChatGPT, asking it to judge whether each one was supported by existing research. In essence, they asked the AI to make true-or-false judgments. The researchers selected a total of 719 research hypotheses from papers published in business journals since 2021 and submitted each hypothesis to ChatGPT 10 times to examine the consistency of its answers.
In the first experiment, conducted in 2024, ChatGPT's answers were nominally correct 76.5% of the time; when the experiment was repeated in 2025, that figure rose slightly to 80%. However, after statistically adjusting the results for random guessing, the research team found that the model performed only about 60% better than chance, that is, than answering by tossing a coin, which is far from reliable. In the researchers' view, that is closer to a low D grade. ChatGPT was particularly weak at identifying false statements, correctly judging only 16.4% of them.
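The article does not spell out the exact adjustment the researchers applied. As a rough illustration only, the sketch below uses a common correction for guessing, which rescales accuracy so that the chance rate maps to 0 and a perfect score maps to 1. The function name and the 50% baseline for a true-or-false task are assumptions, not details taken from the study.

```python
def chance_adjusted_accuracy(raw_accuracy: float, chance_rate: float = 0.5) -> float:
    """Rescale accuracy so that chance maps to 0.0 and a perfect score to 1.0.

    Assumed formula (a standard correction for guessing, not necessarily
    the one used in the study): (raw - chance) / (1 - chance).
    """
    if not 0.0 <= chance_rate < 1.0:
        raise ValueError("chance_rate must be in [0, 1)")
    return (raw_accuracy - chance_rate) / (1.0 - chance_rate)


# Raw accuracies reported in the article for the two experiments.
for year, raw in ((2024, 0.765), (2025, 0.80)):
    adjusted = chance_adjusted_accuracy(raw)
    print(f"{year}: raw {raw:.1%}, chance-adjusted {adjusted:.1%}")
```

Under this assumed formula, the 80% raw figure corresponds to a chance-adjusted score of 60%, which is in line with the roughly 60% the article reports, although the study's own calculation may differ.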
Consistency is also a prominent problem. Even when the same question is repeated with exactly the same prompt, ChatGPT does not always reach the same conclusion. Cicek noted that across 10 repeated queries, the model gave consistent answers only about 73% of the time. For some hypotheses, the 10 answers alternated between "true" and "false," and in extreme cases they were split evenly, half true and half false.
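The article does not say how the 73% consistency figure was defined. The sketch below shows two plausible ways to score repeated answers to one hypothesis: the share of answers that match the majority answer, and whether all answers agree. The function names and the sample data are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(answers: Sequence[bool]) -> float:
    """Share of repeated answers that match the most common answer."""
    if not answers:
        raise ValueError("answers must not be empty")
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def is_fully_consistent(answers: Sequence[bool]) -> bool:
    """True only if every repeated answer is identical."""
    return len(set(answers)) == 1


# Hypothetical results: 10 true/false answers per hypothesis.
runs = {
    "H1": [True] * 10,            # same answer every time
    "H2": [True, False] * 5,      # alternating, half true and half false
    "H3": [False] * 9 + [True],   # one dissenting answer
}

for name, answers in runs.items():
    print(name, agreement_rate(answers), is_fully_consistent(answers))

# Share of hypotheses whose 10 answers all agree.
fully_consistent_share = sum(is_fully_consistent(a) for a in runs.values()) / len(runs)
print(f"fully consistent: {fully_consistent_share:.0%}")
```

The two measures can diverge sharply: a hypothesis with nine matching answers out of ten scores 0.9 on agreement but still counts as inconsistent under the all-or-nothing definition, so any reported consistency figure depends on which definition is used.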
The authors of the study, published in Rutgers Business Review, believe the results highlight the need for extreme caution when using generative AI in important decision-making areas, especially those involving complex reasoning and nuance. Cicek emphasized that current large language models can answer questions in very fluent and persuasive language, but this does not mean that they truly understand. “Existing AI tools don’t understand the world in the same way that humans do; they don’t really have a ‘brain,’” he said. “They’re mostly memorizing and matching, which can provide some insight, but don’t really know what they’re talking about.”
As for the method, Cicek carried out the study in collaboration with Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University. The 719 research hypotheses were drawn from business journal articles. Such hypotheses often involve multiple variables, and judging whether a study "supports" a given hypothesis is itself a highly complex reasoning process. Compressing this complexity into a simple yes-or-no judgment is a severe test of the tool's understanding and reasoning ability.
Notably, the team tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. The two generations of the model performed similarly on this task overall. After adjusting for random guessing, the model did only about 60% better than the 50% chance baseline in both experiments.
The study further pointed out that there is a significant gap between the language fluency of large language models and their actual reasoning ability. These systems can produce well-structured, naturally worded, and persuasive text, but they often struggle with deeper logical judgments, weighing evidence, and identifying misinformation, which can result in answers that sound right but are actually flawed.
Based on these findings, the researchers recommend that business managers and decision-makers always verify the output and maintain a healthy skepticism when using generative AI tools such as ChatGPT. They also called for more user training within organizations to help employees understand the strengths and limitations of such tools and avoid treating them as "authoritative" substitutes for professional judgment. Cicek pointed out that although this study focused on ChatGPT, other similar AI systems performed roughly the same in related tests. The work also builds on earlier research into the overhyping of AI. For example, a 2024 national survey showed that when companies emphasized “powered by AI” in their marketing, it actually reduced some consumers' intention to buy.
“No matter what, be skeptical,” Cicek said. “I’m not against AI, I use it myself, but you have to be very careful with it.”