As large reasoning models such as DeepSeek and o1/o3 continue to make waves, researchers have begun to study their weaknesses. The latest research shows that when facing difficult problems, large reasoning models may switch problem-solving approaches frequently, like "half-hearted students", and fail for lack of in-depth exploration. The researchers call this phenomenon underthinking.
The research team comes from Tencent AI Lab, Soochow University, and Shanghai Jiao Tong University. The main subjects of the study are the open-source DeepSeek-R1 and Qwen QwQ series models.
By analyzing the models' wrong answers, they found that current large reasoning models often start down the correct route early in their thinking, but tend to only scratch the surface and quickly move on to other ideas, so that the thousands of tokens generated afterward contribute nothing to solving the problem.
This "ineffective effort" not only wastes computing resources, but also significantly reduces the accuracy of answers.
"Half-heartedness" is the culprit
This phenomenon is particularly evident when solving more complex tasks such as math competition problems.
For a systematic analysis, the team ran experiments with o1-like models, including QwQ-32B-Preview and DeepSeek-R1-671B, on three challenging test sets: MATH500, GPQA Diamond, and AIME 2024.
The figure below compares token usage and the number of thought switches in correct and incorrect answers. On average, the wrong answers of o1-like models consume 225% more tokens than their correct answers, because thought switching is 418% more frequent.
To analyze this phenomenon in depth, the research team developed an evaluation framework for judging whether the abandoned reasoning paths were actually sufficient to derive the correct answer.
It was observed that many models had correct ideas at the beginning of the answer, but did not proceed further to complete the reasoning.
More than 70% of wrong answers contain at least one correct idea. In addition, in more than 50% of wrong answers, over 10% of the ideas are correct.
In the example below, Thought 1 begins with the correct interpretation, recognizing that the given equations resemble those of ellipses centered at (0, 0) and (20, 11). Setting the two expressions equal is an efficient way to find the common point (x, y) that satisfies both equations.
However, rather than exploring this reasonable idea in depth with further algebraic manipulation and optimization techniques, the model switched ideas frequently, consuming an additional 7270 tokens without arriving at the correct answer.
In the end, it produced a guessed answer that its extended chain-of-thought (CoT) process did not support.
Based on these observations, the researchers proposed a metric, the Underthinking Metric (UT score), to quantify the degree of underthinking.
The metric evaluates reasoning efficiency by measuring how efficiently tokens are used in wrong answers: for each wrong answer, it compares the number of tokens from the start of the answer up to the first correct idea with the answer's total number of tokens.
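The metric described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the researchers' implementation: the `WrongAnswer` record and its field names are hypothetical, the score is taken as one minus the ratio (so that a higher value means more underthinking, which matches how the article interprets UT scores), and a wrong answer with no correct idea is assumed to contribute zero.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class WrongAnswer:
    """One incorrect response (hypothetical record, for illustration only)."""
    total_tokens: int
    # Tokens from the start of the answer through the first correct idea.
    # None means no correct idea was found in this answer.
    tokens_to_first_correct: Optional[int]


def underthinking_score(answers: Sequence[WrongAnswer]) -> float:
    """Average share of tokens spent after the first correct idea.

    Assumption: score = 1 - (tokens up to first correct idea / total tokens),
    averaged over wrong answers. Answers without any correct idea count as 0.
    """
    if not answers:
        raise ValueError("need at least one wrong answer")
    shares = []
    for a in answers:
        used = a.total_tokens if a.tokens_to_first_correct is None else a.tokens_to_first_correct
        shares.append(1.0 - used / a.total_tokens)
    return sum(shares) / len(shares)


# A 1000-token wrong answer whose first correct idea ended at token 200
# spent 80% of its tokens after that idea; the second answer had none.
sample = [WrongAnswer(1000, 200), WrongAnswer(500, None)]
print(round(underthinking_score(sample), 3))  # 0.4
```

Under this reading, a model that finds a correct idea early and then wanders for thousands of tokens scores close to 1, while a model whose wrong answers stay on one line of reasoning scores close to 0.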
Experimental results show that all tested o1-like models underthink significantly. The relationship between model accuracy and underthinking differs across datasets.
On the MATH500-Hard and GPQA Diamond datasets, the stronger DeepSeek-R1-671B model not only achieves higher accuracy but also has a higher UT score, indicating more underthinking in its wrong answers.
This means that although the model is more capable overall, it may generate a longer but less efficient reasoning process under uncertainty, possibly because it explores multiple wrong reasoning paths without converging on the correct answer.
In contrast, on the AIME 2024 test set, DeepSeek-R1-671B not only achieved higher accuracy but also showed a lower UT score, reflecting less underthinking and higher token efficiency.
This shows that the model's reasoning process remains focused and efficient on this task even when it does not arrive at the correct answer. The team said this may be because the model is better aligned with the question types and reasoning processes required by AIME 2024.
Understanding the phenomenon of underthinking is critical to developing models that provide correct answers and have efficient reasoning processes.
How to make AI learn to be "single-minded"
How can a model be made to "settle down and dig in" like an excellent student?
The researchers drew on human test-taking strategies and proposed a "thought switching penalty" (Thought Switching Penalty, TIP).
The principle is similar to setting rules for yourself during an exam: "Focus on the current method first, try it for at least 10 minutes before changing your ideas."
Technically, TIP penalizes the keywords that trigger a thought switch, lowering the probability that these words are generated during decoding and forcing the model to explore its current path for longer.
For example, when the model starts to write "Alternatively, we can consider...", TIP suppresses this premature switch. Two parameters control it: the penalty strength α and the penalty duration β.
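The decoding-time penalty described above might look roughly like the following sketch. It is an illustration under assumptions rather than the researchers' code: the function names are hypothetical, logits are shown as a plain dictionary keyed by token text (real decoders work on tensors indexed by token ID), the set of switch-triggering tokens is made up, and the values of α and β are chosen only for the demonstration.

```python
from typing import Dict, Set


def apply_thought_switch_penalty(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    tokens_in_current_thought: int,
    alpha: float,
    beta: int,
) -> Dict[str, float]:
    """Return a copy of `logits` with switch-triggering tokens penalized.

    Assumption: while the current thought is shorter than `beta` tokens,
    each token in `switch_tokens` has `alpha` subtracted from its logit.
    After `beta` tokens the logits are left unchanged.
    """
    if tokens_in_current_thought >= beta:
        return dict(logits)
    return {
        token: (score - alpha if token in switch_tokens else score)
        for token, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)


# Illustrative next-token scores where the model leans toward switching.
logits = {"Alternatively": 2.0, "so": 1.5, "thus": 1.0}
switch_tokens = {"Alternatively"}

early = apply_thought_switch_penalty(logits, switch_tokens, 10, alpha=3.0, beta=600)
late = apply_thought_switch_penalty(logits, switch_tokens, 700, alpha=3.0, beta=600)
print(greedy_pick(early))  # so
print(greedy_pick(late))   # Alternatively
```

Because the penalty only shifts scores at decoding time, the model's weights stay untouched, and a switch is still possible once the current thought has run for β tokens.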
Experimental results show that adding TIP increases model accuracy on mathematical tests while lowering the UT score, indicating that it not only reduces ineffective switching but also improves the quality of answers.
For example, on the AIME 2024 mathematics competition test, the accuracy of QwQ-32B-Preview with TIP added rose from 41.7% to 45.8%, while its UT score dropped from 72.4 to 68.2.
This "painless upgrade" also requires no retraining of the model, only an adjustment to the decoding strategy, which demonstrates its practical value.
One More Thing
UC Berkeley Professor Alex Dimakis shared similar observations around the same time:
For DeepSeek-R1 and all reasoning models, wrong answers are longer, while correct answers are much shorter.
Based on this, they proposed a simple solution called "concise decoding" (laconic decoding).
Run the model 5 times in parallel and choose the answer with the fewest tokens.
Preliminary experimental results show that concise decoding improves accuracy by 6%-7% on the AIME 2024 test, and is both better and faster than consensus decoding.
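The selection rule behind concise decoding is simple enough to sketch directly. The snippet below is a minimal illustration, not Dimakis's implementation: `generate` is a hypothetical callable standing in for one sampled model run that returns a list of tokens, and the runs are executed one after another here for simplicity, whereas the description above runs them in parallel.

```python
from typing import Callable, List


def laconic_decode(generate: Callable[[], List[str]], num_samples: int = 5) -> List[str]:
    """Sample `num_samples` answers and return the one with the fewest tokens.

    `generate` is a hypothetical stand-in for one sampled model run.
    Ties are resolved in favor of the earliest sample.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    samples = [generate() for _ in range(num_samples)]
    return min(samples, key=len)


# Canned outputs stand in for five sampled runs of a model.
canned = iter([
    ["a"] * 900,
    ["b"] * 350,
    ["c"] * 1200,
    ["d"] * 400,
    ["e"] * 2000,
])
best = laconic_decode(lambda: next(canned))
print(len(best))  # 350
```

The rule relies only on the observation quoted above that wrong answers tend to be longer; it never inspects the content of the answers.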
Paper address: https://arxiv.org/abs/2501.18585
Reference links:
[1] https://x.com/tuzhaopeng/status/1885179412163027406
[2] https://x.com/AlexGDimakis/status/1885447830120362099