GPT-4 doesn’t know when it’s wrong: a new LLM flaw exposed, with a self-correction success rate of only 1%

Does GPT-4 not even know when it has made a mistake? The latest research finds that in reasoning tasks, self-correction does not rescue an LLM's performance and can even make it worse. This newly exposed flaw in large models caught the attention of two prominent AI figures, Yann LeCun and Gary Marcus, who both shared the work.

In one reasoning experiment, self-correction, a technique claimed to improve accuracy, instead "improved" it from 16% to 1%!

Put simply, an LLM cannot improve its output through self-correction in reasoning tasks unless it already knows the correct answer during the self-correction process.

Two papers published by ASU researchers refute the "self-correction" approach proposed in many previous studies, namely the claim that letting large models correct their own output improves its quality.

Paper address: https://arxiv.org/abs/2310.12397

Paper address: https://arxiv.org/abs/2310.08118

Professor Subbarao Kambhampati, a co-author of the papers, has long studied the reasoning capabilities of AI. In September he published a paper that flatly rejected the idea that GPT-4 can reason and plan.

Paper address: https://arxiv.org/pdf/2206.10498.pdf

Beyond Kambhampati's group, researchers from DeepMind and UIUC have also recently questioned the ability of LLMs to "self-correct" in reasoning tasks.

That paper even calls on scholars doing related research to be rigorous: do not tell the large model the correct answer and then let it perform so-called "self-correction."

The reason is that if the model does not know the correct answer, its output quality decreases after it "self-corrects".

Paper address: https://arxiv.org/abs/2310.01798

Next, let’s take a closer look at these two latest papers.

GPT-4 "self-corrects" and its output gets worse

The first paper studies GPT-4: it has GPT-4 propose solutions to graph coloring problems and then "self-correct" the solutions it proposed.

At the same time, the authors introduce an external evaluation system to score both GPT-4's direct output and its output after the "self-correction" loop.

The experiments show that GPT-4 colors fewer than 20% of the graphs correctly, which is not especially surprising.

What is surprising is that accuracy drops significantly in "self-correction" mode (second bar in the image below), the exact opposite of what self-correction is meant to achieve!

The authors argue that this seemingly counter-intuitive result has a simple explanation: GPT-4 is also very poor at verifying answers!

Even when GPT-4 happens to guess a correct coloring, its "self-correction" convinces it that something is wrong with the answer, and it then replaces the correct answer with an incorrect one.

Further experiments show that GPT-4 does improve its solution when an external verifier provides provably correct feedback on the coloring it guessed.

In this case, the backprompts produced in the "self-correction" loop do improve the quality of the output (bars 3 to 5 in the above figure).

To sum up, on the graph coloring task, GPT-4's unaided "self-correction" actually harms performance, because GPT-4 cannot verify whether an answer is correct.

But if a correct external verification process can be provided, the "self-correction" generated by GPT-4 can indeed improve performance.

Another paper studied the "self-correction" ability of large language models from the perspective of planning tasks, and the research results were similar to the previous paper.

Moreover, the researchers found that what really improved output accuracy was not the LLM's "self-correction" but the feedback from an external, independent verifier.

Ultimately, an LLM cannot verify answers on its own and must rely on the "correct answer" given by an external verifier in order to "self-correct" effectively.

"Coloring problem" performs poorly, LLM cannot independently verify correct answer

Research design

The graph coloring problem is a classic reasoning problem. It is not especially difficult, yet its answers are diverse enough, and their correctness is easy to verify.

This diversity makes it hard for LLM training data to cover every instance, which reduces the risk that the results are contaminated by training data.

These reasons make the "coloring problem" very suitable for studying the reasoning ability of LLM, and it is also convenient for studying the ability of LLM to "self-correct" in reasoning.

The researchers built their own dataset, using GrinPy to handle common graph operations. Each graph was constructed using the Erdős–Rényi method (p = 0.4).

Once a suitable graph is found, it is compiled into standard DIMACS format and annotated with a comment containing its precomputed chromatic number.

For the following experiments, the researchers generated 100 instances, each with an average of 24 edges, spread over node counts from 10 to 17, a distribution chosen because experience showed it to be a range with sufficiently variable behavior.
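As a rough illustration of this kind of dataset construction, here is a minimal sketch that samples an Erdős–Rényi graph with p = 0.4 and serializes it in DIMACS edge format. It uses only the Python standard library rather than GrinPy, and the function names, the comment wording, and the seed are assumptions made for the example. It does not reproduce any additional filtering the authors may have applied, and it leaves the chromatic number to be supplied by the caller.

```python
import random

def erdos_renyi_edges(n, p, rng):
    """Return the edge list of a G(n, p) random graph on vertices 0..n-1."""
    return [(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p]

def to_dimacs(n, edges, chromatic_number=None):
    """Serialize a graph in DIMACS edge format (vertices are 1-indexed)."""
    lines = []
    if chromatic_number is not None:
        # Hypothetical comment wording; DIMACS only requires the leading "c".
        lines.append(f"c chromatic number: {chromatic_number}")
    lines.append(f"p edge {n} {len(edges)}")
    lines.extend(f"e {u + 1} {v + 1}" for u, v in edges)
    return "\n".join(lines)

rng = random.Random(0)          # fixed seed, chosen arbitrarily
n = rng.randint(10, 17)         # node counts in the range the article mentions
edges = erdos_renyi_edges(n, 0.4, rng)
print(to_dimacs(n, edges))
```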

The pipeline used by the researchers is shown in Figure 1 below. It includes the LLM's first reply, the backprompts generated in response to that reply, and the final correct coloring.

Schema of iterative backprompting
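The iterative backprompting loop can be sketched roughly as follows. This is an illustration and not the authors' code: `generate` and `verify` are hypothetical callables standing in for the LLM and for whichever verifier is plugged in (the LLM itself or an external sound checker), and the round limit and the backprompt wording are arbitrary assumptions.

```python
def backprompt_loop(generate, verify, prompt, max_rounds=5):
    """Ask for an answer, then re-prompt with feedback until the verifier accepts.

    generate(prompt) -> answer string
    verify(answer)   -> (accepted: bool, feedback: str)
    Returns the last answer and the number of backprompts that were sent.
    """
    answer = generate(prompt)
    for round_no in range(max_rounds):
        accepted, feedback = verify(answer)
        if accepted:
            return answer, round_no
        # Hypothetical backprompt wording: previous answer plus verifier feedback.
        prompt = (f"{prompt}\n\nPrevious answer:\n{answer}\n"
                  f"Feedback:\n{feedback}\nPlease try again.")
        answer = generate(prompt)
    return answer, max_rounds

# Stub "LLM" that gets it right on the third attempt, with a stub verifier.
attempts = iter(["bad", "bad", "good"])
answer, rounds = backprompt_loop(
    lambda p: next(attempts),
    lambda a: (a == "good", "vertices 1 and 2 share a color"),
    "Color the graph.",
)
print(answer, rounds)  # good 2
```

Note that the loop stops as soon as the verifier accepts, so an unreliable verifier can end the loop on a wrong answer or keep rejecting a right one. The final answer still has to be scored by a sound external check.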

Prompt generator:

This prompt generator takes a DIMACS instance and translates each edge into a sentence, then wraps the whole in a set of common instructions to construct a natural language prompt.

The researchers intentionally minimize the differences between prompts for different instances, to reduce the amount of problem-specific information leaked to the LLM. Examples of the various types of prompts can be found in the paper's appendix.
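A prompt generator of this kind might look like the sketch below. The instruction text and sentence template are invented for illustration (the actual prompts are in the paper's appendix), and `dimacs_to_prompt` is a hypothetical name.

```python
def dimacs_to_prompt(dimacs_text):
    """Turn a DIMACS edge-format graph into a natural language prompt."""
    num_vertices = 0
    sentences = []
    for line in dimacs_text.splitlines():
        parts = line.split()
        if not parts or parts[0] == "c":
            continue  # skip blank lines and comments
        if parts[0] == "p":
            num_vertices = int(parts[2])  # header: "p edge <vertices> <edges>"
        elif parts[0] == "e":
            sentences.append(f"Vertex {parts[1]} is connected to vertex {parts[2]}.")
    # Hypothetical shared instructions, identical for every instance.
    instructions = (
        "Color the following graph so that no two connected vertices "
        "share a color. Answer with one line per vertex, formatted as "
        "'<vertex number>: <color>'."
    )
    return "\n".join([instructions, f"The graph has {num_vertices} vertices.", *sentences])

print(dimacs_to_prompt("p edge 3 2\ne 1 2\ne 2 3"))
```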

Large language model:

GPT-4, currently the state-of-the-art model, is called through the OpenAI API.

The researchers provide a system role: "You are a constraint satisfaction solver that solves various CSPs (constraint satisfaction problems)".

Backprompt generation:

In verification mode, the LLM receives a different type of prompt.

In addition to the standard instructions, it contains only a description of the graph and a proposed coloring. Its task is to verify correctness, optimality, and whether every vertex has been assigned a color.

If the coloring is incorrect, the LLM must list a set of contradicting edges.

As a point of comparison, the researchers also built a verifier that is guaranteed to be correct and that lists every contradicting edge.

Since LLM responses are also in natural language, the researchers first translate them into a format that is easy to analyze. To make this process more consistent, the initial prompts describe the precise output format the model must follow. The response is then evaluated for correctness.
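Translating a reply into an analyzable form could be as simple as the following sketch. The "vertex: color" line format is an assumption made for this example, not necessarily the format the paper prescribes, and any line that does not match is ignored.

```python
import re

# Assumed output format: one "<vertex number>: <color>" pair per line.
LINE_RE = re.compile(r"^\s*(\d+)\s*:\s*(\w+)\s*$")

def parse_coloring(reply):
    """Extract a {vertex: color} mapping from a free-text model reply."""
    coloring = {}
    for line in reply.splitlines():
        match = LINE_RE.match(line)
        if match:
            coloring[int(match.group(1))] = match.group(2)
    return coloring

print(parse_coloring("Here is my answer:\n1: red\n2: blue\n3: red"))
# {1: 'red', 2: 'blue', 3: 'red'}
```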

Verification

To gain more insight into the verification capabilities of LLMs, the researchers studied how well GPT-4 finds errors in proposed colorings.

Intuitively, these errors should be easy to spot: if the two vertices of an edge share a color, return that edge. Algorithmically, all that is needed is to iterate through the edges and compare the color of each vertex with the color of the vertex at the other end.
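The check described here fits in a few lines. The sketch below is a simple sound verifier written for illustration (names such as `verify_coloring` are made up). It reports conflicting edges, unassigned vertices, and whether the number of colors used stays within a supplied chromatic number, assuming vertices are numbered from 1.

```python
def verify_coloring(num_vertices, edges, coloring, chromatic_number):
    """Check a proposed coloring for conflicts, missing vertices, and optimality."""
    # An edge is a conflict when both endpoints are colored and share a color.
    conflicts = [(u, v) for u, v in edges
                 if u in coloring and v in coloring and coloring[u] == coloring[v]]
    missing = [v for v in range(1, num_vertices + 1) if v not in coloring]
    colors_used = len(set(coloring.values()))
    return {
        "conflicts": conflicts,
        "missing": missing,
        "optimal": colors_used <= chromatic_number,
        "valid": not conflicts and not missing,
    }

edges = [(1, 2), (2, 3), (1, 3)]  # a triangle
report = verify_coloring(3, edges, {1: "red", 2: "green", 3: "red"}, chromatic_number=3)
print(report["conflicts"])  # [(1, 3)]
```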

The researchers used the same analysis pipeline but constructed a new domain called color_verification. The LLM is directed to check the coloring for correctness, optimality, and whether each vertex has been assigned a color.

If the coloring is incorrect, it is instructed to list the errors, that is, to return each edge whose two connected nodes share a color. No backprompts are given.

The researchers used the same graph instances as before, but generated five kinds of coloring schemes for testing the model:

Correct: An error-free optimal coloring generated by an iterative, randomized greedy algorithm (using the precomputed chromatic number to ensure optimality).

Ablated: Takes a coloring from the correct set and changes a random node to the color of one of its neighbors.

Non-optimal: Takes a coloring from the correct set and partially recolors a randomly chosen color class with a new color.

Random: Colors assigned completely at random, with the number of distinct colors equal to the chromatic number of the graph.

LLM: A coloring scheme randomly selected from the output generated by the LLM in the previous experiment.
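To make two of these categories concrete, the sketch below builds an "ablated" and a "random" coloring. It is a simplified illustration with made-up function names, not the authors' generator. The random variant draws from a palette of the given size without guaranteeing that every color is used, and the other categories are omitted.

```python
import random

def ablate(coloring, edges, rng):
    """Copy a proper coloring and give one random node a neighbor's color."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)
    node = rng.choice(sorted(neighbors))       # only nodes that have a neighbor
    result = dict(coloring)
    result[node] = coloring[rng.choice(neighbors[node])]
    return result

def random_coloring(vertices, num_colors, rng):
    """Assign each vertex a color drawn uniformly from num_colors colors."""
    return {v: rng.randrange(num_colors) for v in vertices}

rng = random.Random(0)
edges = [(1, 2), (2, 3)]
correct = {1: 0, 2: 1, 3: 0}                   # a proper 2-coloring of a path
print(ablate(correct, edges, rng))
print(random_coloring([1, 2, 3], 2, rng))
```

Because the input to `ablate` is a proper coloring, copying a neighbor's color always introduces at least one conflicting edge.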

Results

With no backprompts at all, the LLM was prompted once per instance and its answer evaluated before moving on to the next instance, giving a baseline score of 16%.

When the researchers ran the same instances but used feedback generated by the same language model, acting as verifier, for the backprompts, performance dropped dramatically: only one of the 100 instances was answered correctly.

Backprompting with an external, sound verifier may at first appear more effective.

The share of instances answered correctly comes close to 40%, but if this meant that GPT-4 was listening to, improving on, and reasoning from the feedback, the researchers would expect more accurate backprompts to yield better results.

However, in this domain, the raw scores (see Figure 2 above) do not bear this out.

LLM’s verification capabilities

The researchers tested GPT-4's ability to verify graph coloring schemes on the same instances, generating five different types of coloring schemes for each instance.

The clearest result matches the LLM self-correction result above: the model is reluctant to mark almost any answer as correct. Out of 100 optimal coloring schemes, it agreed that only 2 were correct.

Of the full set of 500 coloring schemes, 118 are correct, but the model claims that only 30 are correct. Of those 30, only 5 are actually correct.
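These counts can be turned into standard classification metrics. The short calculation below is derived only from the four numbers quoted above, treating "the model says correct" as the positive prediction. The derived figures are this article's arithmetic, not values reported in the paper.

```python
total = 500             # coloring schemes shown to the model
actually_correct = 118  # schemes that really are correct
claimed_correct = 30    # schemes the model labeled correct
true_positives = 5      # labeled correct and really correct

false_positives = claimed_correct - true_positives            # 25
false_negatives = actually_correct - true_positives           # 113
true_negatives = total - actually_correct - false_positives   # 357

precision = true_positives / claimed_correct
recall = true_positives / actually_correct
accuracy = (true_positives + true_negatives) / total

print(f"precision = {precision:.1%}")  # precision = 16.7%
print(f"recall    = {recall:.1%}")     # recall    = 4.2%
print(f"accuracy  = {accuracy:.1%}")   # accuracy  = 72.4%
```

The overall accuracy looks respectable only because most schemes are incorrect and the model rejects nearly everything. The recall on correct colorings is what undermines self-correction, since correct answers almost never survive review.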

Overall, this pattern holds. In fewer than 10% of cases, the LLM responded with "correct", "non-optimal" or "missing assignment", and in those cases its behavior appears somewhat random.

In about a quarter of the instances where it responds that the coloring is not correct, its explanation matches reality, and it achieves this by naming no more than one edge, which minimizes the chance of misstating something.

The results are shown in Table 2 above. Note that as the error rate of the domain increases, the hallucination proportion decreases. That is, when there are more incorrect edges, the model is more likely to pinpoint the errors in them.

LLM self-critique: performance decreases instead of increasing

In the paper submitted on the 12th, the authors reached a conclusion consistent with the above.

Whether the task is planning, simple arithmetic, or logic, GPT-4, the most advanced large model at present, is not fully competent.

Many researchers have explored ways to improve it, including letting LLMs use self-iteration, self-verification, and similar strategies to boost performance.

As a result, many in the industry are optimistic that large models can still be rescued!

However, the classical complexity of a reasoning task is irrelevant to a large model's performance, because an LLM performs approximate retrieval rather than precise reasoning.

In a paper submitted to arXiv on the 12th, ASU researchers systematically evaluated and analyzed LLM's self-criticism and iterative optimization capabilities in planning tasks.

In the study, the author proposed a planning system containing a generator LLM and a verifier LLM.

Among them, the GPT-4 generator is responsible for generating candidate plans, and the GPT-4 verifier is responsible for verifying the correctness of the plan and providing feedback.

The researchers then conducted experiments on the Blocksworld planning domain and empirically evaluated:

- The impact of self-critique on the plan generation performance of the overall LLM+LLM system;

- The performance of the verifier LLM relative to ground-truth verification;

- The impact of the level of feedback detail on overall system performance when critiquing the LLM's generations.

The results show that self-critique reduces LLM plan generation performance compared with using a reliable external verifier.

The performance degradation can be attributed directly to the poor results of the verifier LLM, which produces a large number of false positives and thereby seriously damages the reliability of the system.

The binary classification accuracy of the verifier LLM is only 61%, and there are a large number of false positives (wrong plans judged to be correct).

In addition, a comparison of feedback at different levels of detail found that the level of detail has little impact on plan generation performance.

Overall, this systematic investigation provides preliminary evidence that calls into question the effectiveness of LLMs as verifiers for planning tasks within an iterative, self-critiquing framework.

About the author

Subbarao Kambhampati

Subbarao Kambhampati is a professor of computer science at Arizona State University. Kambhampati studies fundamental problems in planning and decision-making, motivated in particular by the challenges of human-aware AI systems.

References:

https://twitter.com/rao2z/status/1715800819239678013

https://twitter.com/GaryMarcus/status/1715804178470387736