Yesterday, a paper that systematically studies why GPT-4 seems to "get dumber" set off extensive discussion in the AI community. As people use GPT-4 more and more, every so often a wave of users complains that it seems to have gotten dumber again.


The latest case: if the user happens to tell GPT-4 that it is December, the amount of content GPT-4 outputs drops noticeably.

One user ran a dedicated test, telling GPT-4 in one run that it was May and in another that it was December. Comparing the outputs, he found the December results much worse than the May ones.


The running joke in the discussion is that GPT-4 gives itself a winter break and stops wanting to work once December arrives.

According to this paper, however, the main reason is a flaw in large models that currently seems almost unsolvable: they lack the ability to keep learning and evolving.


Paper address: https://arxiv.org/abs/2312.16337

The authors found that LLMs perform significantly better on datasets released before their training data was collected than on datasets released afterward.


This holds in both zero-shot and few-shot testing.

The paper also points out that LLMs perform well on tasks they have actually "seen" before but poorly on new tasks. The underlying reason is that they merely memorize answers and cannot effectively acquire new knowledge and understanding.

The reason this performance gap is so large is "task contamination".


In the table above, the authors found that task examples can be extracted from the GPT-3 models, and that with each new version from davinci to GPT-3.5-turbo the number of extractable training examples increases. This tracks closely with the zero-shot performance gains of the GPT-3 series on these tasks.

Put bluntly, a model performs well on datasets released before its training cutoff because its training data already contains examples from those datasets.

This strongly suggests that the performance gains across versions of the GPT-3 series on these tasks are driven by task contamination.

For those classification tasks where there is no evidence of task contamination, large language models rarely significantly outperform simple majority baselines in zero-shot and few-shot settings.
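
As a rough illustration of what "outperforming a majority baseline" means, here is a minimal Python sketch. The function names and the toy labels are made up for illustration and are not from the paper; the sketch also assumes the majority label is taken from the training split, which the article does not specify.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)


def accuracy(predictions, gold):
    """Fraction of predictions that equal the gold label."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy, made-up labels purely for illustration (not data from the paper).
train_labels = ["neg", "neg", "neg", "pos"]
test_labels = ["neg", "pos", "neg", "neg", "pos"]
model_predictions = ["neg", "pos", "pos", "neg", "neg"]

baseline = majority_baseline_accuracy(train_labels, test_labels)
model_acc = accuracy(model_predictions, test_labels)
print(f"majority baseline: {baseline:.2f}, model: {model_acc:.2f}")
```

In this toy case the model only ties the baseline, which is the kind of outcome the paper reports for uncontaminated tasks. A real comparison would also need a significance test, which this sketch leaves out.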

In the table above, the researchers also report that among the 51 model/dataset combinations where the dataset was released after training data collection and no task examples could be extracted, only 1 significantly outperformed the majority baseline in the zero-shot or few-shot setting.

This shows that once task contamination is ruled out, LLMs' zero-shot and few-shot performance is actually unremarkable.

After reading this, netizens were pessimistic: it is currently difficult to build a machine learning model that can adapt continuously without new knowledge catastrophically interfering with the past knowledge it has already encoded.


ChatGPT is a snapshot of the past Internet: as the Internet changes, ChatGPT becomes outdated both in its knowledge and in its performance on useful tasks.

OpenAI and the other large-model companies alike have to face the fact that they must keep retraining new models.


Perhaps this is part of why people find, every so often, that ChatGPT has gotten dumber again. It may simply be that you keep testing it with new questions, and its true ability is gradually exposed.

Models tested

The researchers tested 12 models:

5 GPT models released by OpenAI and 7 open-source LLMs.


For each model, they selected two groups of datasets for testing: those released before the model's training data was collected and those released after.


Test methods

Chronological analysis

The researchers then tested the performance of the different models on these two groups of datasets. The results make clear that on datasets released after a model's training data cutoff, both zero-shot and few-shot performance are significantly worse.


With 12 models and 16 datasets, the researchers evaluated 192 model/dataset combinations.

In 136 of these combinations the dataset was published before the LLM's training data collection date (pre-collection), and in 56 it was published after (post-collection). For both groups, the researchers calculated the percentage of model/dataset combinations in which the model beats the majority baseline (zero-shot and few-shot).

The results are shown in Figure 1 below. For datasets published before the LLM's training data was collected, the LLM is more likely to beat the majority baseline in both zero-shot and few-shot settings.
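
The pre-collection versus post-collection comparison can be sketched in Python as follows. This is only an illustration under assumptions: the `Result` record, its field names, the model and dataset names, and all dates are hypothetical, and a real analysis would use the paper's own significance criterion to decide `beats_baseline`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    model: str               # hypothetical model name
    dataset: str             # hypothetical dataset name
    dataset_release: date    # when the dataset was published
    training_cutoff: date    # when the model's training data was collected
    beats_baseline: bool     # did the model beat the majority baseline?


def beat_rate_by_split(results):
    """Share of combinations beating the baseline, split into pre/post collection."""
    buckets = {"pre": [], "post": []}
    for r in results:
        split = "pre" if r.dataset_release < r.training_cutoff else "post"
        buckets[split].append(r.beats_baseline)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Made-up records for illustration only (not results from the paper).
results = [
    Result("model-a", "dataset-x", date(2019, 1, 1), date(2021, 9, 1), True),
    Result("model-a", "dataset-y", date(2020, 6, 1), date(2021, 9, 1), False),
    Result("model-a", "dataset-z", date(2022, 3, 1), date(2021, 9, 1), False),
    Result("model-b", "dataset-z", date(2022, 3, 1), date(2021, 1, 1), False),
]

rates = beat_rate_by_split(results)
print(rates)
```

Splitting on the release date relative to each model's own cutoff, rather than on a fixed calendar date, is what lets the same dataset count as pre-collection for one model and post-collection for another.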


The researchers also examined each LLM individually, with the results shown in Figure 2 above. The trend persists across models spanning the full range of dates. This further suggests that the absolute date of a dataset is not the primary factor; what matters more is the dataset's date relative to the date the LLM's training data was collected.

Task example extraction analysis

If an LLM can generate examples that exactly match those in the test data, that is evidence the LLM saw the task's test set during training.

The researchers used a similar approach to test for task contamination. Rather than trying to generate test data, they prompt the model to generate training examples, since in zero-shot or few-shot evaluation the model should not have been trained on any task examples.

If the LLM can generate training examples when prompted, this is evidence of task contamination.

Table 4 below shows the task example extraction results for all tasks in all models.


The researchers further found that for tasks with no demonstrated possibility of task contamination, LLMs rarely show statistically significant improvements over the majority baseline.

In Table 4 above, among the 51 post-collection model/dataset combinations with no extracted task examples, only 1 (i.e. 2%) showed a statistically significant improvement over the majority baseline in the zero-shot or few-shot setting.

Membership inference analysis

To further examine the impact of training data contamination, the researchers applied a membership inference attack, checking whether content generated by the model exactly matched examples in the dataset.


Figures 5a and 5b above show, for the GPT-3 series versions and recent open-source LLMs, how many generated examples exactly match examples in the sampled training set and the complete development set.

Because the database schemas are not included in the zero-shot prompt, if the model can generate exactly the same table or field names as in the training or development data, contamination must have occurred.

As shown in Figure 5, the number of exactly matching generated examples increases over time, indicating that the degree of task contamination on Spider is increasing.
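
The exact-match counting behind this kind of check can be sketched in Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the SQL strings are made up rather than taken from Spider, and normalizing case and whitespace before comparing is an assumption of this sketch.

```python
def normalize(text):
    """Lowercase and collapse whitespace (an assumption of this sketch)."""
    return " ".join(text.lower().split())


def count_exact_matches(generated, dataset_examples):
    """Count distinct generated examples that also appear in the dataset."""
    known = {normalize(example) for example in dataset_examples}
    unique_generated = {normalize(example) for example in generated}
    return len(unique_generated & known)


# Made-up queries for illustration only (not real Spider examples).
dataset_examples = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT count(*) FROM concert",
]
generated = [
    "select name from singer where age > 30",
    "SELECT title FROM album",
    "SELECT  count(*)  FROM concert",
]

matches = count_exact_matches(generated, dataset_examples)
print(matches)
```

Deduplicating the generated examples first keeps a model that repeats one memorized query many times from inflating the count.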

They also computed execution accuracy after adding the schemas to the prompts and plotted it against the number of exact matches (Figure 6). There is a strong positive correlation between the number of exactly matching generated examples and execution accuracy (a correlation coefficient of 0.88), which strongly suggests that increased contamination is associated with improved performance.
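
For readers who want to see how such a correlation is computed, here is a short Python sketch. It assumes a Pearson coefficient, which the article does not state explicitly, and the numbers are made up for illustration; they are not the values behind the reported 0.88.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up values, one pair per hypothetical model version.
exact_matches = [1, 2, 4, 7, 9]
execution_accuracy = [0.20, 0.35, 0.30, 0.55, 0.60]

r = pearson_r(exact_matches, execution_accuracy)
print(f"r = {r:.2f}")
```

A coefficient near 1 means that versions producing more exact matches also tend to score higher. As with any correlation, it shows association rather than proving that contamination causes the gain.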


References:

https://arxiv.org/abs/2312.16337