Don't rush to program with GPT-5; it may not be as powerful as you think. Someone noticed that OpenAI's official SWE-bench Verified result, its headline test of programming ability, was computed on only 477 questions. What does that mean? SWE-bench is a widely used benchmark for evaluating the autonomous programming ability of a model or agent. SWE-bench Verified, a subset of it, contains 500 questions in total.


In effect, OpenAI left out 23 questions on its own and evaluated the model on a self-made "subset" of the subset.

And if those questions are scored as zero, GPT-5's score actually falls below that of Claude Opus 4.1, because the reported gap between the two is only 0.4 percentage points.


This is not the first time that OpenAI has ignored 23 questions.

Back when GPT-4.1 was released, OpenAI stated that these questions were skipped because their solutions could not run on its infrastructure.


Outrageous, friends! Keep in mind that SWE-bench Verified was proposed by OpenAI itself. Its reasoning was that SWE-bench could not systematically evaluate a model's programming ability, so it decided to curate a subset of its own.

Now, because some test questions do not run properly, it has carved out its own "subset" of that subset.

I thought it was outrageous enough that there were chart errors in the GPT-5 live broadcast, but now you tell me that the results may be fake?


OpenAI keeps omitting 23 questions

Some netizens have begun to notice that GPT-5's capabilities are not much better than those of Claude Opus 4.1.

Looking at it now, this official result may have no reference value at all.

Besides finding that OpenAI had "falsified the results" by ignoring some test questions, netizens also noticed that the comparison pits GPT-5 at its highest thinking effort against Opus 4.1 without extended thinking, relying only on the raw model output. Such a comparison is not meaningful.


OpenAI's reason for using only 477 questions is the same as at the GPT-4.1 release: its internal infrastructure could not run the remaining 23 questions.


When GPT-4.1 was released in April this year, it scored 54.6% using only 477 questions on the same benchmark.

OpenAI also pointed out at the time that if the skipped questions were conservatively scored as 0, the 54.6% would become 52.1%. Even so, this value was the highest at the time.
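The conversion from 54.6% to 52.1% can be checked with a few lines of arithmetic. The sketch below assumes the reported score is a plain pass rate over the 477 attempted questions and that every skipped question counts as a failure; the function name `rescale_to_full_set` is made up for this illustration.

```python
def rescale_to_full_set(score_pct: float, attempted: int, total: int) -> float:
    """Convert a pass rate measured on `attempted` questions into a pass rate
    on `total` questions, scoring every skipped question as 0."""
    if not 0 < attempted <= total:
        raise ValueError("attempted must be in the range 1..total")
    # Solved questions stay the same; only the denominator grows.
    return score_pct * attempted / total


# GPT-4.1 figures quoted in the article: 54.6% on 477 of 500 questions.
gpt41_full = rescale_to_full_set(54.6, attempted=477, total=500)
print(f"{gpt41_full:.1f}%")  # 52.1%
```

Under the same assumption, scoring the 23 skipped questions as zero multiplies any reported score by 477/500, which costs 4.6% of its value. A lead of 0.4 percentage points therefore disappears for any reported score above roughly 8.7%.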


Anthropic, for its part, had already noticed what OpenAI was doing.

When Claude Opus 4.1 was released along with its coding results, the announcement ended with the following note.


For the Claude 4 series models, they continue to use the same simple framework, which only equips the models with two tools - a Bash tool and a tool for file editing via string replacement, and no longer includes the third "planning tool" used in Claude 3.7 Sonnet.

And it notes at the end: for all Claude 4 models, the reported scores are based on the full 500 questions, while OpenAI model scores are reported on a subset of 477 questions.


The benchmark was proposed by OpenAI itself

Considering that SWE-bench Verified is a benchmark OpenAI proposed itself, the whole thing is even more outrageous.

Isn't this equivalent to shooting yourself in the foot?


Back then, the motivation was similar: OpenAI's tests found that some SWE-bench tasks could be difficult or even impossible to solve, so SWE-bench could not systematically evaluate a model's autonomous programming ability.

So OpenAI worked with the authors of SWE-bench to produce a new version, hoping to provide a more accurate evaluation.

Together, they launched a manual annotation campaign involving 93 senior programmers to screen each sample of the SWE-bench test set for appropriately scoped unit tests and clearly specified problem descriptions.

They randomly selected 1699 samples and then labeled them based on unified standards.

For example: is the problem description clear? Each annotation carries a label from 0 to 3, in order of increasing severity.

Labels 0 and 1 indicate minor issues; labels 2 and 3 indicate severe ones, meaning the sample is defective in some way and should be discarded.
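The discard rule can be sketched as a simple filter. This is only an illustration, not the annotation pipeline OpenAI used: it assumes each sample ends up with one final label per criterion, and the criterion names (`problem_clarity`, `test_validity`) and sample IDs are hypothetical.

```python
SEVERE_THRESHOLD = 2  # labels 2 and 3 count as severe
VALID_LABELS = {0, 1, 2, 3}


def keep_sample(labels: dict[str, int]) -> bool:
    """Return True if no criterion received a severe label (2 or 3)."""
    for criterion, label in labels.items():
        if label not in VALID_LABELS:
            raise ValueError(f"invalid label {label!r} for {criterion!r}")
    return all(label < SEVERE_THRESHOLD for label in labels.values())


# Hypothetical annotations: one final label per criterion per sample.
samples = {
    "sample-a": {"problem_clarity": 0, "test_validity": 1},
    "sample-b": {"problem_clarity": 2, "test_validity": 0},
    "sample-c": {"problem_clarity": 1, "test_validity": 3},
}

kept = [sample_id for sample_id, labels in samples.items() if keep_sample(labels)]
print(kept)  # ['sample-a']
```

How labels from several annotators were combined into a final label is not described here, so the sketch leaves that step out.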


Additionally, they assessed the difficulty of each sample by asking annotators to estimate how long it would take a developer to identify and implement a solution.

Finally, 500 verified samples were obtained, and the data set was subdivided according to difficulty. The “easy” subset contains 196 repair tasks that take less than 15 minutes, while the “hard” subset contains 45 tasks that take longer than 1 hour.

And yet this subset has now been trimmed further by OpenAI.

One More Thing

However, there is still a general leaderboard that may be worth consulting: the original SWE-bench.

On this leaderboard, Claude 4 Opus still holds the lead.