Princeton University’s surprising discovery: GPT-4’s success rate in solving real GitHub programming problems is 0%

AI coding tools like ChatGPT are advancing aggressively, and StackOverflow is laying off employees again! Yet researchers at Princeton and the University of Chicago found that when faced with real-world GitHub issues, GPT-4’s resolution rate was actually 0%. StackOverflow has been hit hard by ChatGPT! Because coders are flocking to ChatGPT and GitHub Copilot in large numbers, StackOverflow has had no choice but to announce today that it is laying off more than 100 employees, almost one-third of its staff.

So, are AI coding tools like ChatGPT really going to subvert the entire industry?

However, a recent study by Princeton and the University of Chicago found that it is not that easy for LLMs to replace coders.

Paper: https://arxiv.org/abs/2310.06770

Faced with 2,294 real GitHub issues, GPT-4's pass rate on randomly sampled issues was actually 0%!

Even the best-performing model, Claude 2, resolved only 1.96% of them.

Will coders lose their jobs because of ChatGPT? The answer: absolutely not, at least for now.

Adapt or perish

As the go-to coding help site for developers around the world, StackOverflow had been doing well. It went on a hiring spree last year, doubling its headcount to 540.

However, everything has changed since OpenAI released ChatGPT last November.

The help an AI chatbot provides is more specific than a forum post from five years ago. With an LLM, developers can instantly get corrected code, optimization suggestions, and an explanation of what each line of code does.

Although the answers an LLM provides are not 100% reliable, code has the unique advantage that it can be verified immediately, simply by testing it in an IDE (integrated development environment). All this makes writing code an ideal use case for ChatGPT.

As a result, StackOverflow's traffic has dropped sharply, and AI programming tools such as ChatGPT and the GPT-4-driven GitHub Copilot have become the new go-to places for coders.

Today, CEO Prashanth Chandrasekar announced that StackOverflow will lay off more than 100 employees, or 28% of its workforce.

The CEO's explanation for the layoffs is that under macroeconomic pressure, StackOverflow is working hard to get on the road to profitability and constantly launching product innovations.

Burning bridges?

The biggest irony of ChatGPT’s impact on StackOverflow is that the powerful capabilities of large language models largely come from crawling websites like StackOverflow.

Large language models suck up this data without giving anything back. What will happen if all of these data sources are forced out of business?

Many technology companies already face a looming problem: if there are fewer programmers, there will be less human-generated data.

How can new AI models be trained without up-to-date data?

Want to use our data? Pay up

Of course, StackOverflow is not going to sit still and wait to die. It has chosen two ways to save itself:

One is to develop its own AI coding tool OverflowAI, and the other is to directly seek cooperation with technology companies such as OpenAI, because these companies will use StackOverflow data to build AI models.

OpenAI is reportedly developing web crawler controls for ChatGPT so that data from sites like StackOverflow is not crawled.

The CEO said StackOverflow has made its position clear: anyone who wants to use its data to train LLMs will have to pay.

He believes that sites like StackOverflow are critical to the development of large language models, which need to be trained on new knowledge in order to keep improving.

LLMs want to replace coders, but it’s still too early

So, can large language models really beat coders?

A team from Princeton and the University of Chicago discovered that it’s not that easy!

In their latest paper, the researchers propose a new benchmark, SWE-bench, which evaluates the ability of large models to resolve 2,294 real GitHub issues.

They found that even leading large models such as GPT-4 and Claude 2 resolve fewer than 5% of these real-world issues.

More specifically, GPT-4's pass rate on randomly sampled GitHub issues was 0%, and the best model, Claude 2, resolved only 1.96% of them.

More notably, when BM25 was used to retrieve the relevant code files for each issue, only 23% of the patches written by Claude 2 were valid (that is, could be applied to the repo), and only ~1% actually resolved the issue.

In addition, performance varies from model to model across issues from the 12 popular Python repositories.

That GPT-4 achieved such a result is really surprising. After all, many people already regard it as their go-to "programming tool."

Still, it pays to see AI’s true strength clearly rather than be alarmed by inflated leaderboard scores.

Some netizens said that this is the best answer to the question "Will AI programming put coders out of work?"

Finally someone made a real eval dataset for code models. HumanEval is just LLM's LeetCode interview. We all know this is a false metric for human engineers. Less than 4% sounds about right, as the big models are still far from fully autonomous.

So, do SWE-bench's results really reflect the true abilities of large models?

SWE-bench: designed for coding models

In this study, the authors found that many current benchmarks for evaluating the coding capabilities of large models have become saturated and can no longer measure the models' true strength.

In HumanEval, for example, the challenge problems are too simple: an LLM needs only a few lines of code to solve a self-contained problem.

Real-world software engineering, however, is not that simple.

Fixing a bug may require navigating a huge repository, understanding the relationships between functions in different files, or finding a small error in intricate code.

Inspired by this, researchers from Princeton and the University of Chicago introduced SWE-bench.

SWE-bench draws its task instances from real Python repositories by linking GitHub issues to the merged pull requests that resolved them, along with the related tests.

As shown in the paper, the model's task is to resolve an issue (usually a bug report or feature request) submitted to a GitHub repository.

Each task requires generating a patch that describes the changes to apply to the existing code base.

The modified code base is then evaluated using the repository's own testing framework.

To find high-quality task instances at scale, the researchers used a three-stage filtering pipeline:

Stage 1: Repository selection and data scraping.

The researchers first collected pull requests (PRs) from 12 popular open source Python repositories on GitHub, yielding approximately 90,000 PRs in total.

They focused on popular repositories because these tend to be better maintained, have clear contributor guidelines, and have better test coverage. Each PR has an associated code base, which is the state of the repository before the PR was merged.

Stage 2: Attribute-based filtering.

Candidate tasks are created by selecting merged PRs that (1) resolve a GitHub issue and (2) modify the repository's test files, which indicates that the contributor likely added tests to check whether the issue has been resolved.

Stage 3: Execution-based filtering.

For each candidate task, the PR's test changes are applied, and the relevant test results are recorded before and after the rest of the PR's changes are applied.

The researchers filter out task instances that do not have at least one test whose status changes from fail to pass (hereinafter "fail-to-pass tests"). Instances that cause installation or runtime errors are also filtered out.

Through these stages of screening, the original 90,000 PRs were filtered into 2,294 task instances, which constituted SWE-bench.
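The second and third filtering stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the researchers' actual pipeline: the `CandidatePR` record, its field names, and the helper functions are all hypothetical, and test outcomes are assumed to have been recorded already as "pass" or "fail" strings.

```python
from dataclasses import dataclass, field


@dataclass
class CandidatePR:
    """Hypothetical record for one scraped pull request (illustrative fields only)."""
    pr_id: str
    merged: bool
    resolves_issue: bool
    modifies_tests: bool
    runs_cleanly: bool = True  # False if installation or execution raised errors
    tests_before: dict = field(default_factory=dict)  # test name -> "pass"/"fail" before the fix
    tests_after: dict = field(default_factory=dict)   # test name -> "pass"/"fail" after the fix


def passes_attribute_filter(pr):
    """Stage 2: keep merged PRs that resolve an issue and touch test files."""
    return pr.merged and pr.resolves_issue and pr.modifies_tests


def fail_to_pass_tests(pr):
    """Names of tests that failed before the fix and pass after it."""
    return sorted(
        name
        for name, status in pr.tests_before.items()
        if status == "fail" and pr.tests_after.get(name) == "pass"
    )


def passes_execution_filter(pr):
    """Stage 3: require a clean run and at least one fail-to-pass test."""
    return pr.runs_cleanly and len(fail_to_pass_tests(pr)) > 0


def select_task_instances(prs):
    """Apply both filters in order and return the surviving PRs."""
    return [
        pr for pr in prs
        if passes_attribute_filter(pr) and passes_execution_filter(pr)
    ]
```

A PR whose tests already passed before the fix, or one that never touched a test file, would be dropped by this sketch, mirroring the criteria described above.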

Figure 3 shows the final distribution of these task instances across repositories, and the accompanying table lists the main characteristics of the SWE-bench task instances.

The researchers emphasized that these code bases are large, containing thousands of files, and that the reference pull requests often modify multiple files at once.

Compared with existing LM programming benchmarks, SWE-bench has several advantages.

These include leveraging real-world settings for user-submitted problems and solutions, diverse input featuring unique code issues from 12 repositories, a powerful execution-based evaluation framework, and the ability to continuously update the benchmark with new instances with minimal human intervention.

LLM tasks: edit the code base and solve problems

The researchers give the large model a textual description of the issue along with the complete code base.

The task of the large model is to edit the code base to solve the problem.

In practice, researchers represent modifications as patch files, which specify which lines in the code base are to be modified to fix the problem.
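To make the patch format concrete, here is a minimal sketch that builds a unified diff with Python's standard `difflib` module. The file name `shapes.py` and the `area` function are invented for illustration. In the benchmark the model writes the patch itself, so this only shows what such a patch file looks like.

```python
import difflib


def make_patch(path, before, after):
    """Return a unified diff that turns `before` into `after` for the file at `path`."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


# Hypothetical buggy file and its fix: the bug adds instead of multiplying.
before = "def area(w, h):\n    return w + h\n"
after = "def area(w, h):\n    return w * h\n"

patch = make_patch("shapes.py", before, after)
print(patch)
```

The printed patch names the file, marks the removed line with `-` and the added line with `+`, and leaves unchanged lines as context.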

How do you judge whether the solution an LLM proposes is any good?

The researchers use the Unix patch program to apply the generated patch to the code base, then run the unit and system tests associated with the task instance.

If the patch applies successfully and all of these tests pass, the solution proposed by the LLM is considered to have resolved the issue.

The benchmark's metric is the percentage of task instances resolved.
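The scoring rule above can be written down as a short sketch. It assumes the patch-application result and per-test outcomes have already been recorded; the function names are hypothetical, and treating an instance with no recorded tests as unresolved is an assumption of this sketch, not something the article states.

```python
def is_resolved(patch_applied, test_results):
    """True only if the patch applied and every associated test passed.

    `test_results` maps test name -> bool (True means the test passed).
    An empty result set is treated as unresolved (assumption of this sketch).
    """
    return bool(patch_applied and test_results and all(test_results.values()))


def resolved_percentage(outcomes):
    """Percentage of task instances resolved.

    `outcomes` is a list of (patch_applied, test_results) pairs, one per task.
    """
    if not outcomes:
        return 0.0
    resolved = sum(1 for applied, results in outcomes if is_resolved(applied, results))
    return 100.0 * resolved / len(outcomes)


# Four hypothetical task instances: two resolved, one patch that failed to
# apply, and one patch that applied but broke a test.
example = [
    (True, {"test_a": True, "test_b": True}),
    (False, {"test_a": True}),
    (True, {"test_a": True, "test_b": False}),
    (True, {"test_a": True}),
]
print(resolved_percentage(example))
```

As a sanity check on the arithmetic, 45 resolved instances out of 2,294 would give 100 × 45 / 2294 ≈ 1.96%.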

Building a unique dataset for SWE-bench

Traditional NLP benchmarks usually only involve short input and output sequences and consider some "artificial" problems created specifically for the benchmark.

In contrast, when building SWE-bench, the researchers gave the dataset several distinctive properties.

For one, it uses real software engineering tasks.

Since each task instance in SWE-bench contains a large and complex code base and related problem description, solving SWE-bench requires complex skills and knowledge possessed by experienced software engineers, but these are usually not evaluated in traditional code generation benchmarks.

Moreover, the collection process can be easily applied to any Python repository on GitHub with little to no human intervention.

Therefore, researchers can extend SWE-bench by continuously adding new task instances and evaluate language models on issues created after their training cutoff date, ensuring that the training corpus does not contain the solutions.
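A minimal sketch of that idea: keep only the task instances created after a model's training cutoff. The `created` field, the task ids, and the cutoff date are hypothetical values chosen for illustration.

```python
from datetime import date


def tasks_after_cutoff(tasks, training_cutoff):
    """Keep tasks created strictly after the model's training cutoff date."""
    return [task for task in tasks if task["created"] > training_cutoff]


# Hypothetical task records; only the creation date matters here.
tasks = [
    {"id": "repo-a-101", "created": date(2022, 5, 1)},
    {"id": "repo-b-202", "created": date(2023, 3, 15)},
]

unseen = tasks_after_cutoff(tasks, training_cutoff=date(2023, 1, 1))
print([task["id"] for task in unseen])
```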

In addition, the researchers ensured diverse long inputs, robust evaluation, cross-context code editing, wide scope of solutions, etc. in the benchmark.

Fine-tuning SWE-Llama

Next, it is time to evaluate how open and proprietary models perform in the SWE-bench framework.

However, the researchers found that off-the-shelf fine-tuned CodeLlama models cannot follow detailed instructions to generate code edits across an entire repository, and often output placeholder responses or irrelevant code.

To evaluate the capabilities of such open models, the researchers performed supervised fine-tuning (SFT) on the 7-billion-parameter and 13-billion-parameter CodeLlama-Python models.

The resulting SWE-Llama models are specialized repository editors that can run on consumer-grade hardware and resolve GitHub issues.

Big models are defeated

Next, the researchers evaluated GPT-3.5, GPT-4, Claude 2, and the fine-tuned models.

It turns out that all of the models struggled: they could not solve anything but the simplest problems.

For example, in the oracle setting, where the model is given the files edited by the reference PR, Claude 2 and GPT-4 resolved only 4.8% and 1.7% of the tasks respectively.

With the BM25 retriever, Claude 2's performance dropped further to 1.96%.

Difficulty varies across repositories.

If you break down performance by repository, you'll see that all models show similar trends across repositories.

Nonetheless, the issues resolved by each model do not necessarily overlap extensively. For example, in the oracle setting, Claude 2 and SWE-Llama 13b performed comparably, resolving 110 and 91 instances respectively.

Difficulty is related to context length.

Models may be pretrained on long sequences of code, but they are typically asked to generate a single function at a time, with only limited context to frame the problem.

As shown in the figure, Claude 2's performance drops significantly as the total context length increases, a trend that can also be observed in other models.

Even though increasing BM25's maximum context size improves recall of the oracle files, performance still decreases, because the model simply cannot locate the problematic code in such a vast context.
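For readers unfamiliar with BM25, the sketch below implements the standard Okapi BM25 scoring formula over a toy set of files and measures recall against a set of oracle files. The tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the sample file contents are assumptions chosen for illustration; the paper's actual retrieval setup may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into identifier-like tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Rank document names by Okapi BM25 score against the query, best first."""
    if not documents:
        return []
    tokenized = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for toks in tokenized.values():
        doc_freq.update(set(toks))

    query_terms = set(tokenize(query))
    scores = {}
    for name, toks in tokenized.items():
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


def recall_at_k(ranked, oracle_files, k):
    """Fraction of oracle files that appear among the top-k retrieved files."""
    return len(set(ranked[:k]) & set(oracle_files)) / len(oracle_files)


# Toy repository and issue text, invented for illustration.
files = {
    "parser.py": "def parse_date(text): return datetime.strptime(text, fmt)",
    "utils.py": "def slugify(value): return value.lower()",
    "views.py": "def index(request): return render(request)",
}
issue = "parse_date raises ValueError on ISO dates"

ranked = bm25_rank(issue, files)
print(ranked[0], recall_at_k(ranked, ["parser.py"], k=1))
```

Raising `k` can only raise recall in this sketch, but every extra file also lengthens the context the model has to search, which is the trade-off described above.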

Difficulty is not related to the date the problem was solved.

Table 7 shows model results by date, for PRs created before or after 2023, under the "oracle" retrieval setting.

For most models, with the exception of GPT-4, there is little difference in performance before or after this date.

In addition, the study found that fine-tuned models are sensitive to shifts in context distribution, that generating patches is easier than generating entire files, and that large models tend to produce shorter, simpler edits.

LLM cannot replace programmers, but it can speed up workflow

Some netizens shared their visions and hopes for the future of the “generalist model”:

Yes, this is also my experience. The generalist model is not good enough and does not have a wide enough context length to code on its own, except for relatively short snippets of code.

But I think it's just a matter of time. I can foresee that in the near future, generalist LLMs with specific training will become very specialized models.

While large models cannot replace programmers, they can speed up their workflow. A team that used to require 10 people may now only require 4 people. This frees up resources for other goals the company has in mind.

Instead of firing people to save money, let developers accomplish great things at breakneck speed!

References:

https://www.reddit.com/r/MachineLearning/comments/1795iiz/can_ai_replace_developers_princeton_and/

https://twitter.com/_carlosejimenez/status/1711714120175681552

https://www.swebench.com/

https://futurism.com/the-byte/stack-overflow-layoffs-ai

https://arstechnica.com/gadgets/2023/10/after-chatgpt-disruption-stack-overflow-lays-off-28-percent-of-staff/?comments=1&comments-page=1