OpenAI denounces The New York Times: Deliberately guiding ChatGPT to conclude plagiarism

OpenAI’s counterattack is coming. In response to the most high-profile infringement lawsuit in history filed by the New York Times, OpenAI published a long article to express its position. The article directly stated: The entire lawsuit is baseless, and pointed out that the New York Times:

There is a suspicion of deliberately guiding ChatGPT

Concealing information and not telling the complete story

And the overall view from OpenAI is:

(1) The use of copyrighted data for training is reasonable. Without them, where would the most advanced models in the world today come from?

(2) What if you don’t want to be trained? Can exit. The absence of a single data source, including the New York Times, also had no significant impact on the model's performance.

As soon as the news came out, the melon-eating crowd gathered quickly again and became a quarrel.

Support OpenAI's direct "shrimp pig heart":

The New York Times withdrawing from the training data set will actually make the model output quality better (Doge)

Someone asked the model GPT-4 what it thought, and the AI mocked the New York Times mercilessly:

Ng Enda also wrote a lot eloquently, in summary:

I sympathize more with OpenAI than the New York Times. The full-text plagiarism mentioned by the latter is more likely to be caused by the RAG mechanism, and it has been measured that OpenAI has plugged the loophole. It is questioned how much actual damage the New York Times has suffered.

However, the opposing netizens were also merciless, pointing directly at their noses and scolding:

OpenAI, you have too many double standards. Whatever training is reasonable is just to maximize your interests.

You're the one not telling the full story.

OpenAI specific response

Let’s first take a look at the specific stance of OpenAI’s response, which includes four points:

1. Very willing to cooperate with news organizations

OpenAI said it worked hard to support news organizations during the technology design process, meeting with dozens of relevant media outlets, listening to their concerns and providing solutions.

Its original intention is also to support a healthy news ecosystem and achieve mutual benefit, including:

(1) By deploying their products to assist journalists with time-consuming tasks such as analyzing large amounts of public records and translating stories, editors and reporters ultimately benefit.

(2) Teach their AI models world knowledge by training on historical, non-public content.

(3) Display real-time content with attribution information in ChatGPT answers to establish a connection between news publishers and readers.

2. Training is fair use and an exit mechanism is provided.

OpenAI previously warned in a submission to the British House of Lords:

Our model would not work without training on copyrighted content.

Here, OpenAI once again stated that it is reasonable to use public Internet materials to train AI models, which is fair to creators, necessary for innovators, and crucial to the country’s competitiveness.

He also pointed out that this view has been supported by many groups and scholars in the United States. In other countries and regions, such as the European Union, Japan, Singapore, etc., there are even laws supporting the training of copyrighted content.

However, the topic has changed. In line with the principle that "legal rights are not as important to us as being good citizens," OpenAI said that it provides a simple exit process to prevent their AI models from accessing these website data again.

According to reports, the New York Times has adopted this mechanism in August 2023 to withdraw from OpenAI training.

3. "Reflux" is a rare error. We hope users will not intentionally cause it.

The so-called "regurgitation" actually refers to the model output and the training data being exactly the same.

The New York Times listed in the lawsuit the striking similarities between ChatGPT and the news company:

Some netizens were dissatisfied with this formal expression: Isn't it plagiarism?

But anyway, OpenAI’s explanation is:

This rare error occurs when a specific content appears multiple times in the training data, but we have taken steps to prevent this.

Moreover, OpenAI also specifically advises users:

Act responsibly and do not intentionally manipulate models to regurgitate, which is both an inappropriate use of our technology and a violation of our Terms of Use.

However, Marcus and a digital illustrator jointly wrote an article a few days ago, listing how AI models including DALL-E3 "regurgitated data" without explicit prompts, that is, giving some pictures and other content that are obviously similar to scenes in existing works.

And this makes OpenAI's statement somewhat contradictory.

Finally, at the end of this paragraph, OpenAI also said:

The model learns from a huge collection of human knowledge, so any one type of data (including news) is only a small part of the overall training data, and any single data source (including the New York Times) is not important to the model's knowledge learning.

4. The complete story was hidden, and I was surprised and disappointed after receiving the lawsuit.

OpenAI revealed that on December 19 last year, it had actually made constructive progress in negotiations with the New York Times, including real-time display of sources and jumps in answers, and explained to the New York Times:

As with any single source, your content does not contribute meaningfully to the training of our existing models and will not have sufficient impact on future training.

However, OpenAI said it did not expect that it would be directly sued on December 27, and it only learned about it through the New York Times - the whole mood was one of surprise and disappointment.

Here, OpenAI pointed out that regarding the "reflux" situation pointed out by the New York Times (that is, answering verbatim transcripts of the New York Times news), they worked hard to solve this problem and showed sincerity. They also asked the latter to share examples, but were repeatedly rejected.

What’s even more interesting is that OpenAI discovered that the so-called “reflux” content was actually articles that were widely circulated on multiple third-party websites many years ago (that is, not from the New York Times).

And the New York Times may be suspected of deliberately manipulating the prompt words-putting in large paragraphs of the original text to "fool" the model.

OpenAI said that according to their operation, the model is not as exaggerated as shown by the New York Times.

This shows that they either deliberately guided the model or made a careful selection.

Based on the above, OpenAI believes:

The New York Times’ lawsuit is without merit.

However, there are also gentle scenes:

We still want a partnership with it, which reported the first working neural network 60 years ago.

Review

On December 27 last year, the New York Times suddenly submitted a pleading and 220,000 pages of attachments to the district court, suing OpenAI for infringement, and of course Microsoft.

The complaint states that the New York Times articles constitute the largest single proprietary data set used in CommonCrawl to train GPT.

Based on this, they found as many as 100 irrefutable evidence that the output content of ChatGPT is almost identical to the news content of the New York Times.

And sometimes due to hallucination problems, the model will also "spread rumors" in the name of the New York Times, generating some fake news, such as orange juice can cause lymphoma, which also causes trouble for their reputation.

In this regard, the New York Times’ appeal is:

OpenAI and Microsoft are required to destroy models and training data containing infringing material and be responsible for "billions of dollars in statutory and actual damages" related to the illegal copying and use of The New York Times' uniquely valuable works.

Due to sufficient evidence and a strong team of lawyers, netizens called this a "milestone case that witnessed AI infringement" and "I'm afraid it can no longer be dismissed like it was with other publishers before."

It is understood that the New York Times negotiated with OpenAI in April last year, but failed to reach an agreement and OpenAI refused to reach an agreement.

The reason may be the huge amount, especially considering the growth of OpenAI's profits and the increase in similar cases.

One wild guess is that OpenAI may want to settle the matter with a seven- to eight-figure amount (millions of dollars/ten million dollars), but what the New York Times is pursuing is higher compensation and continued royalty income.

Ps. OpenAI’s annual revenue is around US$1.6 billion, and the annual amount spent on purchasing authorized articles and materials for training is between US$1 million and US$5 million.

This time, where do netizens stand?

Some netizens pointed out that the key to this case is "whether training is a fair use", and he believes:

The output of the model may be infringing, but the input is not.

But someone sarcastically said:

When you have billions of dollars, everything is fair use.

Some people also suggested:

I agree to fair use, but only if you open source it.

And someone else echoed:

It’s really important to emphasize nonprofit organizations.

In addition, a writer and netizen expressed dissatisfaction with the exit mechanism proposed by OpenAI and received a lot of support:

It's not enough to opt out and disable your models from reading my personal website. I also need you to double-check and completely remove my content from the training data.

How will it end?

A survey showed that 59% of respondents believe that artificial intelligence companies should not be allowed to use publisher content to train models.

And 70% said companies should compensate publishers if they want to use copyrighted material in model training.

It seems that public opinion is on the side of the New York Times.

How do you think this case should be decided?

Reference links:

[1]https://openai.com/blog/openai-and-journalism

[2]https://x.com/OpenAI/status/1744419710635229424?s=20

[3]https://www.ft.com/content/04861d1e-2e9f-4b92-a294-8d0c223a8287

[4]https://techcrunch.com/2024/01/08/openai-claims-ny-times-copyright-lawsuit-is-without-merit/

[5]https://www.theregister.com/2024/01/08/midjourney_openai_copyright/

[6]https://x.com/AndrewYNg/status/1744433663969022090?s=20

[7]https://x.com/futuristflower/status/1744422698636218807?s=20