According to a report from WIRED, many websites in the United States have begun to block the snapshot function of the Internet Archive's Wayback Machine: the Wayback Machine is no longer allowed to capture and archive the pages of these news websites. The reason is that AI crawlers capture data and use it to train models.

The current artificial intelligence boom has caused traffic to a large number of websites to decline significantly. AI companies are also finding ways to bypass restrictions and illegally crawl website content, ultimately using the captured data in AI chatbots or to train subsequent AI models.

For websites, this means their content is crawled and used without permission, and their traffic declines as a result. Many websites have therefore explicitly prohibited AI search crawlers from crawling their data in robots.txt.

The Internet Archive and its users are collateral damage:

To protect their legitimate rights and interests, many well-known news outlets, including USA Today and the New York Times, have blocked the Internet Archive's Wayback Machine. These news websites exclude the ia_archiverbot crawler, which is the crawler used by the Internet Archive.
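
An exclusion of the kind described above can be illustrated with a minimal sketch. The robots.txt content, the `news.example` URL, the `SomeBrowser` agent name, and the `is_allowed` helper below are all hypothetical and not taken from any real site; only the `ia_archiverbot` name comes from the article. The sketch uses Python's standard `urllib.robotparser` to show how a crawler that chooses to honor robots.txt would read such a rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; not copied from any real site.
ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://news.example/article.html"  # placeholder URL
    for agent in ("ia_archiverbot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

Note that robots.txt is only a request: it has an effect only on crawlers that choose to check it, which is why it cannot by itself stop crawlers that bypass restrictions.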

Beyond news media, online forums such as Reddit also prohibit the Internet Archive from crawling their content. Reddit has signed licensing agreements with Google and OpenAI that allow these companies to crawl its data and use it to train AI models. For Reddit at least, if the Internet Archive were allowed to crawl its data and AI companies then crawled the Internet Archive's copies, Reddit might no longer be able to sell that data.

The problem is that a lot of content does not exist permanently. The value of the Wayback Machine is that you can see how a web page's content has changed and keep reading it through snapshots after the page is deleted. This is very important to many users.

Therefore, amid the AI craze, news media blocking the Internet Archive from crawling their data is really collateral damage to the Internet Archive and its users: a block aimed at AI companies also shuts out users who rely on these functions for ordinary purposes.

USA Today said this was not directed at the Internet Archive:

A spokesperson for USA Today said that blocking the Internet Archive's crawling is not aimed specifically at the Internet Archive. It is part of the company's regular plan to broadly block all web crawlers.

The Guardian's director of commercial affairs and licensing said the company is in talks with the Internet Archive about the possibility that AI companies misuse content crawled for preservation purposes (there is no clear result yet).

Judging from this situation, more and more media outlets may block the Internet Archive in the future to prevent AI companies from crawling their content through it. In the final analysis, the root cause is still these AI companies.

It is not uncommon for these AI companies to crawl content without authorization and at high frequency. Ultimately, this may change the landscape of the open Internet, pushing more websites to shift from public access to registration-only or even paid access.