Wikipedia is giving AI developers its data to fend off bot scrapers

Wikipedia is trying to discourage artificial intelligence developers from scraping the platform by releasing a dataset specifically optimized for training AI models. The Wikimedia Foundation announced on Wednesday that it has partnered with Kaggle, a Google-owned data science community platform that hosts machine learning data, to release a beta dataset of "English and French Structured Wikipedia Content."


Wikipedia says the Kaggle-hosted dataset is "designed with machine learning workflows in mind," making it easier for AI developers to access machine-readable article data for modeling, fine-tuning, benchmarking, alignment, and analysis. Content in the dataset is openly licensed as of April 15 and includes abstracts, short descriptions, image links, infobox data, and article sections, but excludes references and non-written elements such as audio files.

Wikipedia says Kaggle users can work with "well-structured Wikipedia content in JSON format," which should be a more attractive alternative to "crawling or parsing raw article text." Wikipedia's servers are currently under significant strain as automated AI bots continue to consume the platform's bandwidth. Wikipedia already has content-sharing agreements with Google and the Internet Archive, but the partnership with Kaggle should make the data more accessible to smaller companies and independent data scientists.
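
To illustrate why structured JSON is easier to work with than scraped article text, here is a minimal Python sketch. The record and its field names (name, description, abstract, infobox, sections) are hypothetical, loosely modeled on the elements listed above, and are not the dataset's actual schema.

```python
import json

# Hypothetical record: the field names below are illustrative assumptions,
# not the real schema of the Kaggle-hosted dataset.
SAMPLE_LINE = json.dumps({
    "name": "Example article",
    "description": "A short description",
    "abstract": "A one-paragraph summary.",
    "infobox": {"Founded": "2001"},
    "sections": [
        {"title": "History", "text": "First paragraph."},
        {"title": "Usage", "text": "Second paragraph."},
    ],
})


def summarize(line: str) -> dict:
    """Parse one JSON record and pull out a few fields by key."""
    record = json.loads(line)
    return {
        "name": record.get("name", ""),
        "abstract": record.get("abstract", ""),
        # Sections arrive pre-segmented, so no HTML or wikitext parsing is needed.
        "section_titles": [s.get("title", "") for s in record.get("sections", [])],
    }


summary = summarize(SAMPLE_LINE)
print(summary["name"], summary["section_titles"])
```

Each field is read directly by key, whereas a scraper would have to fetch the page and separate the same elements out of the rendered markup.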

"As a tool and testing platform for the machine learning community, Kaggle is excited to be the hosting platform for Wikimedia Foundation data," said Brenda Flynn, Head of Partnerships at Kaggle. "Kaggle is excited to play a role in ensuring the accessibility, usability, and utility of this data."