Artificial intelligence training data is expensive, which puts it largely within reach of deep-pocketed technology companies alone. That's why Harvard University plans to release a public dataset of about 1 million public domain books spanning a variety of genres, languages and authors, including Dickens, Dante and Shakespeare, whose works are no longer protected by copyright due to their age.

The new dataset hasn't been released yet, and it's unclear when or how it will be. The books it contains come from Google Books, Google's long-running book-scanning project, so Google will be involved in releasing "the broad applications of this trove of books."

Harvard University first previewed the Institutional Data Initiative (IDI) back in March, outlining its plans to create a "trusted channel for artificial intelligence legal data." Little had been heard about the program until its official launch today, which comes with financial backing from Microsoft and OpenAI.

Greg Leppert, IDI's executive director, said the dataset is intended to "level the playing field" by opening such a massive dataset to anyone who wants to train large language models (LLMs), from research labs to AI startups.