The Internet is a vast treasure house of human knowledge, but it is not inexhaustible, and artificial intelligence (AI) researchers are rapidly depleting it. The rapid progress in AI over the past decade has been driven largely by scaling up neural networks and training them on ever larger amounts of data. This approach has proved very effective for developing large language models (LLMs), such as the model that powers the chatbot ChatGPT. However, some experts warn that this scaling is approaching its limits. One reason is the growing energy demand of computation; another is that LLM developers are running out of the conventional data sets used to train their models.
Recently, a high-profile study quantified this issue and sparked widespread concern. Researchers at Epoch AI, a virtual research institute, predict that by around 2028 the typical data set used to train an AI model will be about as large as the total stock of publicly available text on the Internet. In other words, AI could run out of available training data within four years. At the same time, content owners such as newspaper publishers are beginning to take stricter measures to limit how their data can be used, further exacerbating the "data sharing" crisis.
Although these limitations may slow the development of AI systems, developers are actively looking for solutions. Well-known AI companies such as OpenAI and Anthropic have publicly acknowledged the problem and hinted that they plan to address it by generating new data or finding unconventional data sources. An OpenAI spokesperson said: "We used a variety of sources, including publicly available data, non-public data shared with partners, synthetic data generation, and data provided by AI trainers."
Nonetheless, this data crisis may force a change in how generative AI models are developed, shifting the focus from large, general-purpose language models to smaller, more specialized ones and thereby reshaping the entire AI ecosystem.