#146 Will AI Models Run Out of Data? A Closer Look at the Data Dilemma
Fresh & Hot curated AI happenings in one snack. Never miss a byte 🍔
This snack byte will take approx 3 minutes to consume.
AI BYTE # 📢: Will AI Models Run Out of Data? A Closer Look at the Data Dilemma
As we delve into the intricate world of AI, one pressing question looms large:
Will AI models eventually run out of data?
This concern arises from the very essence of AI's success: its insatiable appetite for training data.
Anthropic cofounder and AI Index Steering Committee member Jack Clark has noted that foundation models are now trained on meaningful percentages of all the data that has ever existed on the internet.
The growing data dependency of AI models has led to concerns that future generations of computer scientists will run out of data to further scale and improve their systems.
The Data Fueling AI
To appreciate the gravity of this issue, let’s first understand why data is crucial for AI. High-quality data serves as the raw material for training powerful, accurate, and high-performing AI algorithms. Consider the following examples:
ChatGPT: This conversational AI model was trained on a staggering 570 gigabytes of text data, equivalent to about 300 billion words. Without such extensive data, ChatGPT’s responses would lack depth and accuracy.
DALL-E and Lensa: These image-generating AI apps owe their creative prowess to the LAION-5B dataset, comprising 5.8 billion image-text pairs. The quality of this data directly impacts the quality of their generated images.
The Looming Data Crisis
While AI models continue to evolve, our data resources face limitations. Recent research by Epoch AI paints a concerning picture. Here’s what they found:
High-Quality Text Data: If current AI training trends persist, we could exhaust high-quality text data by 2026. Yes, that’s right—within a few short years.
Low-Quality Text and Image Data: The situation worsens for low-quality data. Language data of inferior quality may run dry between 2030 and 2040, while image data could follow suit between 2030 and 2060.
Epoch's research suggests these concerns are warranted. Its researchers generated both historical and compute-based projections of when AI researchers might run out of data: the historical projections extrapolate observed growth rates in the sizes of datasets used to train foundation models, while the compute-based projections adjust that growth rate using forecasts of compute availability.
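To make the flavor of such projections concrete, here is a minimal back-of-the-envelope sketch. The numbers below are entirely hypothetical placeholders, not Epoch's actual estimates: if training datasets grow by a fixed factor each year, the time until the total data stock is exhausted follows from a simple logarithm.

```python
import math

def years_until_exhaustion(stock_tokens, current_usage_tokens, annual_growth):
    """Years until dataset size, growing `annual_growth`x per year,
    reaches the total stock of available tokens.

    Solves stock = usage * growth^t for t.
    """
    if current_usage_tokens >= stock_tokens:
        return 0.0
    return math.log(stock_tokens / current_usage_tokens) / math.log(annual_growth)

# Hypothetical illustration (NOT Epoch's figures): a 500-trillion-token
# stock, 15-trillion-token training runs today, dataset sizes doubling yearly.
years = years_until_exhaustion(5e14, 1.5e13, 2.0)
print(f"stock exhausted in ~{years:.1f} years")
```

The point of the exercise is qualitative: even a very large stock is consumed quickly under exponential growth, which is why the real projections hinge so heavily on the assumed growth rate.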
Synthetic Data: A Ray of Hope?
To combat this impending scarcity, researchers propose using synthetic data: data generated by AI models themselves. The appeal is twofold. Synthetic data offers a way around the depletion of naturally occurring data, and generative AI systems could, in principle, produce data in instances where naturally occurring data is sparse, such as for rare diseases or underrepresented populations.
For example, it is possible to use text produced by one LLM to train another LLM.
However, the road to synthetic data isn’t without pitfalls.
Until recently, the feasibility and effectiveness of training generative AI systems on synthetic data were not well understood. Recent research, however, suggests that this approach has real limitations.
For instance, a team of British and Canadian researchers discovered that models predominantly trained on synthetic data experience model collapse, a phenomenon where, over time, they lose the ability to remember true underlying data distributions and start producing a narrow range of outputs.
With each subsequent generation trained on additional synthetic data, the model produces an increasingly limited set of outputs. In statistical terms, as the number of synthetic generations increases, the tails of the distributions vanish, and the generation density shifts toward the mean.
This pattern means that over time, the generations of models trained predominantly on synthetic data become less varied and are not as widely distributed. This phenomenon occurs across various model types, including Gaussian mixture models and LLMs. The research underscores the continued importance of human-generated data for training capable LLMs that can produce a diverse array of content.
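A toy version of this effect can be simulated with nothing but the Python standard library (a sketch, not the researchers' actual experiment): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Because the maximum-likelihood variance estimate is biased low by a factor of (n-1)/n, the fitted distribution's tails shrink generation after generation and its density drifts toward the mean, which is model collapse in miniature.

```python
import random
import statistics

def collapse_demo(generations=2000, n_samples=50):
    """Fit a Gaussian to samples drawn from the previous generation's fit.

    Each round uses the maximum-likelihood (population) variance, which
    shrinks in expectation by (n-1)/n per generation, so the tails of the
    fitted distribution vanish over time.
    """
    mu, var = 0.0, 1.0  # the "true" underlying distribution: N(0, 1)
    for _ in range(generations):
        samples = [random.gauss(mu, var ** 0.5) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        var = statistics.pvariance(samples, mu)  # MLE: divides by n, not n-1
    return var

random.seed(0)
final_var = collapse_demo()
# The fitted variance ends up far below the original 1.0.
print(f"final fitted variance: {final_var:.3g}")
```

Real model collapse in LLMs is of course far messier than this scalar Gaussian chain, but the statistical mechanism the researchers describe, vanishing tails and density concentrating at the mean, is the same.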
In a similar study published in 2023 on the use of synthetic data in generative imaging models, researchers found that generative image models trained solely on synthetic data cycles—or with insufficient real human data—experience a significant drop in output quality.
The authors label this phenomenon Model Autophagy Disorder (MAD), in reference to mad cow disease.
The study examines two types of training processes: fully synthetic, where models are trained exclusively on synthetic data, and synthetic augmentation, where models are trained on a mix of synthetic and real data.
In both scenarios, as the number of training generations increases, the quality of the generated images declines. Even models merely augmented with synthetic data degrade: for example, the faces generated at steps 7 and 9 of the training loop increasingly display strange-looking hash marks.
From a statistical perspective, images generated with both synthetic data and synthetic augmentation loops have higher FID scores (indicating less similarity to real images), lower precision scores (signifying reduced realism or quality), and lower recall scores (suggesting decreased diversity).
While synthetic augmentation loops, which incorporate some real data, show less degradation than fully synthetic loops, both methods exhibit diminishing returns with further training.
The Way Forward
While the data dilemma is real, it need not spell doom for AI. Researchers continue to explore hybrid approaches, combining real and synthetic data. Additionally, refining AI’s ability to generate diverse content remains critical.
So, will AI models truly run out of data? Perhaps not—but the quest for data abundance and quality continues, ensuring AI’s sustained growth and impact on our world.