#158 AI’s Insatiable Appetite for Data Is Never-Ending
Fresh & Hot curated AI happenings in one snack. Never miss a byte 🍔
This snack byte will take approx 3 minutes to consume.
AI BYTE # 📢: AI’s Insatiable Appetite for Data Is Never-Ending
In the fast-paced growth era of AI, the demand for data has skyrocketed, reaching a critical juncture: demand is at risk of outstripping the supply of high-quality text data within a mere two to four years.
This prediction, as reported by Deepa Seetharaman in The Wall Street Journal, underscores a looming crisis in the AI industry: the internet, our vast repository of knowledge and communication, may not be expansive enough to satiate AI’s growing hunger for data.
The Race for Tokens: GPT-4 and Beyond
OpenAI’s GPT-4, a marvel of generative AI, was trained on an astonishing 12 trillion tokens—words and parts of words that form the building blocks of its learning. Yet, the anticipated GPT-5 looms on the horizon, with an appetite for up to 100 trillion tokens, dwarfing the entirety of useful language and images currently available on the web.
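For a concrete sense of what a token is, here is a minimal sketch using OpenAI’s open-source tiktoken library; the sample sentence and counts are illustrative, not drawn from any training set:

```python
# A minimal tokenization sketch using OpenAI's open-source tiktoken
# library (pip install tiktoken). Illustrative only.
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

text = "AI's insatiable appetite for data is never-ending."
tokens = enc.encode(text)

print(tokens)              # a list of integer token IDs
print(len(tokens))         # roughly one token per word or word fragment
print(enc.decode(tokens))  # decoding round-trips to the original text
```

Corpus sizes like the 12 trillion figure above are counts of exactly these units across the entire training set.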
This insatiable demand has sparked a fierce competition for the dwindling data reserves, leading companies to scour every digital nook and cranny, from chat logs to long-forgotten personal photos on defunct social media platforms.
The Resurrection of Digital Relics
Tech companies are now vying to license the 13 billion photos and videos from the once-dominant image-hosting site Photobucket, breathing new life into the relics of the MySpace and Friendster era.
This scramble for data has pushed some AI firms to the ethical brink, with reports of corner-cutting, policy-ignoring, and law-bending practices to acquire the necessary data.
After scraping much of the open internet and realizing that few avenues remain, and that further scraping risks legal disputes with content creators and news publications, OpenAI has struck a licensing deal with News Corp, the parent company of The Wall Street Journal.
In recent weeks, it has also inked licensing deals with The Atlantic and Vox Media.
Vox Media, known for properties like Vox, The Verge, and Eater, will license its content to OpenAI, giving the Microsoft-backed company access to Vox Media’s archives to enhance its technology and improve the output of its viral chatbot, ChatGPT.
The Atlantic said it is creating an "experimental microsite, called Atlantic Labs," that will also pilot OpenAI's tech, helping the media firm explore how AI can drive development of new products and features.
The Atlantic’s CEO Nicholas Thompson said, “There’s a lot of fear in the media industry about partnering with tech platforms. But I’m absolutely convinced these deals can be beneficial, if we’ve learned the right rules, structure them the right way, and hedge our bets.”
The agreements with The Atlantic and Vox Media come on the heels of several media firms signing similar deals, giving OpenAI access to their news content and archives to train its large language models.
The Ethical Quandary of Data Harvesting
The pursuit of data has not been without controversy. Meta reportedly held internal discussions about acquiring the publishing house Simon & Schuster to gain access to long-form content, and OpenAI used its speech recognition tool, Whisper, to transcribe over 1 million hours of YouTube videos, a practice that potentially flouts the platform’s terms of service. Both episodes raise questions about the lengths to which companies will go to feed their AI models.
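For a sense of the mechanics (though not of OpenAI’s internal pipeline), transcription like this is straightforward with the open-source whisper package; the sketch below assumes a local placeholder file named audio.mp3 and an ffmpeg installation:

```python
# A minimal transcription sketch using the open-source whisper package
# (pip install openai-whisper; requires ffmpeg). The file name is a
# placeholder, and this is an illustration, not OpenAI's actual pipeline.
import whisper

model = whisper.load_model("base")      # small general-purpose checkpoint
result = model.transcribe("audio.mp3")  # returns a dict with the transcript
print(result["text"])
```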
The Original Data Harvesters: Facebook and Google
Facebook and Google, pioneers in the data harvesting race, have built empires on the collection and sale of consumer data to advertisers. Yet, as Google expresses discontent over OpenAI’s use of YouTube videos for AI training, it faces accusations of hypocrisy, given its own scraping of YouTube content.
This is akin to throwing stones from inside a glass house, with tech giants more concerned about protecting their proprietary interests than about user privacy.
Synthetic Data: A Double-Edged Sword
With the quality data pool running dry, AI companies are turning to synthetic data (data generated by AI itself) as a potential solution. However, this approach is not without pitfalls: relying on synthetic data risks perpetuating the biases and inaccuracies of the AI that produces it, drawing parallels to the genetic stagnation of the Habsburg dynasty.
The fear is that this self-referential training could lead to an “inbred mutant” AI, devoid of the diversity and richness of real-world data.
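As a toy illustration of that fear (a hedged sketch, not a claim about any production system), repeatedly refitting a simple model to samples drawn only from the previous generation’s model tends to erode diversity; here the fitted standard deviation of a Gaussian drifts downward over generations:

```python
# Toy simulation of self-referential ("inbred") training: each generation
# is fit only to samples produced by the previous generation's model.
# Sampling noise compounds and the fitted spread tends to shrink,
# a simplified analogue of the diversity loss described above.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # generation 0: the "real-world" data distribution
n = 20                # small samples per generation exaggerate the effect

for generation in range(1, 31):
    samples = rng.normal(mu, sigma, size=n)    # "synthetic data" from the last model
    mu, sigma = samples.mean(), samples.std()  # refit the next "model" to it
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mu={mu:+.3f}, sigma={sigma:.3f}")
```

Real language models are vastly more complex, but this compounding-noise dynamic is the same one researchers have dubbed “model collapse.”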