AI is running out of internet to consume. While you and I log onto this worldwide web of ours to be entertained (or perhaps not), educated, and connected, companies use the data we leave there to train their large language models (LLMs) and grow their capabilities. It’s how ChatGPT knows not only factual information, but how to string together responses as well: Much of what it “knows” is based on an enormous database of internet content.

But while many companies rely on the internet to train their LLMs, they’re running into a problem: The internet is finite, and the companies developing AI want their models to keep growing, and rapidly. As the Wall Street Journal reports, companies like OpenAI and Google are facing this reality. Some industry estimates say they’ll run out of internet to burn through in about two years, both because high-quality data is becoming scarce and because certain companies are keeping their data out of the hands of AI.

AI needs a lot of data

Don’t underestimate the amount of data these companies need, now and in the future. Epoch researcher Pablo Villalobos tells the Wall Street Journal that OpenAI trained GPT-4 on roughly 12 trillion tokens, which are words and portions of words broken down in ways the LLM can understand. (OpenAI says one token is about 0.75 words, so 12 trillion tokens is roughly nine trillion words.) Villalobos believes that GPT-5, OpenAI’s next big model, would need 60 to 100 trillion tokens to keep up with the expected growth. That’s 45 to 75 trillion words, per OpenAI’s count. The kicker? Villalobos says that even after exhausting all the high-quality data available on the internet, you’d still be short by anywhere from 10 to 20 trillion tokens, or even more.
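To make the token math concrete, here is a minimal sketch of the conversion, assuming OpenAI's rough figure of 0.75 words per token holds across the board (real ratios vary by language and by tokenizer). The function name and constants are illustrative, not part of any OpenAI tool.

```python
# Rough rule of thumb cited above: one token is about 0.75 words.
# This is an approximation, not an exact property of any tokenizer.
WORDS_PER_TOKEN = 0.75
TRILLION = 10**12


def tokens_to_words(tokens):
    """Estimate a word count from a token count using the 0.75 rule of thumb."""
    return tokens * WORDS_PER_TOKEN


# Figures from the article: the projected GPT-5 need and the projected shortfall.
estimates = {
    "GPT-5 need (low)": 60 * TRILLION,
    "GPT-5 need (high)": 100 * TRILLION,
    "Shortfall (low)": 10 * TRILLION,
    "Shortfall (high)": 20 * TRILLION,
}

for label, tokens in estimates.items():
    words = tokens_to_words(tokens)
    print(f"{label}: {tokens / TRILLION:.0f} trillion tokens "
          f"is about {words / TRILLION:.1f} trillion words")
```

By the same rule, the projected shortfall alone works out to somewhere between 7.5 and 15 trillion words.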

Even still, Villalobos doesn’t believe this data shortage will really hit until about 2028, but others aren’t so optimistic—especially AI companies. They see the writing on the wall, and are looking for alternatives to internet data to train their models with.

The AI data problem

There are, of course, a few issues to contend with here. First is the aforementioned data shortage: You can’t train an LLM without data, and giant models like GPT and Gemini need a lot of data. The second, however, is the quality of that data. Companies won’t scrape every conceivable corner of the internet, because there is a deluge of garbage out there. OpenAI doesn’t want to pump misinformation and poorly written content into GPT, since its goal is to create an LLM that can respond accurately to user prompts. (We have already seen plenty of examples of AI spitting out misinformation, of course.) Filtering out that content leaves companies with even less data to work with.

Finally, there’s the ethics of scraping the internet for data in the first place. Whether you know it or not, AI companies have probably scraped your data and used it to train their LLMs. These companies, of course, don’t care about your privacy: They just want data. If they’re allowed to, they’ll take it. It’s a big business, too: Reddit is selling your content to AI companies, in case you didn’t know. Some places are fighting back—the New York Times is suing OpenAI over this—but until there are true user protections on the books, your public internet data is heading to an LLM near you.

So, where are companies looking for this new information? OpenAI is leading the charge. For GPT-5, the company is considering training the model on transcriptions of public videos, such as those scraped from YouTube, using its Whisper transcriber. (It seems possible the company already used the videos themselves for Sora, its AI video generator.) OpenAI is also working to develop smaller models for particular niches, as well as a system for paying providers of information based on how high-quality their data is.

Is synthetic data the answer?

But perhaps the most controversial next step some companies are considering is using synthetic data to train models. Synthetic data is simply information generated from an existing data set: The idea is to create a new data set that resembles the original, but is entirely new. In theory, it can be used to mask the contents of the original data set, while giving an LLM a similar set to train on.

In practice, however, training LLMs on synthetic data could lead to “model collapse.” That’s because the synthetic data contains existing patterns from its original data set. Once an LLM is trained on the same patterns, it can’t grow, and may even forget important pieces of the data set. Over time, you’ll find your AI models returning the same results, since they don’t have the varied training data to support unique responses. That kills something like ChatGPT, and defeats the purpose of using synthetic data in the first place.
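The loss of variety described above can be illustrated with a toy simulation. This is only a sketch and nothing like how LLMs are actually trained: each "generation" here is built purely by resampling the previous generation's data, which stands in (as an assumption) for a model trained only on its predecessor's output. The function name and the numbers are hypothetical.

```python
import random


def resample_generations(seed_data, generations, rng):
    """Repeatedly rebuild a data set by sampling, with replacement, from the
    previous generation, and record how many distinct items survive."""
    data = list(seed_data)
    unique_counts = [len(set(data))]
    for _ in range(generations):
        # Each new generation can only contain items the last one had,
        # so the number of distinct items can never go back up.
        data = rng.choices(data, k=len(data))
        unique_counts.append(len(set(data)))
    return unique_counts


# Start with 1,000 distinct "documents" and run 20 synthetic generations.
counts = resample_generations(range(1000), 20, random.Random(0))
print(counts)
```

Because every generation draws only from the one before it, anything dropped along the way is gone for good. The count of distinct items can shrink or hold steady, but it never recovers, which is a simplified picture of why a model fed only on model output tends toward sameness.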

Still, AI companies are optimistic about synthetic data, to a degree. Both Anthropic and OpenAI see a place for this tech in their training sets. These are capable companies, so if they can figure out a way to implement synthetic data into their models without burning down the house, more power to them. In fact, it’d be nice to know my Facebook posts from 2010 aren’t being used to fuel the AI revolution.
