Elon Musk says AI has already gobbled up all human-produced data to train itself and now relies on hallucination-prone synthetic data

AI development demands massive resources: water, energy, and an estimated $1 trillion in investment. But Elon Musk has highlighted another crucial bottleneck: the depletion of human-generated data available for training.

Speaking with Stagwell CEO Mark Penn in an interview streamed on X, Musk explained that AI systems learn by consuming vast troves of human-created content, including the internet, books, and videos. That reservoir of information, he argued, has now been largely exhausted.

“The cumulative sum of human knowledge has been exhausted in AI training,” Musk said. “That happened basically last year.”

To continue improving, AI systems increasingly rely on synthetic data: artificially generated training material. Musk likened the process to an AI system writing an essay and then grading its own work.
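The generate-then-self-grade loop Musk describes can be sketched in miniature. This is a toy illustration, not any lab's actual pipeline: the "model", the scoring function, and the acceptance threshold below are all invented stand-ins. The structure, though, mirrors the idea: the system proposes synthetic examples, scores its own proposals, and keeps only the ones it rates highly as new training data.

```python
import random

def generate(rng):
    # Stand-in for a model sampling one synthetic training example.
    # Here an "example" is just a list of five random numbers.
    return [rng.random() for _ in range(5)]

def self_grade(example):
    # Stand-in for the model grading its own output on a 0.0-1.0 scale.
    # Real systems might use a reward model or a verifier here.
    return sum(example) / len(example)

def build_synthetic_dataset(n_candidates=100, threshold=0.6, seed=0):
    rng = random.Random(seed)
    kept = []
    for _ in range(n_candidates):
        example = generate(rng)
        if self_grade(example) >= threshold:
            kept.append(example)  # accepted into the synthetic training set
    return kept

dataset = build_synthetic_dataset()
print(f"kept {len(dataset)} of 100 candidates")
```

The weakness Musk points to is visible even in this sketch: the grader and the generator are the same system, so any blind spot in `self_grade` is silently baked into the data the next model trains on.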

Major tech companies, including Microsoft, Google, and Meta, are already leveraging synthetic data. For instance, Google DeepMind trained its AlphaGeometry system on 100 million synthetic examples to bypass the limitations of human-generated data. OpenAI recently introduced an AI model capable of fact-checking its own output to refine its learning process.

Challenges of Synthetic Data

While synthetic data offers a way forward, Musk cautioned that its use raises concerns. It amplifies the risk of "hallucinations," where AI generates incorrect or nonsensical information and presents it as fact. This phenomenon, often called "AI slop," has already contributed to the spread of unreliable content online. Meta's president of global affairs, Nick Clegg, acknowledged in February the need for clear distinctions between human-generated and synthetic content, saying, "As the difference between human and synthetic content gets blurred, people want to know where the boundary lies."

Human Data: A Finite Resource

The finite nature of human-created data for AI training is widely recognized. A June study by research group Epoch AI predicts that publicly available data for training large AI models could run out between 2028 and 2032. The depletion of training resources may hinder AI development, as scaling up models has been essential for improving their capabilities.

“There is a serious bottleneck here,” said Tamay Besiroglu, a co-author of the study. “If you start hitting those constraints about how much data you have, then you can’t really scale up your models efficiently anymore.”

Data scarcity isn’t solely due to AI consumption; it’s also driven by growing restrictions from data owners. A July study by the MIT-led Data Provenance Initiative found that some web domains have reduced AI access to their content by as much as 45%, reflecting a trend of data owners seeking fair compensation or control over their information.

The Future of AI Training

Despite these challenges, tech companies are adapting. Alongside synthetic data, they are leveraging private datasets and striking deals with content providers. OpenAI has reportedly resorted to transcribing podcasts and YouTube videos for training, though such practices raise potential copyright issues.

Synthetic data remains a central focus for future AI training. Speaking at a 2023 Sohn Conference Foundation event, OpenAI CEO Sam Altman acknowledged the impending data shortage but expressed optimism about synthetic data.

“As long as you can get over the synthetic data event horizon where the model is good enough to create good synthetic data, I think you should be alright,” Altman said.