Model collapse has become one of the AI industry’s most talked-about fears, up there with AI psychosis. Large language models are trained on internet content, an increasing amount of which is produced by other LLMs. The worry is that systems could gradually become less accurate, less diverse, and less connected to reality by duplicating each other’s mistakes instead of learning from humans.
The concern arrives at a strange moment, amid a flurry of AI-related IPOs. The internet has spent the last three years filling itself with AI-generated content: AI-written articles, AI-generated reviews, AI-created social posts, even AI-generated commentary about AI-generated content. As synthetic material floods the web, researchers are asking an uncomfortable question: What happens when future AI systems are trained on all of it?
The theory behind model collapse is simple enough. Like a photocopy of a photocopy, each generation of AI risks becoming slightly blurrier than the last. Rare details disappear first. Unusual perspectives get smoothed away. Over time, models may begin learning from machine-generated approximations of reality rather than reality itself.
The Theory Says Yes — But the Real World Is More Complicated
A landmark 2024 Nature paper put the idea on the map. Researchers found that when generative AI systems are repeatedly trained on data produced by earlier generations of AI, they begin to lose information about the original data distribution. Rare events disappear first. Outputs become less diverse. Eventually, the models start drifting away from reality altogether. The authors described it as a “degenerative process” in which models become poisoned by their own projection of reality.
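The "rare events disappear first" effect can be illustrated with a toy simulation. The sketch below is not the Nature authors' experiment; it assumes a made-up vocabulary of 50 categories with Zipf-like frequencies, and the function names and parameters (`next_generation`, `surviving_categories`, `vocab`, `n`) are hypothetical. Each "generation" only ever sees samples drawn from the previous generation's output, so a category that fails to be sampled once can never come back.

```python
import random
from collections import Counter


def next_generation(samples, n, rng):
    """Fit an empirical distribution to `samples`, then draw n new samples from it."""
    counts = Counter(samples)
    categories = list(counts)
    weights = [counts[c] for c in categories]
    return rng.choices(categories, weights=weights, k=n)


def surviving_categories(generations=30, n=200, vocab=50, seed=0):
    """Return how many distinct categories remain after each generation."""
    rng = random.Random(seed)
    # Assumed stand-in for "human" data: a few common categories, many rare ones.
    weights = [1.0 / (rank + 1) for rank in range(vocab)]
    data = rng.choices(range(vocab), weights=weights, k=n)
    history = [len(set(data))]
    for _ in range(generations):
        # Train only on the previous generation's output; the original data is discarded.
        data = next_generation(data, n, rng)
        history.append(len(set(data)))
    return history


if __name__ == "__main__":
    print(surviving_categories())
```

The printed list can only stay flat or shrink from one generation to the next, and in this toy setup the categories that drop out tend to be the low-frequency ones.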
It’s a compelling theory. It’s also one that has inspired no shortage of headlines warning that AI is slowly eating itself.
The reality, experts say, is more complicated.
“Not in the models people actually use,” said Saachin Bhatt, when asked whether he sees AI collapse in the wild.
Bhatt is the founder of applied AI consultancy Brdge.
“There is no credible public evidence,” he continued, “that GPT, Gemini, Claude, or Llama have degraded because of AI-generated content in their training data.”
According to Bhatt, the most dramatic examples of model collapse rely on a fairly unrealistic setup: repeatedly training a model on its own outputs while throwing away the original human-created data.
“Nobody builds models that way,” he said.
That caveat matters.
Even the Nature paper found that preserving a portion of the original human-created data seemed to reduce degradation. Models trained exclusively on synthetic outputs deteriorated much faster than those that retained access to real-world information.
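A rough way to see why retained human data helps is to compare two toy pipelines. The sketch below is an assumption-laden illustration rather than the paper's method: the "model" is just a Gaussian fitted to its training set, the "real" data is 20 draws from a standard normal, and `final_spread` and `keep_real` are hypothetical names. With `keep_real=0.0` each generation trains only on the previous generation's samples; with `keep_real=0.5` half of every training set is drawn from the original real data.

```python
import random
import statistics


def final_spread(keep_real, generations=200, n=20, seed=0):
    """Spread (std dev) of the training set after repeated fit-and-resample rounds."""
    rng = random.Random(seed)
    # Assumed stand-in for human-created data.
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]
    n_real = round(keep_real * n)
    train = list(real)
    for _ in range(generations):
        # "Train" the model: fit a mean and standard deviation to the current data.
        mu = statistics.fmean(train)
        sigma = statistics.pstdev(train)
        # Generate synthetic data from the fitted model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        # Next training set: a retained share of real data plus synthetic data.
        train = rng.sample(real, n_real) + synthetic
    return statistics.pstdev(train)


if __name__ == "__main__":
    print("synthetic only:", final_spread(0.0))
    print("half real data:", final_spread(0.5))
```

In this setup the synthetic-only run loses almost all of its spread, while the mixed run stays anchored near the spread of the real data. How much real data is enough in a production pipeline is a separate question that this sketch does not answer.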
Still, the concern refuses to go away — partly because the internet itself is changing.
Early Warning Signs Are Present
Every day, more synthetic content enters the online ecosystem. Future AI systems may increasingly be trained on datasets that already contain AI-generated text, images, and videos. The fear is that the models gradually become flatter, safer, more repetitive, and less connected to reality.
“Model collapse is what happens when AI starts learning from AI instead of from us,” said Edwin Trebels, founder of LangOptima.
“Train a model on the output of earlier models, and it loses the edges of reality first: the rare cases, the outliers, the unusual voice. With each generation, it drifts toward a blander, more average version of the world. A photocopy of a photocopy. Recognizable for a while, until reality fades away to oblivion.”
Ravi Kiran Pagidi, a senior AI and data analytics professional, sees the risk in similar terms.
“In my view, the collapse of AI models is very real, however, it gets misconstrued,” he said.
“It does not mean all of a sudden AI models become irrelevant. What happens is that these models degrade over time due to becoming more generalistic in nature, forgetting unique patterns, replicating old mistakes, and losing ties to the original human-generated knowledge.”
The gradual degradation mirrors the distinction made by the Nature researchers between “early model collapse” and “late model collapse.” Early collapse occurs when models begin losing information about low-probability events, i.e., the strange, rare, and unusual examples found in any dataset. Later stages are more severe, producing outputs that bear little resemblance to the original distribution.
If that sounds abstract, Pagidi offers a more vivid description.
“The problem of model collapse is not that one model uses synthetic data,” he said. “The problem is when the internet turns into a hall of mirrors, and future models cannot tell original human signal from machine-generated recycling.”
Fortunately, most researchers do not believe collapse is inevitable.
Bhatt points to evidence suggesting that retaining human-generated data alongside synthetic data can keep errors under control. The bigger challenge may be figuring out how much AI-generated content can enter the pipeline before quality begins to erode in ways that are difficult to detect.
The proposed solutions are surprisingly mundane: better data provenance, better curation, better verification, more human oversight, and less blind scraping.
“None of these is exotic,” Bhatt said. “They are mostly discipline.”
The focus on provenance is also shaping industry priorities.
“Many of the big AI companies are pushing for content provenance standards and watermarking because it is going to help them identify what data is still authentically human, and therefore still valuable to be trained on,” said Andrew Gamino-Cheong, CTO and co-founder of Trustible.
And that points to what may be the most important consequence of the entire debate.
Model Collapse Is Real, But Not Inevitable
The Nature researchers concluded that authentic human-generated data could become increasingly valuable as AI-generated content floods the web. In a world where synthetic content is abundant, genuine human knowledge may become a scarce resource.
“The catastrophic outcome turns out to be a consequence of a bad pipeline, not an inevitable law.” — Saachin Bhatt, Brdge founder
Bhatt believes this shift could reshape competition across the industry.
“Once data quality becomes a performance input rather than a free resource, clean data turns into a moat,” he said. Verified human data, proprietary user interactions, and exclusive licensing agreements may prove more valuable than yet another model release.
Gamino-Cheong argues that established technology companies may already have an advantage because they own many of the tools where humans still create content.
“If model collapse becomes a bigger problem over time, those traditional vendors will have a huge advantage.”
So, if the internet continues filling with synthetic content, the companies best positioned to build the next generation of AI may not be those with the biggest models.
They may be the ones with access to the most human data.
Or, as Pagidi puts it: “The problem is when the internet turns into a hall of mirrors, and future models cannot tell the original human signal from machine-generated recycling.”
