Generative AI is on the verge of collapse. Well, that’s according to a recent study on machine learning.
The study, helmed by a University of Oxford team including Google DeepMind researcher Ilia Shumailov, suggests that the “indiscriminate” use of model-generated content to train large language models (LLMs) such as ChatGPT could cause “irreversible defects” in the resulting models, a phenomenon known as model collapse.
Quoting from the paper, “Long-term poisoning attacks on language models are not new — for example, we saw the creation of click, content and troll farms, a form of human ‘language models’ whose job is to misguide social networks and search algorithms.
“What is different with the arrival of LLMs is the scale at which such poisoning can happen once it is automated.”
As the old saying goes, “garbage in, garbage out.” If much of the AI-generated content being created is garbage, filled with misinformation and hallucinations, and that text makes its way into the training data of products like ChatGPT, Gemini, or a host of other LLMs, then it can significantly reduce the quality of their outputs.
This is the world of model collapse — irreparable gibberish under the guise of good answers.
However, the key word here is if. Model collapse can be prevented if researchers take the necessary precautions to process and curate synthetic data when collecting information from the internet.
Key Takeaways
- Model collapse occurs when AI models generate inaccurate outputs due to low-quality training data.
- Many researchers consider model collapse a plausible risk, especially when using synthetic data.
- Ensuring high-quality input data and human oversight can help reduce the risk of model collapse.
- The risk of model collapse decreases with careful data curation and improvements in AI-generated content quality.
- Techniques like reinforcement learning with human feedback can also help improve or maintain AI model performance.
How Likely Is a Model Collapse?
While there’s plenty of hype, anxiety, and sensationalism surrounding AI, many researchers believe that model collapse is a plausible risk that needs to be addressed during development, particularly when models are trained on synthetic data.
Thomas Randall, director of AI market research at Info-Tech Research Group, said: “Model collapse is a risk that organizations should be mindful of, especially if they are using AI models to create synthetic data.
“Synthetic data refers to when an AI model generates statistically identical information to the real-world data it was trained on.
“Examples might be generating data for training exercises, data model testing, or patient data simulation. The danger is when these models use synthetic data as training input and produce inaccurate outputs or perpetuate errors. The result is a degradation in the performance of the AI model.”
The risk of a model collapsing increases the more a model is trained on low-quality data.
So researchers need to do their due diligence on what type of data is being used to train the model. Randall notes that companies can also ask AI vendors what data has been used to fine-tune AI models.
Micah Adams, Principal DevOps at Focused Labs, also agrees that model collapse is a risk that needs to be taken seriously.
“I think model collapse is a credible risk. As we constantly train AI with the corpus of information available on the internet and then publish and share data generated by AI back to the same source, we effectively poison the well.
“Our enthusiasm for the convenience in generating data with AI and LLMs in an unregulated fashion makes this threat all too real.”
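To make that “poisoning the well” feedback loop concrete, here is a deliberately simple toy sketch in Python. It is not the Oxford team’s experiment, and the unigram word-frequency “model” is a hypothetical stand-in for an LLM, but it shows the core mechanism: when each generation is trained only on the previous generation’s output, rare information that happens not to be sampled disappears and can never come back.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "language": 1,000 word types with a long-tailed (Zipf-like) distribution.
vocab_size = 1_000
true_probs = 1.0 / np.arange(1, vocab_size + 1)
true_probs /= true_probs.sum()

# Generation 0 is trained on "real" text sampled from the true distribution.
corpus = rng.choice(vocab_size, size=5_000, p=true_probs)

for generation in range(10):
    # "Train" a unigram model on the current corpus (just count word frequencies)...
    counts = np.bincount(corpus, minlength=vocab_size)
    model_probs = counts / counts.sum()
    print(f"generation {generation}: distinct words remaining = {(counts > 0).sum()}")
    # ...then generate the next corpus from that model and train on it next time.
    corpus = rng.choice(vocab_size, size=5_000, p=model_probs)
```

In this toy setting, the number of distinct words only ever shrinks from one generation to the next, which mirrors the loss of rare, low-probability information that the study warns about at a much larger scale.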
How Big Is the Risk of Model Collapse?
Although many researchers agree that model collapse is a realistic possibility, Nikolaos Vasloglou, VP of Research ML at RelationalAI, told Techopedia that it is unlikely as long as researchers properly prepare model input data.
“I think it is close to zero if data scientists follow the standard guidelines for data prep. Bad data can always sneak into your training dataset, whether you collect them from the web, other LLMs, simulators, etc.
“So cleaning them up has always been a tedious but mandatory process.”
In this sense, if researchers are carefully curating and cleaning data as they should be anyway, then there’s less risk of a model degrading.
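What that curation looks like varies by team, but a minimal sketch of the kind of cleanup Vasloglou describes might pair exact deduplication with a few heuristic quality filters. The specific thresholds below are illustrative assumptions, not an industry standard.

```python
import hashlib
import re

def passes_quality_filters(doc: str) -> bool:
    """Illustrative heuristics only; production pipelines use many more signals."""
    words = doc.split()
    if not 50 <= len(words) <= 20_000:            # drop very short or very long pages
        return False
    if len(set(words)) / len(words) < 0.3:        # drop highly repetitive text
        return False
    letters = sum(c.isalpha() for c in doc)
    if letters / max(len(doc), 1) < 0.6:          # drop markup- or symbol-heavy pages
        return False
    return True

def clean_corpus(docs):
    """Exact-hash deduplication plus the heuristic filters above."""
    seen = set()
    for doc in docs:
        normalized = re.sub(r"\s+", " ", doc.strip().lower())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:                        # skip verbatim duplicates
            continue
        seen.add(digest)
        if passes_quality_filters(doc):
            yield doc
```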
At the same time, it’s also important to consider that the quality of AI-generated content will improve over time, so the contamination effect caused by synthetic data is going to decrease.
As Vasloglou explained, the authors of Llama 3.1 have shared how they use synthetic data generation, iterating multiple times to produce higher-quality synthetic data.
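The general pattern behind that kind of iteration is to keep only generated examples that pass some verification step, then retrain and repeat. The sketch below is a schematic of that loop, not Meta’s actual pipeline; `generate`, `passes_checks`, and `retrain` are hypothetical callbacks standing in for a model call, a verifier (unit tests, a reward model, or human review), and a fine-tuning run.

```python
from typing import Callable, Iterable, List

def build_synthetic_dataset(
    generate: Callable[[str], Iterable[str]],   # hypothetical: model produces candidates for a prompt
    passes_checks: Callable[[str], bool],       # hypothetical: verifier (tests, reward model, human review)
    retrain: Callable[[List[str]], None],       # hypothetical: fine-tune the model on the kept data
    prompts: Iterable[str],
    rounds: int = 3,
) -> List[str]:
    dataset: List[str] = []
    prompts = list(prompts)
    for _ in range(rounds):
        # Only verified outputs make it into the training set, so errors are
        # filtered out instead of being fed back into the next round.
        kept = [
            candidate
            for prompt in prompts
            for candidate in generate(prompt)
            if passes_checks(candidate)
        ]
        dataset.extend(kept)
        retrain(dataset)  # the improved model generates the next round's candidates
    return dataset
```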
Preventing Model Collapse
Addressing model collapse comes down to ensuring human oversight of what data is being input into a model, selecting data from diverse sources, and making sure there’s complete transparency over how it’s being processed.
“Critical human oversight is paramount to ensuring an AI model maintains its performance.
“Using industry-standard techniques, such as reinforcement learning with human feedback, will control model quality — and large model providers, such as OpenAI and Anthropic, frequently use crowd-working companies (such as Surge AI or Appen) to improve their models,” Randall said.
Regular human review and updating of AI models can help prevent issues like data drift or bias, which negatively impact the quality of outputs.
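One lightweight way to put that oversight into practice is to routinely send a small sample of model outputs to human reviewers and watch the approval rate over time. The sketch below is an illustrative monitoring loop with hypothetical numbers; `human_approves` stands in for whatever review workflow a team actually uses.

```python
import random
import statistics
from typing import Callable, List

def check_for_degradation(
    outputs: List[str],
    human_approves: Callable[[str], bool],  # hypothetical review callback
    baseline_approval: float = 0.90,        # illustrative numbers, not a standard
    sample_size: int = 100,
    tolerance: float = 0.05,
) -> float:
    """Sample recent outputs, measure the human-approval rate, and flag a drop."""
    sample = random.sample(outputs, min(sample_size, len(outputs)))
    approval = statistics.mean(1.0 if human_approves(o) else 0.0 for o in sample)
    if approval < baseline_approval - tolerance:
        print(f"Warning: approval rate {approval:.2f} is below baseline "
              f"{baseline_approval:.2f}; review recent training data for drift.")
    return approval
```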
Above all, the key to avoiding model collapse appears to be sticking to high-quality input data. Sources such as Common Crawl web data can be useful for training models, but as more AI-generated content finds its way onto the web, researchers will need to turn to more carefully curated data sources.
The Bottom Line
Model collapse is definitely a risk to consider, but it’s not something to get too worked up about. It is, however, a risk to tackle early; otherwise, there are consequences for both AI and the internet at large.
Companies using high-quality input data and techniques like reinforcement learning with human feedback will likely have a minimal chance of model degradation.