Artificial intelligence needs data to train, operate, and evolve. Data is to AI what energy is to a machine.
The more data it has, and the higher the quality of that data, the better the AI can perform its tasks and improve over time.
But what would happen if the world ran out of data?
A recently revised paper found that if AI development trends continue, all data available online will be exhausted somewhere between 2026 and 2032, or even earlier if models are overtrained.
The first author of the study, Pablo Villalobos, from the research institute Epoch AI, spoke about the study with Live Science.
“If chatbots consume all of the available data, and there are no further advances in data efficiency, I would expect to see a relative stagnation in the field. Models [will] only improve slowly over time as new algorithmic insights are discovered and new data is naturally produced.”
If data does run out, researchers say private data and synthetic data will emerge as leading solutions. But not everyone is convinced this situation will ever become a reality.
Key Takeaways
- A recent study predicts AI could exhaust all publicly available data between 2026 and 2032, hindering further advancements without access to new information.
- Experts propose solutions like private data, transfer learning from rich data sets, and even synthetic data generation to address potential data limitations.
- While some argue AI’s ability to learn and conduct research will allow it to create its own data sources, lessening dependence on external data, others say AI would be limited without data.
Mikhail Dunaev, Chief AI Officer at ComplyControl, a provider of AI-powered risk management and compliance services, told Techopedia that he disagrees with the study’s suggestion that a lack of data will limit the growth of large language models (LLMs).
“I believe that, at this point, there is already enough data available, and future AI development will focus on improving learning algorithms rather than acquiring more data.”
“The study predicts a data shortage in a few years, but given the current rapid speed of AI development, it’s hard to make forecasts so far ahead,” Dunaev said. “Furthermore, over the course of this time, humanity will continue to generate even more data and research, aided by AI itself.”
Jim Kaskade, CEO of Conversica, a conversational AI provider, also spoke to Techopedia about the study. Kaskade said the study’s methodology and projections are robust and well-founded.
“However, we have to take into account the dynamic nature of the internet and data generation – over 2.5 quintillion bytes of data is created every day.
“Social platforms generate 100 trillion of text annually, tweets of 1.5 trillion text content per year. YouTube alone has over 260 million hours of video uploaded per year. People capture and share over 1 trillion photos each year as well.”
Dmytro Shevchenko, a data scientist at Aimprosoft, a custom software development company, who specializes in classic machine learning, computer vision, and natural language processing, told Techopedia that while he agrees with the study, its conclusions are incomplete because they do not account for new developments.
“For example, improvements in data compression algorithms and optimization techniques may significantly reduce the need for vast data.”
“In addition, the use of synthetic data and transfer learning seems promising, but the research does not take into account all possible complexities and limitations of these methods.”
AI Companies Emerge By The Thousands
The AI ecosystem is expanding, with a growing number of new companies developing, integrating, and applying AI. The study recognizes this exponential rise in new AI companies as one of the factors affecting data availability and usage.
According to the global startup data platform Tracxn, as of June 27, there are 75,741 companies working in AI. Some of these are established leaders, while others are AI startups expected to take off in 2024. The number of companies in the sector grows about 10% every month.
Shevchenko told Techopedia that this growth will inevitably affect the global supply of data.
“At the current rate of development of LLMs, with the number of new organizations actively working with LLMs increasing by 10% each month for the past five years, it goes without saying that the threat that available public textual data will be exhausted by 2032 hangs over us.”
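To put the quoted growth rate in perspective, the short sketch below compounds a 10% monthly increase over one and five years. It only works through the arithmetic of the figure cited above. The rate itself comes from the article and is not independently verified, and `growth_multiple` is a hypothetical helper named for illustration.

```python
# Compound the growth rate quoted above: 10% more organizations each month.
MONTHLY_GROWTH = 0.10  # rate quoted in the article, not independently verified


def growth_multiple(monthly_rate: float, months: int) -> float:
    """Return how many times larger a count becomes after compounding."""
    return (1.0 + monthly_rate) ** months


one_year = growth_multiple(MONTHLY_GROWTH, 12)
five_years = growth_multiple(MONTHLY_GROWTH, 60)

print(f"After 12 months: about {one_year:.2f}x")
print(f"After 60 months: about {five_years:.0f}x")
```

Under these assumptions, the count roughly triples each year and grows about 300-fold over five years, which shows how quickly steady monthly growth compounds.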
Can AI Technology Evolve Without Data?
One of the study’s conclusions is that without data, AI tech advancements are not possible. Kaskade from Conversica told Techopedia that without access to new data, AI advancements will be hindered.
“The study highlights that LLMs rely heavily on large-scale, high-quality data for training,” Kaskade said. “A lack of new data would limit these models’ ability to learn from evolving trends and contexts, reducing their effectiveness and accuracy.
“However, the study also suggests potential solutions such as synthetic data generation, transfer learning from data-rich domains, and improvements in data efficiency.”
While Kaskade expressed reservations about synthetic data, he said it could help maintain AI development momentum by providing alternative data sources, even in the absence of new human-generated data.
“If AI were to run out of data due to resource constraints or otherwise, I would assume providers would simply purge the old data in the interest of capturing the new, aside from models trained specifically on prior periods, requiring no recent data to perform their tasks.”
If synthetic data, transfer learning, and the private data industry fail to meet the demands of future AIs, the technology will reach a performance plateau, Kaskade said. The result would be something similar to model drift, a situation where a model’s performance degrades over time because the data it was trained on becomes outdated or irrelevant.
“This would result in models becoming less effective over time as they fail to incorporate new information and trends. Secondly, the absence of fresh data could lead to overfitting, where models become too specialized on the existing data and perform poorly on any new tasks.”
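The drift effect described above can be illustrated with a toy example. This is a minimal sketch under made-up assumptions, not anything taken from the study or from Kaskade: a one-parameter model is fitted once on "period 0" data and then left frozen while the underlying relationship shifts a little each period. The data, the shift of 0.5 per period, and all function names are hypothetical.

```python
def fit_slope(xs, ys):
    """Least-squares slope for a line through the origin: y ~ w * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def mean_abs_error(w, xs, ys):
    """Average absolute gap between the model's predictions and reality."""
    return sum(abs(w * x - y) for x, y in zip(xs, ys)) / len(xs)


def make_period(t, xs):
    """Hypothetical world: the true slope starts at 2.0 and shifts by 0.5 per period."""
    true_slope = 2.0 + 0.5 * t
    return [true_slope * x for x in xs]


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Train once on period 0, then never update the model again.
frozen_w = fit_slope(xs, make_period(0, xs))

# Evaluate the frozen model against each later period.
errors = [mean_abs_error(frozen_w, xs, make_period(t, xs)) for t in range(4)]
for t, err in enumerate(errors):
    print(f"period {t}: mean absolute error = {err:.2f}")
```

The frozen model fits its training period well, and its error grows in every later period because it never sees the newer data.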
Dunaev from ComplyControl said the answer is not acquiring more data but optimizing algorithms. “Given the current pace of development and AI’s capability to generate new data and research, a lack of data is not a significant limitation for future progress,” Dunaev said.
“If AI does run out of data, it will still improve by optimizing learning algorithms and conducting its own research to get new data. So, even with limited data, AI will be able to keep growing and getting better.”
Shevchenko from Aimprosoft is unsure whether the AI models will evolve without problems if there is a data crisis.
“Real data is the backbone of the foundation in AI development as it provides diverse, rich, and contextually relevant information that allows models to learn, adapt, and generalize efficiently to different scenarios,” Shevchenko said.
“Synthetic data generation, transfer learning, and data optimization techniques can mitigate the impact of data scarcity. However, these methods cannot fully replace the richness and contextual relevance of real data.”
The Bottom Line
While AI could devour all publicly available data in the coming years, the future of AI development is a complex issue with no easy answers. Experts disagree on the severity of the data shortage and propose various solutions.
The ever-growing field of AI, with new companies emerging monthly, will undoubtedly place a strain on data availability. However, advancements in data efficiency, transfer learning, and even synthetic data generation have the potential to mitigate the impact of a data shortage.
The bottom line? The future of AI may be a bright one, but the journey will require innovation and a multi-pronged approach to data management and utilization.
References
- [2211.04325] Will we run out of data? Limits of LLM scaling based on human-generated data (arXiv)
- AI models could devour all of the internet’s written knowledge by 2026 | Live Science (Live Science)
- Mikhail Dunaev – ComplyControl | LinkedIn (LinkedIn)
- ComplyControl | Xdata Group Ltd (ComplyControl)
- Jim Kaskade – Conversica | LinkedIn (LinkedIn)
- Conversica | AI-Powered Conversations to Unlock Revenue (Conversica)
- Enterprise Software Development Services Company – Aimprosoft (Aimprosoft)
- Top 10 companies and startups in Artificial Intelligence in the world in May 2024 – Tracxn (Tracxn)