An investigation by Proof News has revealed that companies including Apple, Nvidia, Anthropic, and Salesforce used subtitles from YouTube videos to train generative AI models.
This goes against YouTube’s Terms of Service, which warn against using materials without the permission of creators. In total, subtitles from 173,536 YouTube videos from over 48,000 channels were used by some of the world’s wealthiest AI companies.
The dataset, YouTube Subtitles, contained video transcripts from educational channels like Harvard and MIT alongside publications such as the BBC and The Wall Street Journal. It also included material from leading YouTube creators such as PewDiePie and MrBeast, both of whom have more than 100 million subscribers.
Some of the material included in the dataset promoted conspiracy theories like the “Flat-Earth Theory,” while a research paper published by Salesforce revealed that developers had raised concerns that it included “biases against gender and certain religious groups” alongside profanity. Proof News corroborated these findings.
The creation of the dataset may also have violated YouTube’s ToS. The platform prohibits using “automated means” to access its videos. Sid Black, a co-founder of EleutherAI, created the YouTube Subtitles dataset using a script that downloads subtitles from the site’s API using 495 specific search terms. This code was shared on GitHub and has already been bookmarked by over 2,000 users.
The Pile Dataset Continues to Attract Controversy
According to a research paper from the dataset’s creators, EleutherAI, the YouTube Subtitles dataset is part of a larger compilation called the Pile. The Pile also includes material from Wikipedia and the European Parliament and is generally accessible to anyone with internet access and the know-how to find it. EleutherAI did not respond to Proof News’ request for comment. Its site says it provides access to “cutting-edge AI techniques by training and releasing models.”
The dataset even includes content that was deleted from YouTube, such as material from creators who completely erased their online presence. Subtitles from over 12,000 deleted videos were found in the collection and have been incorporated into many AI models.
Wealthy Companies Trained AI Models Using the Pile
According to the Proof News investigation, Apple, Nvidia, and Salesforce used the Pile to train AI models. Documents revealed that Apple used it to train its OpenELM model, released in April, shortly before its WWDC event.
Bloomberg, Databricks, and Anthropic, which landed a $4 billion Amazon investment, also used the dataset to train AI models. A spokesperson for Anthropic confirmed that the Pile had been used to train Claude, the company’s generative AI assistant. They explained that YouTube’s terms only cover “direct use of its platform” rather than the Pile dataset and suggested that it was best to speak to the Pile authors regarding any violation of YouTube’s Terms of Service.
Salesforce used the Pile to build an AI model it claimed was for “academic and research” purposes, but it released the model for public use in 2022. The model has been downloaded over 85,000 times.
Nvidia declined to comment on its use of the Pile, as did representatives for Apple, Bloomberg, and Databricks.
Other Tech Companies Are Using YouTube for AI Training
These tech giants aren’t alone in their use of YouTube to train generative AI. With AI companies in fierce competition, high-quality training data could mean the difference between falling behind and taking the lead.
Earlier this year, The New York Times reported that Google, which owns YouTube, used text from videos on the platform to train its models. According to a spokesperson for the company, this use was permitted under agreements with creators.
The same Times investigation discovered that OpenAI had used YouTube videos without permission, but the company refused to confirm or deny this allegation.
A reporter from The Wall Street Journal asked Mira Murati, chief technology officer for OpenAI, if the company had used YouTube videos to train Sora, its AI model that generates videos from text prompts, but Murati said she was unsure.
The Ethical Concerns of Using YouTube Videos to Train AI
While it’s true that these corporations likely have no idea exactly where the data they’re using comes from, that doesn’t mean they can use it without permission, argue some creators. Dave Farina, host of the YouTube channel Professor Dave Explains, had 140 videos lifted for YouTube Subtitles and explained that these companies profiting off the work of creators are essentially building models that will put the same creators out of work. He argues that regulation or compensation is needed.
Most creators Proof News spoke to were unaware that their data had been taken, and had no idea how it was being used. This means creators have no choice in the use of their data.
Another concern is that AI will soon be able to generate content similar to that which it has been trained on, and in some cases copy it outright. We’ve already seen this in videos where fake voice clones have been used to read another creator’s script, and it seems that very few commenters recognize these videos as fakes.
Many YouTube creators – and artists or creators on other sites – regularly patrol the internet for sites using their work without authorization, issuing takedown notices to offenders. Proof News created a tool to make it easier for creators to check if their content is included in the YouTube AI training dataset, but what they can actually do about that remains unclear.
AI training datasets have been in the news before, most recently in 2023, when Proof News contributor Alex Reisner discovered that Books3, a Pile dataset, included over 180,000 books by renowned authors including Margaret Atwood and Zadie Smith. Many authors sued AI companies, alleging copyright violations and unauthorized use of their work. The platform hosting Books3 took the dataset down.
In some of these cases, litigation is still in its early stages, so it’s unclear how issues of permission and payment will be resolved. In response to the lawsuits filed, companies including Meta, Bloomberg, and OpenAI argued their actions constituted “fair use.”
In the case of the Pile, the dataset has already been removed from the official download site, but can still be accessed on file-sharing services.
Whether content creators should be compensated for the use of their data to train AI models remains hotly debated, as does the issue of permission to use this data. Both questions are sure to draw even more scrutiny as the generative AI race ramps up.