OpenAI has submitted a written statement to the UK’s House of Lords claiming it would be “impossible” to create services like ChatGPT without using copyrighted material.
“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials,” said OpenAI in its evidence (PDF) submitted to the House of Lords communications and digital select committee.
As The Telegraph reports, the submission marks an attempt to lobby for a revision of copyright law in the UK. It comes just after The New York Times announced it was suing OpenAI and Microsoft for billions, alleging that the organizations scraped millions of its articles to train ChatGPT and that the chatbot reproduces “verbatim” excerpts from them.
A Brief Look at OpenAI’s Argument for Fair Use
OpenAI’s evidence acknowledges that ChatGPT is trained on publicly available internet data, which includes copyrighted material, but asserts that “we believe that legally copyright law does not forbid training.”
In a blog post released today, January 8th, OpenAI clarified this stance further, arguing that training AI models on publicly available data is fair use and providing a list of academics, companies, and other groups that have recently submitted comments to the US Copyright Office.
Andrew Ng, co-founder of Coursera and former head of Google Brain and Baidu’s AI Group, also recently published a post on X stating that he “would like to see training on the public internet covered under fair use.”
“I understand why media companies don’t like people training on their documents, but believe that just as humans are allowed to read documents on the open internet, learn from them, and synthesize brand new ideas, AI should be allowed to do so too,” Ng said.
However, whether training large language models (LLMs) on copyrighted materials is in violation of copyright law will ultimately be decided by regulators and not AI advocates. Until then, the ambiguity around the legality of training on copyrighted material hangs over generative AI.
LLMs, Copyright, and Controversy
Ever since ChatGPT reached the mainstream consciousness, LLMs have been surrounded by controversy. Over the past year alone, industry leaders including OpenAI, Anthropic, Google, Midjourney, and Stability AI have all been hit with lawsuits alleging copyright infringement.
It’s also worth noting that back in July 2023, The Daily Mail was reportedly considering legal action against Google over copyright violations, claiming that Google used a cache of 1 million news articles from the Daily Mail and CNN to develop Bard.
More cases of this nature will continue to crop up until a regulator or lawmaker decides whether training LLMs on copyrighted material is fair use. If they decide it isn’t, development will undoubtedly become more complicated.
At the very least, vendors would have to request permission from or pay publishers to use their copyrighted works. We’re already seeing a shift toward these types of arrangements, with OpenAI reportedly in negotiations with dozens of publishers.
Could Regulation Kill Generative AI?
Given the widespread interest and investment we have seen in generative AI over the past few years, it is unlikely that tighter copyright laws would damage the market overnight.
It is more likely that AI vendors will be forced to be more proactive about establishing mutually beneficial relationships with publishers and compensating them for access to past articles, books, and other copyrighted works.
The music industry offers a precedent. File-sharing service Napster went bankrupt over copyright infringement after letting users download copyrighted works without compensation. Spotify later built a thriving streaming service of over 100 million songs by establishing commercial relationships that ensure songwriters, publishers, artists, and other rights holders are fairly compensated.
If such a licensing model is possible for the music industry, then in theory, it should be possible for AI vendors to form partnerships with publishers so they can get access to copyrighted material in exchange for financial compensation.
That being said, the formation of such a commercial partnership rests on publishers not seeing LLMs as an existential threat to their business.
The era of copyright as an afterthought is coming to an end in the generative AI market. As it stands, we will see more and more lawsuits against LLM vendors until a clear legal precedent is established around the legality of training these solutions on copyrighted material.