News publishers are tired of generative AI, along with its ally — web scraping.
This week, a coalition of news publishers filed a new lawsuit against Microsoft and OpenAI, alleging that the two organizations unlawfully used copyrighted articles to train ChatGPT and Copilot without permission or payment.
The coalition includes the New York Daily News, the Chicago Tribune, the Orlando Sentinel, and the San Jose Mercury News, all of which are owned by Alden Global Capital (AGC).
This lawsuit alleges that “Defendants have created those GenAI products [ChatGPT and Copilot] in violation of the law by using important journalism created by the publishers’ newspapers without any compensation.”
AGC’s lawsuit reaffirms the notion that generative AI presents an existential threat to news publishers, who have to compete not only with virtual assistants like ChatGPT as alternative news sources but also with AI-generated news itself.
Why Generative AI is Resting on a House of Cards
Generative AI is sitting on a house of cards — and news publishers are the cards. Large language model (LLM) vendors need to train language models on high-quality written content so they can learn how to process and generate natural language texts. As a result, AI vendors scrape the web for data or use curated repositories like Common Crawl.
The problem is that much of this material is copyrighted.
And rather than ask copyright holders for permission first, many LLM developers have simply used these materials and waited to find out later whether doing so was legal.
Alon Yamin, CEO and co-founder at Copyleaks, told Techopedia:
“LLMs trained on datasets like Common Crawl may face copyright liabilities if they inadvertently incorporate copyrighted content without proper authorization.
“AI developers should implement robust measures to filter copyrighted material and respect intellectual property rights during dataset selection and model training.”
This is clearly highlighted by OpenAI founder and CEO Sam Altman’s testimony to the UK House of Lords.
“Because copyright today covers virtually every sort of human expression — including blog posts, photographs, forum posts, scraps of software code, and government documents — it would be impossible to train today’s leading AI models without using copyrighted materials,” Altman said.
The Morals of LLM Development
While generative AI products may require copyrighted materials in order to be trained to a high standard, there is a legal and moral argument that copyright holders should also be compensated.
According to this week’s complaint:
“The publishers have spent billions of dollars sending real people to real places to report on real events in the real world and distribute that reporting in their print newspapers and on their digital platforms.
“Yet defendants are taking the publishers’ work with impunity and are using the publishers’ journalism to create GenAI products that undermine the publishers’ core businesses.”
In this case, the plaintiffs included multiple excerpts of conversations with ChatGPT and Copilot where the chatbots reproduced excerpts of specific articles.
It is reasonable to sympathize with a publisher and its journalists, who invest time and money into covering a news story only for a third party to use that work as part of another product.
That being said, there is still some legal ambiguity around the use of copyrighted data when training LLMs.
Cache Merrill, founder and CTO at software development firm Zibtek, told Techopedia:
“The legality of scraping for LLM training is currently a gray area in copyright law. It largely depends on the jurisdiction and specific legal frameworks concerning fair use and copyright exceptions, which are now being challenged by these lawsuits.
“The outcome of this lawsuit could set a significant legal precedent for how data is used in training AI. It’s likely that we will see either stricter regulations emerge or a push towards more transparent and compensatory models for data usage in AI.”
In any case, with this lawsuit coming just months after The New York Times sued OpenAI and Microsoft for copyright infringement and “billions of dollars in statutory and actual damages,” it’s clear that the conflict between LLM vendors and publishers over copyright isn’t going to go away anytime soon.
Does a Third Option Even Exist?
Although it’s impossible to tell how this lawsuit will play out, the choice for LLM vendors seems simple: fight a legal battle with publishers until a clear-cut precedent is established, or offer enough compensation to get the publishers to vote for an alternative news source.
OpenAI is already in the process of building partnerships: earlier this month, it announced that it had reached an agreement with The Financial Times to use its content to help train its AI models.
These partnerships may appear to be the future, but publications that believe generative AI products “undermine” their core businesses are unlikely to accept a short-term payoff.
After all, it would be ill-advised to enter into a commercial agreement with a company that could put you out of business.
We’ve already seen the growth in AI-generated “news” as an alternative – with research from NewsGuard finding 49 news sites that are “almost entirely written by artificial intelligence software.”
Likewise, some users might go to a tool like ChatGPT or Gemini and simply ask “what’s the news?” to view a summary of news events (although with less assurance that the summary is reliable, accurate, or well-curated).
Both AI-generated news and language models as “news sources” have the potential to take traffic away from publishers. For this reason, partnerships such as the one between OpenAI and The Financial Times are likely to be few and far between and extremely expensive.
The Bottom Line
The genie is out of the bottle and, for better or worse, AI-generated content exists.
While such content can’t replace the level of coverage offered by on-the-ground reporters communicating with experts, it can take traffic and revenue away from traditional news platforms, putting the Fourth Estate at risk.
It is a tricky road to navigate — without revenue, the ability of reporters to “deliver the truth to power” or to ask the questions someone doesn’t want to answer evaporates.
And yet AI is not going anywhere — and neither is its appetite for knowledge.