Learnings from ai-PULSE: Challenges in Building Data Sets for LLMs

Generative artificial intelligence (AI) applications such as ChatGPT, DALL·E, and Midjourney are powered by large language models (LLMs) trained on vast data sets.

The effectiveness of these models relies heavily on the quality and diversity of their input sources.

Building valuable data sets presents several challenges for developers, from technical constraints and data quality to legal considerations.

The topic was discussed at length at this month’s ai-PULSE conference in Paris. Below are some key findings and discussion points from the panels held there.

Technological Constraints of LLMs

The large scale of LLMs poses technical challenges regarding computational power and resources. Training these models requires significant processing power and storage capacity, limiting access to researchers and organizations with sufficient infrastructure to run servers continuously for weeks. Additionally, there is a limit to the number of tokens available to incorporate into data sets.

Tokens are the units of text or code that LLMs use to process and generate language.
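As a rough illustration of what a token is, the sketch below splits a sentence into word and punctuation tokens with a regular expression. This is a simplification for illustration only: production LLMs typically rely on learned subword tokenizers, and the function name `simple_tokenize` is hypothetical rather than taken from any panelist's tooling.

```python
import re


def simple_tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (illustrative only)."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches a single punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Garbage in, garbage out.")
print(tokens)       # ['Garbage', 'in', ',', 'garbage', 'out', '.']
print(len(tokens))  # 6
```

Counting tokens this way gives only a ballpark figure; a subword tokenizer would usually produce a different count for the same text.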

Ensuring Data Quality

The quality of the data used to build data sets is crucial. “Data quality is an issue everywhere, so for a large language model, typically, there’s a component where you cannot be too picky,” Launay said.

The LAION data set scraped the Internet for images and ended up with around 5.6 billion pictures, but only 2 billion are of good enough quality to be used, according to Pablo Ducru, co-founder at Raive.

Removing duplication is an essential factor in data quality. “Simply running the de-duplication, which you can do at a very large scale and in a way that’s completely agnostic to the underlying content of the data, gives you a very good result,” Launay said.
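A content-agnostic de-duplication pass of the kind Launay describes could look like the minimal sketch below. It assumes exact-duplicate removal by hashing the raw bytes of each record, which is one simple approach and not necessarily the one used by any team on the panel. The function name `deduplicate` and the sample records are hypothetical.

```python
import hashlib
from typing import Iterable, Iterator


def deduplicate(records: Iterable[bytes]) -> Iterator[bytes]:
    """Yield each record once, keeping the first occurrence.

    Only the raw bytes are hashed, so the same code works for text,
    images, or any other payload without inspecting the content.
    """
    seen: set[bytes] = set()
    for record in records:
        digest = hashlib.sha256(record).digest()
        if digest not in seen:
            seen.add(digest)
            yield record


docs = [b"cat photo", b"dog photo", b"cat photo"]
unique = list(deduplicate(docs))
print(unique)  # [b'cat photo', b'dog photo']
```

This only catches byte-identical copies. Near-duplicates, such as a resized image or a lightly edited paragraph, would need a different technique that is not shown here.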

Preparing data before sending it to the model also helps to improve input quality. For content scraped by a web crawler, this means separating the actual content from artifacts such as ads and boilerplate.
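One simple way to separate content from boilerplate in crawled pages is to drop text that sits inside tags commonly used for navigation, scripts, and footers. The sketch below uses Python's standard `html.parser` module. The list of tags to skip and the names `ContentExtractor` and `extract_content` are assumptions made for illustration, and real crawl pipelines generally use more sophisticated heuristics.

```python
from html.parser import HTMLParser

# Assumed set of tags whose text is treated as boilerplate.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ContentExtractor(HTMLParser):
    """Collect visible text that is not nested inside a skipped tag."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_content(html: str) -> str:
    """Return the page text with boilerplate sections removed."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><nav>Home | About</nav>"
    "<p>Actual article text.</p>"
    "<footer>Subscribe now!</footer></html>"
)
print(extract_content(page))  # Actual article text.
```

A tag-based filter like this misses ads embedded in ordinary paragraphs, so it is best read as a first pass rather than a complete cleaning step.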

Particularly regarding visuals, using high-quality data is essential to creating effective images.

“You have to be very selective — as the saying goes, ‘garbage in, garbage out’ in terms of training, so you need to have a real selection of the visual universe you’re putting in,” Ducru said.

The breadth of training data is also important.

For instance, while MidJourney and Stability AI have sourced data from the Internet, Adobe opted to train Firefly only on stock media.

Ducru said:

“The problem is then it becomes kind of culturally poor — it doesn’t know anything other than stock media and cannot generate something that looks outside of the stock media, so the challenge of generating a model that is both respectful of intellectual property but at the same time has a knowledge breadth that goes beyond stock media is one of the core challenges… and requires a very careful selection of the way you take the visual representations.”

French biotech startup Owkin works closely with hospital clinicians and other medical experts to collect, curate, and standardize data.

“Most of the data we have access to is retrospective data, so they have been collected for the last 10-20 years, sometimes with lots of follow-up for patients.

“So, it’s a high quantity of data requiring a lot of curation to simplify and to standardize.

“Now we are also starting to collect prospectively the data, which allows us to exactly collect the data in the relevant format… which is key. It means we have to build this way of collection at the beginning of the process to ensure we will have the relevant level of quality in the end,” said Agathe Arlotti, Owkin’s Senior Vice President, Partnerships.

“If we access random data, there is a high chance we will not get to the high level of performance we require, or when we want to validate the model, we will have some questions, so we will need to re-access the overall data set to confirm the results. Anticipating the collection is key for us, and closely working with hospitals is a very good way to do that.”

Depth of data is important for Owkin, so it collects multimodal data, including clinical data, pathologist slides obtained from biopsy samples, molecular data, and genetic data that provides information about the patient’s gene activation.

Multimodality of LLMs

Large AI labs are looking into multimodality for more than the novelty of having a model that can understand images, as it simplifies certain tasks and improves the quality of outputs.

“We are reaching the limits of high-quality text we can shove into a model… there’s a big gold rush essentially in finding another modality that will bring new information to the model and genuine new information,” Launay said. “It’s a 10 billion, 100 billion dollar question, how you can get true multimodality — not just putting the modalities next to one another, hooking them up and hoping for the best, but getting them to benefit one another… we’ll see whether an image is worth a thousand words.”

Humans are multimodal by nature, for instance in how we use language to describe images. However, if a text-to-image model learns from photos alone, it conceptualizes the world in two dimensions (2D). Some deep neural networks can extract a form of depth map, effectively 2.5D, but this is information a model cannot learn from images alone.

“The way you fuse the textual world and the visual representation of them really can give way more powerful results at generation,” Ducru said.

Intellectual Property Rights and Regulation

One of the controversial aspects of generative AI tools such as ChatGPT, Midjourney, and Stability AI is that they were trained on data sets compiled without the permission of copyright holders.

This has resulted in several writers, artists, and other content creators filing lawsuits against the developer companies claiming infringement of their copyrights or other intellectual property (IP) rights.

Raive is creating its own proprietary multimedia data set containing images and video, finding what online content it can access and what licenses it can pay for.

“We’re going to essentially open a new market… a new type of licensing for training data,” Ducru said. “The entire multimedia industry relies on licensing copyrights of certain images or videos and collecting royalties on these. This is a new form of royalty for training but without the distribution rights.”

For instance, Meta approached stock media companies, offering to pay upfront to train its models on their catalogs. “Even if it happened to be fair use, it might have value to have better quality data that is easier to access,” Ducru said.

In healthcare, patient data has long been subject to regulation. Concerns around privacy and sensitive personal information pose challenges to developing high-quality LLM data sets.

Handling Patient Confidentiality and AI

Striking a balance between access to data for research purposes and protecting individual privacy is crucial. In Europe, the GDPR governs access to patient data to protect patient rights, ensuring patients have consented to the way their data is being used and have the opportunity to opt out.

If a health data set contains anonymized data, the developer cannot add further data linked to a patient’s identity. Nor can the developer train a new version of a model using pseudonymized data without re-confirming the patient’s consent to the new use of their data.
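The difference between the two kinds of data can be made concrete with a small sketch. It assumes pseudonymization is done by replacing the patient identifier with a keyed hash, so that whoever holds the key can still link new records to the same patient, while anonymization simply drops the identifier. The field names, the key, and the functions `pseudonymize` and `anonymize` are hypothetical, and this is not a description of Owkin's process or a statement of what the GDPR requires.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (stable for a given key)."""
    return hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize(record: dict) -> dict:
    """Drop the identifier entirely, so records can no longer be linked."""
    return {k: v for k, v in record.items() if k != "patient_id"}


key = b"hypothetical-secret-key"  # assumed to be held by the data custodian
visit_1 = {"patient_id": "P-001", "finding": "biopsy A"}
visit_2 = {"patient_id": "P-001", "finding": "follow-up B"}

# Pseudonymized records from the same patient share a pseudonym,
# so later data can still be attached to earlier data.
p1 = pseudonymize(visit_1["patient_id"], key)
p2 = pseudonymize(visit_2["patient_id"], key)
print(p1 == p2)  # True

# Anonymized records carry no identifier to link on.
print(anonymize(visit_1))  # {'finding': 'biopsy A'}
```

Because a pseudonym can still be traced back to a person by the key holder, reuse of such data raises the consent questions described above, whereas anonymized records cannot be enriched per patient at all.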

Ethical committees also need to approve various uses of the data, and development teams must understand legal restrictions in different jurisdictions.

“Realistically today that’s slowing down a lot of research projects, so we need to… revisit how GDPR can be better applied to healthcare and at the same time protect the patient, but still allow us to better innovate in a quicker manner,” Arlotti said.

“If you ask a patient whether or not they agree to give access to the data for research purposes, 90% of them accept that, so we need also to revisit the way we collect consent to make sure data are more easily accessible and can be accessed for AI models.”

Countries such as Japan and Israel have taken the position that if content is used for research purposes, it is fair use, but if it is used for commercial applications, different standards apply.

However, whether a piece of content has been transformed sufficiently to qualify as fair use is difficult to quantify, and more subjective than something readily identifiable, like a trademark.

“It’s up to the courts right now to decide,” Ducru said.

Balancing Creativity, Data, and IP

Models can potentially reach the point of bootstrapping themselves so developers can reduce their reliance on human content they may be unsure about using.

“One of the problems that we have today is the only solutions that have been proposed are constructive, either you do an Adobe Firefly approach where you’re kind of ignorant — or you forbid so you say this is against a content policy.

“There is a way to create a knowledge breadth that has no clear IP attribution, so you anonymize at the moment of annotation, such that if you ‘ask for Beyoncé’ it doesn’t know that there’s a Beyoncé, but it has learned the likeness of an artist,” Ducru said.

“However, how do you make a system where the artist can profit and benefit from this AI revolution, and say, ‘This is my AI, and I want to publish it,’ and start making money and getting royalties out of this?”

The Bottom Line

Building data sets for LLMs is a complex task that is increasingly coming up against the constraints of technological capacity, data quality, regulations, and individual rights.

As more developers train LLMs, the challenges of building valuable data sets that take the technology forward will become more apparent.

Researchers and developers must collaborate in finding solutions that balance the need for extensive and diverse data sources with ethical and responsible use.