As the old expression goes, “A picture is worth a thousand words,” and over the past year, multimodality – the ability to provide inputs in multiple formats such as text, image, and voice – has emerged as a competitive necessity in the large language model (LLM) market.
Just earlier this week, Google announced Assistant with Bard, a generative AI-driven personal assistant that combines Google Assistant with Bard and will enable users to manage personal tasks via text, voice, and image input.
This comes just a week after OpenAI announced the release of GPT-4V, allowing users to enter image inputs into ChatGPT. It also comes the same week as Microsoft confirmed that Bing Chat users would have access to the popular image generation tool DALL-E 3.
These latest releases from OpenAI, Google, and Microsoft highlight that multimodality has become a critical component for the next generation of LLMs and LLM-powered products.
Training LLMs on multimodal inputs will inevitably open the door to a range of new use cases that weren’t available with text-to-text interactions.
The Multimodal LLM Era
While the idea of training AI systems on multimodal inputs isn’t new, 2023 has been a pivotal year for defining the type of experience generative AI chatbots will provide going forward.
At the end of 2022, mainstream awareness of generative AI chatbots was largely defined by the newly released ChatGPT, which provided users with a verbose text-based virtual assistant that they could ask questions much like Google search (although the solution wasn’t connected to the internet at this stage).
It’s worth noting that text-to-image models like DALL-E 2 and Midjourney were released earlier in 2022, but the utility of these tools was confined to the creation of images rather than providing users and knowledge workers with a conversational resource in the way that ChatGPT did.
It was in 2023 that the line between text-centric generative AI chatbots and text-to-image tools began to blur. This was a gradual process but can be seen to emerge after Google released Bard in March 2023 and subsequently gave users the ability to enter images as input just two months later at Google I/O 2023.
At that same event, Google CEO Sundar Pichai noted that the organization had formed Google DeepMind, bringing together its Brain and DeepMind teams to begin working on a next-generation multimodal model named Gemini, and reported the team was “seeing impressive multimodal capabilities not seen in prior models.”
At this point in the LLM race, while ChatGPT and GPT-4 remained the dominant generative AI tools on the market, Bard’s support for image input and connection to online data sources were key differentiators from competitors like OpenAI and Anthropic.
Microsoft also started moving toward multimodality in July, adding support for image inputs to its Bing Chat virtual assistant, which launched back in February 2023.
Now, with the releases of GPT-4V and Assistant with Bard offering support for image inputs and, in the case of the latter, voice inputs, it is clear that there is a multimodal arms race occurring in the market. The goal is to develop an omnichannel chatbot capable of interacting with text, image, and voice inputs and responding appropriately.
What Multimodal LLMs Mean for Users
The market’s shift towards multimodal LLMs has some interesting implications for users, who will gain access to a much wider range of use cases spanning text-to-image and image-to-text tasks.
For instance, a study released by Microsoft researchers experimented with GPT-4V’s capabilities and found a range of use cases across computer vision and vision language, including image description and recognition, visual understanding, scene text understanding, document reasoning, video understanding, and more.
A particularly interesting capability is GPT-4V’s ability to manage “interleaved” image-text inputs.
“This mode of mixed input provides flexibility for a wide array of applications. For example, it can compute the total tax paid across multiple receipt images,” the report said.
“It also enables processing multiple input images and extracting queried information. GPT-4V could also effectively associate information across interleaved image-text inputs, such as finding the beer price on the menu, counting the number of beers, and returning the total cost.”
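To make the receipt example concrete, the sketch below builds an interleaved image-text request payload of the kind vision-capable chat APIs accept, where a single user message mixes text parts and image parts. The model identifier and image URLs here are illustrative placeholders, not real resources, and no API call is made.

```python
# Sketch: constructing an interleaved image-text message for a
# vision-capable chat model (GPT-4V-style API). Placeholders only;
# the model id and URLs below are assumptions for illustration.

def build_interleaved_message(question: str, image_urls: list[str]) -> dict:
    """Combine one text part with several image parts in a single
    user message, so the model can reason across all images at once."""
    parts = [{"type": "text", "text": question}]
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": parts}

# Example: ask for the total tax across multiple receipt images.
request = {
    "model": "gpt-4-vision-preview",  # hypothetical/placeholder model id
    "messages": [
        build_interleaved_message(
            "What is the total tax paid across these receipts?",
            [
                "https://example.com/receipt1.png",  # placeholder URL
                "https://example.com/receipt2.png",  # placeholder URL
            ],
        )
    ],
}
```

Sending this payload to a vision-enabled chat endpoint would let the model read the amounts off each receipt image and sum them in its text response.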
Challenges to Overcome
It’s important to note that while multimodal LLMs open the door to a range of use cases, they’re still vulnerable to the same limitations as text-to-text LLMs. For instance, they still have the potential to hallucinate and respond to users’ prompts with facts and figures that are provably false.
At the same time, enabling other formats, like images, as input presents new challenges. OpenAI has quietly been working to implement guardrails to stop GPT-4V from being used to identify persons and compromise CAPTCHAs.
A study released by the vendor has also highlighted multimodal jailbreaks as a significant risk factor. “A new vector for jailbreaks with image input involves placing into images some of the logical reasoning needed to break the model,” the study said.
“This can be done in the form of screenshots of written instructions or even visual reasoning cues. Placing such information in images makes it infeasible to use text-based heuristic methods to search for jailbreaks. We must rely on the capability of the visual system itself.”
These concerns align with another study released earlier this year by Princeton University researchers who warned that the versatility of multimodal LLMs “presents a visual attacker with a wider array of achievable adversarial objectives,” essentially widening the attack surface.
With the LLM arms race going multimodal, it’s time for AI developers and enterprises to consider potential use cases and risks presented by this technology.
Taking the time to study the capabilities of these emerging solutions will help organizations make sure they get the most out of adoption while minimizing risk.