As the old expression goes, “A picture is worth a thousand words,” and over the past year, multimodality – the ability to provide inputs in multiple formats such as text, image, and voice – has emerged as a competitive necessity in the large language model (LLM) market.
Just earlier this week, Google announced Assistant with Bard, a generative AI-driven personal assistant that combines Google Assistant with Bard and will enable users to manage personal tasks via text, voice, and image input.
This comes just a week after OpenAI announced the release of GPT-4V, allowing users to enter image inputs into ChatGPT. It also comes the same week as Microsoft confirmed that Bing Chat users would have access to the popular image generation tool DALL-E 3.
These latest releases from OpenAI, Google, and Microsoft highlight that multimodality has become a critical component for the next generation of LLMs and LLM-powered products.
Training LLMs on multimodal inputs will inevitably open the door to a range of new use cases that weren’t available with text-to-text interactions.
The Multimodal LLM Era
While the idea of training AI systems on multimodal inputs isn’t new, 2023 has been a pivotal year for defining the type of experience generative AI chatbots will provide going forward.
At the end of 2022, mainstream awareness of generative AI chatbots was largely defined by the newly released ChatGPT, which provided users with a verbose text-based virtual assistant that they could ask questions much like Google search (although the solution wasn’t connected to the internet at this stage).
It’s worth noting that text-to-image models like DALL-E 2 and Midjourney were released earlier in 2022, but the utility of these tools was confined to the creation of images rather than providing users and knowledge workers with a conversational resource in the way that ChatGPT did.
It was in 2023 that the line between text-centric generative AI chatbots and text-to-image tools began to blur. This was a gradual process but can be seen to emerge after Google released Bard in March 2023 and subsequently gave users the ability to enter images as input just two months later at Google I/O 2023.
At that same event, Google CEO Sundar Pichai noted that the organization had formed Google DeepMind, bringing together its Brain and DeepMind teams to begin working on a next-generation multimodal model named Gemini, and reported the team was “seeing impressive multimodal capabilities not seen in prior models.”
At this point in the LLM race, while ChatGPT and GPT-4 remained the dominant generative AI tools on the market, Bard’s support for image input and connection to online data sources were key differentiators from competitors like OpenAI and Anthropic.
Microsoft also started moving toward multimodality in July, adding support for image inputs to its Bing Chat virtual assistant, which launched back in February 2023.
Now, with the releases of GPT-4V and Assistant with Bard offering support for image inputs and, in the case of the latter, voice inputs, it is clear that there is a multimodal arms race occurring in the market. The goal is to develop an omnichannel chatbot capable of interacting with text, image, and voice inputs and responding appropriately.
What Multimodal LLMs Mean for Users
The market’s shift towards multimodal LLMs has some interesting implications for users, who will gain access to a much wider range of use cases spanning text-to-image and image-to-text tasks.
For instance, a study released by Microsoft researchers experimented with GPT-4V’s capabilities and found a range of use cases across computer vision and vision language, including image description and recognition, visual understanding, scene text understanding, document reasoning, video understanding, and more.
A particularly interesting capability is GPT-4V’s ability to manage “interleaved” image-text inputs.
“This mode of mixed input provides flexibility for a wide array of applications. For example, it can compute the total tax paid across multiple receipt images,” the report said.
“It also enables processing multiple input images and extracting queried information. GPT-4V could also effectively associate information across interleaved image-text inputs, such as finding the beer price on the menu, counting the number of beers, and returning the total cost.”
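To make the receipt example concrete, the sketch below builds an interleaved image-text request payload of the kind vision-capable chat APIs accept, where a single user message mixes text parts and image parts. The model identifier and image URLs here are illustrative placeholders, not real resources, and no API call is made.

```python
# Sketch: constructing an interleaved image-text message for a
# vision-capable chat model (GPT-4V-style API). Placeholders only;
# the model id and URLs below are assumptions for illustration.

def build_interleaved_message(question: str, image_urls: list[str]) -> dict:
    """Combine one text part with several image parts in a single
    user message, so the model can reason across all images at once."""
    parts = [{"type": "text", "text": question}]
    for url in image_urls:
        parts.append({"type": "image_url", "image_url": {"url": url}})
    return {"role": "user", "content": parts}

# Example: ask for the total tax across multiple receipt images.
request = {
    "model": "gpt-4-vision-preview",  # hypothetical/placeholder model id
    "messages": [
        build_interleaved_message(
            "What is the total tax paid across these receipts?",
            [
                "https://example.com/receipt1.png",  # placeholder URL
                "https://example.com/receipt2.png",  # placeholder URL
            ],
        )
    ],
}
```

Sending this payload to a vision-enabled chat endpoint would let the model read the amounts off each receipt image and sum them in its text response.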
Challenges to Overcome
It’s important to note that while multimodal LLMs open the door to a range of use cases, they’re still vulnerable to the same limitations as text-to-text LLMs. For instance, they still have the potential to hallucinate and respond to users’ prompts with facts and figures that are provably false.
At the same time, enabling other formats, like images, as input presents new challenges. OpenAI has quietly been working to implement guardrails to stop GPT-4V from being used to identify persons and compromise CAPTCHAs.
A study released by the vendor has also highlighted multimodal jailbreaks as a significant risk factor. “A new vector for jailbreaks with image input involves placing into images some of the logical reasoning needed to break the model,” the study said.
“This can be done in the form of screenshots of written instructions or even visual reasoning cues. Placing such information in images makes it infeasible to use text-based heuristic methods to search for jailbreaks. We must rely on the capability of the visual system itself.”
These concerns align with another study released earlier this year by Princeton University researchers who warned that the versatility of multimodal LLMs “presents a visual attacker with a wider array of achievable adversarial objectives,” essentially widening the attack surface.
With the LLM arms race going multimodal, it’s time for AI developers and enterprises to consider potential use cases and risks presented by this technology.
Taking the time to study the capabilities of these emerging solutions will help organizations make sure they get the most out of adoption while minimizing risk.