The term modality refers to the way in which something happens or is experienced. Most often, however, the term modality is associated with a sensory modality, or channel of communication or sensation; hence multimodality refers to multiple data modalities (such as images, text and speech). Multimodal artificial intelligence is an emerging field of AI concerned with enabling systems to process and relate multimodal data.
Multimodal vs Unimodal
Traditionally, AI systems have been unimodal: they are designed to perform a particular task such as image processing or speech recognition. Given a single type of input, such as an image or a speech sample, the system identifies the corresponding objects or words.
When an AI system is designed around a single source of information, it ignores vital contextual and supporting information when making its deductions. Multimodal AI posits that by engaging a variety of data modalities, we can better understand and analyze the information.
Challenges in Multimodal Learning
The ability to process multimodal data concurrently is vital for advancements in AI. For instance, it would enable us to refer to an object through multiple modalities, such as vision, text or speech. This, however, requires a comprehensive understanding of the different modalities and the relationships between them. To achieve it, we need to address several key challenges:
- Representation: the ability of an AI system to represent multimodal data with “grounded representations” – a common language for all modalities.
- Translation: the ability of an AI system to translate one modality into another.
- Alignment: the requirement of an AI system to identify associations among elements of different modalities.
- Fusion: the ability of an AI system to process multimodal data jointly to perform a prediction task.
- Co-learning: the capacity of an AI system to transfer knowledge between modalities.
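The fusion challenge above can be made concrete with a minimal sketch. The example below uses hypothetical, randomly generated feature vectors as stand-ins for real image and text embeddings, and fuses them by simple concatenation (so-called late fusion) before feeding them to a toy linear classifier; real systems learn both the encoders and the fusion jointly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-extracted features for one example:
# a 512-d image embedding and a 256-d text embedding.
image_feat = rng.normal(size=512)
text_feat = rng.normal(size=256)

# Fusion: join the two modalities into a single representation
# (here plain concatenation, i.e. "late fusion").
fused = np.concatenate([image_feat, text_feat])  # shape (768,)

# A toy linear classifier over the fused representation
# (3 hypothetical output classes).
W = rng.normal(size=(3, fused.size))
logits = W @ fused
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(fused.shape)   # (768,)
print(probs.sum())   # probabilities sum to 1
```

The design choice here is deliberate: concatenation keeps each modality's information intact and lets the downstream classifier learn how to weigh them, at the cost of ignoring fine-grained interactions that attention-based fusion can capture.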
Multimodal Learning Systems
While addressing these challenges, AI researchers have recently made exciting breakthroughs towards multimodal learning. Some of these developments are summarized below:
- DALL·E is an AI system developed by OpenAI to convert text into an appropriate image for a wide spectrum of concepts expressible in natural language. The system is essentially a neural network consisting of 12 billion parameters.
- ALIGN is an AI model trained by Google on a noisy dataset of a large number of image-text pairs. The model has achieved state-of-the-art accuracy on several image-text retrieval benchmarks.
- CLIP is another multimodal AI system developed by OpenAI that performs a wide set of visual recognition tasks. Given a set of categories described in natural language, CLIP can classify an image into one of these categories without needing training examples for those categories.
- MURAL is an AI model developed by Google AI for image-text matching and for translating from one language to another. The model applies multitask learning to image-text pairs in combination with translation pairs in over 100 languages.
- VATT is a recent Google AI project to build a multimodal model over video, audio and text. VATT can make predictions across multiple modalities from raw data: not only does it generate descriptions of events in videos, it can also pull up videos given a text prompt, classify audio clips and identify objects in images.
- FLAVA is a multimodal model trained by Meta over images and text in 35 different languages. The model performs well on a variety of multimodal tasks.
- NUWA is a joint venture of Microsoft Research and Peking University that produces new, or modifies existing, images and videos for a variety of media-creation tasks. The model is trained on images, videos and text; given a text prompt or a sketch, it can predict the next video frame and fill in incomplete images.
- Florence, released by Microsoft Research, is capable of modeling space, time and modality. The model can solve many popular video-language tasks.
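The zero-shot classification idea behind systems like CLIP can be sketched without the real networks: a jointly trained image encoder and text encoder map their inputs into a shared embedding space, and an image is labeled with whichever candidate caption lies closest. The embeddings below are random stand-ins for the encoders' outputs (an assumption for illustration only), with the image embedding placed near the "cat" prompt.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)

# Candidate categories described in natural language, and hypothetical
# text embeddings standing in for a real text encoder's output.
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.normal(size=(3, 128))

# Pretend the image encoder mapped the input photo near the "cat" prompt.
image_emb = text_embs[0] + 0.05 * rng.normal(size=128)

# Zero-shot classification: pick the caption whose embedding is closest.
scores = [cosine_sim(image_emb, t) for t in text_embs]
predicted = labels[int(np.argmax(scores))]
print(predicted)  # "a photo of a cat"
```

Because the categories are just text prompts, swapping in a new label set requires no retraining, which is what makes the zero-shot setting possible.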
Cross-Modal Applications
Recent developments in multimodal AI have given rise to many cross-modal applications. Some popular applications are:
- Image Caption Generation: Given an image as input, image caption generation deals with producing a description of the image. Image caption generators are used to assist visually impaired people and can automate and accelerate closed captioning for digital content production.
- Text-to-Image Generation: This can be regarded as the reverse of image caption generation. In this case, given text as input, the AI generates a matching image.
- Visual Question Answering (VQA): In VQA, the model takes an image and a text-based question as input and generates a text-based answer as output. VQA differs from traditional NLP question answering because its reasoning is performed over the content of an image, whereas NLP question answering reasons over text alone.
- Text to Image & Image to Text Search: Web search is another fascinating application of multimodal AI where, given a query in a single modality, the search engine retrieves results across multiple modalities. An example of such an AI system is Google’s ALIGN model.
- Text to Speech Synthesis: This type of assistive technology reads digital text. The technology is used with many personal digital devices such as computers, smartphones and tablets.
- Speech to Text Transcription: This technology deals with recognizing spoken language and transcribing it into text. It is used in many applications such as digital assistants (e.g. Apple’s Siri and Google Assistant), medical transcription and speech-enabled technologies (such as websites and TV remotes).
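Text-to-image search from the list above reduces to nearest-neighbor lookup once images and text share an embedding space, as models like ALIGN and CLIP learn. The sketch below uses a hypothetical index of random unit vectors in place of real image embeddings, and places the query embedding near one of them to simulate a matching text query.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical image index: each row is an image embedding from a
# shared image-text space, normalized to unit length.
image_index = rng.normal(size=(5, 64))
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)

# Pretend the text encoder mapped the query close to image 3.
query = image_index[3] + 0.1 * rng.normal(size=64)
query /= np.linalg.norm(query)

# Text-to-image search: rank images by cosine similarity to the query
# (dot products, since all vectors are unit length).
ranking = np.argsort(image_index @ query)[::-1]
print(ranking[0])  # 3
```

At scale, the exhaustive dot product would be replaced by an approximate nearest-neighbor index, but the retrieval principle is the same.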
We, as human beings, have the innate ability to process multiple modalities; the real world is inherently multimodal. The progression towards multimodal learning in AI has the potential to fulfill the field's long-standing ambition to move beyond statistical analysis of a single modality (such as images, text or speech) towards a multifaceted understanding of multiple modalities and their interactions.