Introduction
At the end of 2023, many experts predicted that 2024 would become the year of multimodal models. For example, Sara Hooker, VP of Research at Cohere, told us, 'Multimodal will become a ubiquitous term.'
But if we sift through our memories, we'll recall that VentureBeat published an article back in 2021 titled 'Multimodal models are fast becoming a reality — consequences be damned.' Why didn’t it happen then, and why is 'multimodal' most likely becoming a ubiquitous term now?
Let’s take a look:

Microsoft’s NUWA generating video from text in 2021

OpenAI’s SORA generating video from text in 2024

OpenAI’s SORA generating video from text in 2024
And though something weird is happening with the guy's limb in the first video by Sora, it's obvious: multimodal models have become a reality. From now on, they will only become more realistic. In this article, we explore the latest developments in multimodal models, how this field has evolved, and its future directions. As a bonus, we'll list the latest multimodal models with links to comprehensive surveys on the topic.
What Is a Multimodal Model? (Definition)
Language models vs. Multimodal models
Single modality → Multimodality, what’s the sense of this transition?
Path to Multimodal Foundation Models
Multimodal Fusion: how to truly combine different data types?
Real-world examples of existing models
Resources of helpful surveys
Future directions
What Is a Multimodal Model? (Definition)
In simple terms, a multimodal model is a model that can work with two or more types of data, or “modalities”, and can process and fuse information from a combination of them. The primary modalities are:
Text: Written text or code represented as text
Vision: Images and videos (a video is essentially a sequence of images)
Audio: Speech, music, or any ambient sounds
Sensory data: Information from sensors like temperature, motion, pressure, and other measurements relevant to the specific use case
In the context of foundation models, multimodal usually means combining textual and visual data. However, there are other combinations of modalities to explore that could enable progress on higher-order skills illustrated in the image below.

An illustration of different data sources and foundation model's skills. Image Credit: https://crfm.stanford.edu/assets/report.pdf
When you start reading research papers, you encounter a myriad of terms related to multimodal foundation models. Let’s sort it out →
Multi-modal Large Language Models (MLLMs) in ‘From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness, and Causality through Four Modalities’, Jan 2024.
LLM-based multimodal models in ‘A Survey of Resource-Efficient LLM and Multimodal Foundation Models’, Jan 2024.
Multimodal Foundation Models in ‘Multimodal Foundation Models: From Specialists to General-Purpose Assistants’, Sept 2023.
Large Multimodal Agents in ‘Large Multimodal Agents: A Survey’, Feb 2024.
While these terms differ, they usually refer to essentially the same thing, each viewed through its own lens.
Language models vs. Multimodal models
Following the ideas proposed in the fundamental survey created by Microsoft researchers, we've witnessed the transition of language models from specializing in individual tasks to becoming more broadly applicable. This very shift, from task-specific to task-centric foundation models, is the core idea behind our FMOps series, which we're wrapping up with this edition. We've also explored this historical development in our series on the history of LLMs.
Yet, we lack a similar roadmap for multimodal models. The table below highlights this, posing a question about the next steps toward true multimodal foundation models. The search continues for vision-based and multimodal models that could hold the same transformative potential for their domains as ChatGPT/GPT-4 does for language. While Google Gemini and OpenAI's multimodal GPT-4 are significant advancements, they represent early stages in the vast potential of multimodal models.

Image Credit: https://arxiv.org/pdf/2309.10020.pdf
The latest language models exhibit interesting capabilities, such as interaction and tool use, and lay a foundation for developing general-purpose AI agents. These models are far from perfect, but they are already offered through major cloud platforms like Microsoft Azure, AWS, and others. Whole companies like OpenAI, Cohere, and Anthropic have formed around offering language models, and their clients are integrating these solutions into their business models.
At the same time, there is still a wide gap between the performance of the latest multimodal models, such as OpenAI’s GPT-4 and Google’s Gemini, and that of their language-only counterparts. Their multimodal capabilities are still nascent and require improvement, as a large-scale study by the Shanghai AI Laboratory showed. The authors of the report “On the Opportunities and Risks of Foundation Models” noted the same.
So how would we move further?
Single modality → Multimodality, what’s the sense of this transition?
Building general-purpose agents has been a long-standing goal for AI. One promising approach is to build multimodal models as a step toward more comprehensive, general-purpose agents.
This transition is driven by the fact that real-world data is inherently multimodal, especially in complex domains like healthcare, which involves various types of data such as medical images (X-rays), structured data (test results), and clinical text (patient histories). Multimodal foundation models aim to fuse this diverse information, offering a holistic understanding of a domain and facilitating more accurate predictions, decisions, and insights.
Another fascinating possibility opened by multimodality is finding the similarities and contrasting features between natural languages based on various modalities (image, speech, sign, text).

Image Credit: https://crfm.stanford.edu/assets/report.pdf
Here are some generalized reasons for transitioning to multimodality:
Richer Data Representation: Multimodal models can capture a more complete representation of the world. This comprehensive approach is crucial for applications where information comes in varied forms, such as in healthcare, ambient intelligence, mobile applications, and robotics.
Improved Reasoning: Multiple modalities allow the exploitation of visual aspects to enhance reasoning. This approach mirrors human problem-solving, where visual and textual cues are combined to navigate complex issues. There is also an ongoing discussion of whether foundation models can understand language without grounding, that is, without connecting language to real-world experience and perception.
Bridging Research Areas: The multimodal approach has facilitated collaboration across previously siloed research areas like computer vision, NLP, and robotics. This integration fosters innovative solutions that leverage the strengths of each field, enhancing the AI's ability to understand and interact with its environment.
The McGurk Effect illustrates how we all use visual speech information. The effect shows that we can't help but integrate visual speech into what we 'hear'.
Path to Multimodal Foundation Models
In this section, we focus on the main models and their architectures, moving from the predecessors of multimodal foundation models to the multimodal foundation models themselves.
Supervised pre-training era
It all started with supervised pre-training of vision models on vast, human-labeled datasets like ImageNet and ImageNet21K. It was the time of prominent vision architectures like AlexNet, ResNet, Vision Transformer, and Swin Transformer. These models powered an array of computer vision tasks, from image classification to video action recognition.
Yet, the ambition of AI research to mimic human visual comprehension demanded more. The scalability and diversity of supervision posed a bottleneck, given the expense of human annotation.
Looking for a cheaper source of supervision, the field pivoted towards exploiting web-crawled image-text pairs. The data was noisy but cheap, which enabled the construction of classification datasets like JFT and I2E, propelled models like BiT (“Big Transfer”), and made it possible to scale up the training of a plain vision transformer.
Visual Understanding Models
The Advent of Self-supervised Learning
It was self-supervised learning (SSL) paradigms that changed everything. They removed the need for labeled data, pushing the boundaries of what AI systems could learn and understand from visual inputs alone.
SSL methods include:
Contrastive learning is based on the idea of pulling positive sample pairs together and pushing negative sample pairs apart. Examples include contrastive language-image pre-training (CLIP) and related models like ALIGN, Florence, BASIC, and OpenCLIP.
Non-contrastive learning addresses a caveat of contrastive learning: its need for a large number of negative samples. These methods do not depend on negative samples. Negatives are replaced by asymmetric architectures (BYOL, SimSiam), dimension de-correlation (VICReg, Whitening), or clustering and self-distillation (DINO).
Masked image modeling revolutionized the field by applying the ideas behind BERT to visual tasks, with BEiT as the pioneering model.
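To make the contrastive objective from the list above more concrete, here is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is an illustration under stated assumptions, not the implementation of CLIP or any other model named above: the function names, the toy two-dimensional embeddings, and the fixed temperature of 0.07 are all chosen for this example. Real systems compute embeddings with trained encoders and optimize the loss over large batches.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Image i and text i form the positive pair; every other pairing is a negative.

    The temperature value is an assumption made for this sketch.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity between image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick out its own text among all texts in the batch...
    image_to_text = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # ...and each text should pick out its own image among all images.
    text_to_image = sum(
        cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy embeddings (hypothetical values, not produced by any real encoder).
toy_images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[0.9, 0.1], [0.1, 0.9]]
mismatched_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_matching = symmetric_contrastive_loss(toy_images, matching_texts)
loss_mismatched = symmetric_contrastive_loss(toy_images, mismatched_texts)
print(loss_matching < loss_mismatched)
```

The loss is low when each image embedding is closest to its own text embedding and high when the pairs are mismatched, which is the pull-together, push-apart behavior described above.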
Visual Generation Models
The advent of large-scale image-text datasets has catalyzed the development of foundation image generation models, leveraging innovative techniques such as vector-quantized variational autoencoders (VQ-VAEs), diffusion-based models, and auto-regressive models. This progress has enabled two primary areas of research:
Text-conditioned Visual Generation: This field aims to create authentic visual content, including images and videos, from open-ended text descriptions or prompts. It involves developing generative models that can produce high-quality visuals closely aligned with the provided text. Leading examples of text-to-image generation include DALL-E, DALL-E 2, Stable Diffusion, Imagen, and Parti. Furthermore, text-to-video generation models like Imagen Video and Make-A-Video extend these capabilities to dynamic video content, synthesizing motion visuals based on textual instructions.
Human-aligned Visual Generator: Focused on refining pre-trained visual generators to more accurately interpret and execute human intentions, this area addresses various challenges faced by initial visual generators. Efforts include enhancing spatial controllability, ensuring models more faithfully replicate the details of text prompts, enabling text-based visual editing, and allowing for the customization of visual concepts.
Multimodal Fusion: how to truly combine different data types?
The landscape of AI research is increasingly moving towards the development of general-purpose interfaces, transcending the limitations of models designed for narrow sets of tasks within computer vision (CV). This shift has catalyzed the exploration of versatile AI agents across three key research domains:
Unified Models for Visual Understanding and Generation: Borrowing the concept of unification from Large Language Models (LLMs) in Natural Language Processing (NLP), there's a growing effort to create comprehensive models that can handle a broad spectrum of tasks in computer vision and vision-language interactions. Efforts range from integrating vision and language to handle open-set vision tasks with models like CLIP, GLIP, and OpenSeg, to unifying vision-language understanding across various levels of granularity with models such as UniTAB, Unified-IO, and Pix2Seq-v2. Additionally, the push towards making these models more interactive and capable of responding to prompts has led to promptable models like SAM and SEEM, which segment images in response to user prompts instead of a fixed task definition.
Integration with Large Language Models for Training: This approach aims to harness the extensive capabilities of LLMs in multimodal environments, leading to the creation of models like Flamingo and Multimodal GPT-4. These models are trained end-to-end to navigate multimodal tasks effectively, marking a significant advancement in AI's ability to understand and generate complex visual and textual content.
Chaining Multimodal Tools with LLMs: Leveraging the sophisticated tool-use capabilities of LLMs, recent studies focus on combining LLMs with multimodal foundation models to improve image understanding and generation. By employing a conversational interface, this strategy merges the strengths of NLP and computer vision, fostering the development of AI systems that can process visual inputs and produce responses in a human-like manner. Key examples of this innovative approach include Visual ChatGPT and MM-REACT, which exemplify the potential for creating more dynamic and conversational AI agents.
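The tool-chaining pattern in the last item above can be sketched as a small dispatcher. This is a toy illustration, not how Visual ChatGPT or MM-REACT are implemented: every name here (`Tool`, `plan_tool_calls`, `caption_image`, `detect_objects`) is hypothetical, the vision tools are stubs that return placeholder strings, and simple keyword rules stand in for the LLM that would normally decide which tools to call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Tool:
    """A named capability the planner can invoke on an image."""
    name: str
    description: str
    run: Callable[[str], str]


def caption_image(image_path: str) -> str:
    # Stub: a real system would call an image-captioning model here.
    return f"caption for {image_path}"


def detect_objects(image_path: str) -> str:
    # Stub: a real system would call an object-detection model here.
    return f"objects found in {image_path}"


TOOLS: Dict[str, Tool] = {
    "caption": Tool("caption", "Describe an image in one sentence", caption_image),
    "detect": Tool("detect", "List the objects visible in an image", detect_objects),
}


def plan_tool_calls(user_request: str) -> List[str]:
    """Stand-in for the LLM planner: keyword rules instead of a model call."""
    request = user_request.lower()
    plan = []
    if "describe" in request or "caption" in request:
        plan.append("caption")
    if "detect" in request or "find" in request:
        plan.append("detect")
    return plan


def answer(user_request: str, image_path: str) -> str:
    """Run the planned tools in order and join their observations into a reply."""
    observations = [
        TOOLS[name].run(image_path) for name in plan_tool_calls(user_request)
    ]
    if not observations:
        return "No tool was needed for this request."
    return " | ".join(observations)


print(answer("Describe the picture and detect what is in it", "cat.png"))
```

In a real system, the planner would be an LLM prompted with the tool descriptions, and the tool outputs would be fed back into the conversation so the model can compose a final, human-readable response.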
Real-world examples of existing models

Image Credit: https://arxiv.org/pdf/2311.13165.pdf
We recommend checking the latest surveys to find a comprehensive list of the latest multimodal language models:
June 2023: A Survey on Multimodal Large Language Models
Feb 2024: Large Multimodal Agents: A Survey
Great GitHub repository summarizing the latest multimodal language models linking to the papers, code, and demos: Awesome-Multimodal-Large-Language-Models
Future directions
To enhance multimodal application performance, addressing several fundamental challenges is crucial:
Modalities Expansion: Leveraging diverse sensors and data sources is essential for more comprehensive and precise analysis. For instance, integrating modalities like audio, facial expressions, ECG, and EEG can enrich emotion computation, while combining various medical imaging techniques (CT scans, MRIs, PET) can provide detailed diagnostic information.
Optimizing Training Architectures: The significant computational demands of large models necessitate strategies like distributing computations across clusters and dynamic scheduling to improve training efficiency. Challenges include managing multi-user environments, ensuring model reliability, and facilitating shared computation.
Lifelong/Continual Learning: Moving beyond isolated learning to models that retain and apply knowledge over time is crucial for real-world applications. Developing models with continuous learning capabilities is essential for evolving and improving based on new experiences.
Conclusion
The future of multimodal models is promising, with huge demand from industries such as advanced robotics and self-driving cars. As the recent debate around SORA’s handling of physics demonstrated, multimodal models are still far from human-level understanding. But by enriching models with all sorts of data, we clearly make them much more efficient, and their potential is vast. In healthcare, combining multi-source data could pave the way for truly personalized diagnoses and treatments. Multimodal robots could better understand and interact with their complex environments. The creative industries could see an explosion of tools allowing for the generation of artwork, videos, and more from simple text descriptions.
Of course, challenges remain. We need to address the computational costs of these models, mitigate the potential for biases within datasets, and develop robust evaluation metrics for understanding their full capabilities.
We successfully evolved from task-specific ML to use-case-centric or task-centric AI. Now, we have foundation models that we can fine-tune to our specific use cases while letting them also multitask simultaneously. What a fascinating time of development ahead of us!
Thank you for reading this series. Next week, we will finalize it with a wrap-up and a brief description of each Token.