Turing Post

Token 1.24: Understanding Multimodal Models

and what's so special about them in 2024


At the end of 2023, many experts predicted that 2024 would become the year of multimodal models. For example, Sara Hooker, VP of Research at Cohere, told us, 'Multimodal will become a ubiquitous term.'

But if we sift through our memories, we'll recall that VentureBeat published an article back in 2021 titled 'Multimodal models are fast becoming a reality — consequences be damned.' Why didn't multimodality become a reality then, and why is 'multimodal' now indeed on its way to becoming a ubiquitous term?

Let’s take a look:

Microsoft’s NUWA generating video from text in 2021

OpenAI’s SORA generating video from text in 2024

And though something weird is happening with the guy's limbs in the Sora video, it's obvious: multimodal models have become a reality. From now on, they will only become more realistic. In this article, we explore the latest developments in multimodal models, how this field has evolved, and its future directions. As a bonus, we'll list the latest multimodal models with links to comprehensive surveys on the topic.

  • What does multimodal mean?

  • Language models vs. Multimodal models

  • Single modality → Multimodality, what’s the sense of this transition?

  • Path to Multimodal Foundation Models

  • Multimodal Fusion: how to truly combine different data types?

  • Real-world examples of existing models

  • Resources of helpful surveys

  • Future directions

What does multimodal mean?

In simple terms, it means a model that can work with two or more types of data, or “modalities”, processing and fusing information from a combination of them. The primary modalities are:

  • Text: Written text or code represented as text

  • Vision: Images and videos (a video being a sequence of images)

  • Audio: Speech, music, or any ambient sounds

  • Sensory data: Information from sensors like temperature, motion, pressure, and other measurements relevant to the specific use case

In the context of foundation models, multimodal usually means combining textual and visual data. However, there are other combinations of modalities to explore that could enable progress on higher-order skills illustrated in the image below.
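To make the idea of combining modalities concrete, here is a minimal, hypothetical sketch in plain Python. The "encoders" below are toy stand-ins (not real neural networks, and not any particular model's API): each one maps its modality to a fixed-size vector, and the two vectors are then fused by simple concatenation, one of the most basic fusion strategies.

```python
import numpy as np

EMBED_DIM = 4  # toy embedding size; real models use hundreds or thousands of dimensions

def encode_text(text: str) -> np.ndarray:
    """Toy text encoder: a deterministic stand-in for a real language encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(EMBED_DIM)

def encode_image(pixels: np.ndarray) -> np.ndarray:
    """Toy image encoder: summary statistics instead of a real vision model."""
    flat = pixels.astype(float).ravel()
    stats = np.array([flat.mean(), flat.std(), flat.min(), flat.max()])
    return stats / (np.linalg.norm(stats) + 1e-8)  # normalize to unit length

def fuse(text_vec: np.ndarray, image_vec: np.ndarray) -> np.ndarray:
    """Early fusion: concatenate per-modality embeddings into one joint vector."""
    return np.concatenate([text_vec, image_vec])

image = np.arange(16).reshape(4, 4)  # a fake 4x4 grayscale "image"
joint = fuse(encode_text("a cat on a mat"), encode_image(image))
print(joint.shape)  # one joint vector covering both modalities: (8,)
```

In a real multimodal model, the downstream layers would operate on this joint representation; more sophisticated fusion strategies (such as cross-attention) let the modalities interact rather than just sit side by side.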

An illustration of different data sources and foundation model's skills. Image Credit: https://crfm.stanford.edu/assets/report.pdf

When you start reading research papers, you encounter a myriad of terms related to multimodal foundation models. Let’s sort them out →

The rest of this article is available to our Premium users only. To learn more about multimodal models and be able to use them in your work, please –>


Thank you for reading, please feel free to share with your friends and colleagues and receive one month of Premium subscription for free 🤍
