Images, text, audio, video, motor signals – AI models now work with everything: generating, analyzing, and editing complex content like videos and simulations. Yet even mixing just two types of data remains a technical challenge for both models and developers.
Our reality is inherently multimodal, so our models must be as well. The race among top AI systems like Gemini, ChatGPT, Claude, and DeepSeek to become all-rounders – coupled with surging global interest in embodied AI – only confirms this trajectory.
Multimodal fusion is the key to building AI that actually understands our world – not just text or images in isolation, but how they work together, the way humans naturally perceive reality.
So today we'll explore how data mixing actually happens, what challenges emerge along the way, and what strategies developers typically employ to merge modalities. We'll also dive into a more detailed data fusion workflow through the lens of a fascinating new approach from Meta AI and KAUST called Mixture of States (MoS). This method mixes data at the state vector level within each layer using a learnable router.
Now, let's start from the basics!
In today’s episode, we will cover:
Multimodal fusion and alignment
Main types of multimodal fusion
When does the data combination happen?
How does the multimodal fusion happen?
How to build a multimodal model? The key architectures
What’s new in multimodal fusion? Mixture of States (MoS)
How does MoS work?
Implementation and performance boost
Not without limitations
Conclusion / Why is multimodal fusion important?
Sources and further reading
Multimodal fusion and alignment
Multimodal data fusion typically serves two main goals:
Modalities complement each other – for example, an image can add visual context to a text description, enabling more accurate predictions.
Some modalities lack data – when one type of data is scarce, information from another modality can help fill the gap through knowledge transfer.
There are two fundamental processes critical to multimodal data quality that remain challenging in multimodal learning:
Alignment ensures that data from different sources are synchronized and correctly matched – for example, a caption describes the right image, or a video frame corresponds to the correct text description. Alignment helps maximize the utility of whatever data is available. Once achieved, the next step is →
Fusion combines the aligned pieces of information into a single, complete prediction. Fusion creates a unified representation that captures more detail than any single modality alone. It determines when and how the data is mixed.
When alignment and fusion work together, they unlock powerful multimodal applications: image-text matching, video understanding, multimodal QA, cross-modal retrieval (like using text to find an image), combining facial expressions with voice tone for emotion recognition, and more.
Today we'll focus on the more complex part – fusion – which enables richer interaction between modalities. So let's explore when and how it occurs.
Main types of multimodal fusion
When does the data combination happen?
This classification pinpoints the timing:
Early fusion merges modalities at the feature extraction stage, right at the start. It excels at capturing interactions early but can be sensitive to noise.
Late fusion combines predictions from separate models at the end and proves especially useful when some modalities may be missing.
Hybrid fusion blends early and late fusion approaches.
In reality, these categories blur together – actual fusion often happens at multiple levels simultaneously, making how fusion occurs the more critical question.
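To make the timing distinction concrete, here is a minimal sketch of early versus late fusion. Everything in it is an illustrative assumption: the "models" are hand-set linear scorers followed by a sigmoid, and the function names are hypothetical rather than taken from any library.

```python
import math

def linear(weights, bias, features):
    """Dot product plus bias: a stand-in for a trained model (assumption)."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def early_fusion(image_feats, text_feats, weights, bias):
    # Concatenate the modality features first, then run ONE joint model.
    joint = image_feats + text_feats
    return sigmoid(linear(weights, bias, joint))

def late_fusion(image_feats, text_feats, image_model, text_model):
    # Run one model per modality, then average whichever predictions exist.
    preds = []
    if image_feats is not None:
        preds.append(sigmoid(linear(*image_model, image_feats)))
    if text_feats is not None:
        preds.append(sigmoid(linear(*text_model, text_feats)))
    return sum(preds) / len(preds)

image_feats, text_feats = [1.0, 0.0], [0.5]
early = early_fusion(image_feats, text_feats, weights=[0.2, 0.1, 0.4], bias=0.0)

image_model = ([0.3, -0.1], 0.0)   # (weights, bias), hand-set for illustration
text_model = ([0.8], -0.2)
late = late_fusion(image_feats, text_feats, image_model, text_model)
late_without_text = late_fusion(image_feats, None, image_model, text_model)
```

The late-fusion function still returns a prediction when the text input is `None`, which is the missing-modality robustness described above. The early-fusion model has no such fallback, because its single weight vector expects the full concatenated input.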
How does the multimodal fusion happen?
One of the most common architectures for AI models is the encoder-decoder, where an encoder extracts meaningful features from the input and a decoder transforms them into the desired output. Multimodal fusion can happen at different stages of this pipeline:
Data-level fusion: All raw data from different modalities is combined before being fed into a shared encoder. For example, in autonomous driving we need to combine raw camera images with LiDAR point clouds for better accuracy.
Feature-level fusion: First, each modality is encoded separately with its own specialized encoder. Then features from different layers (early, middle, late) are fused together, which gives a mixture of fine-grained and abstract information.
Model-level fusion: Each modality has its own complete model, and fusion happens only at the output level, after the decoding stage.

Image Credit: Multimodal Alignment and Fusion: A Survey
There are some broader, more advanced ways to mix modalities, because in practice models need a few tricks to make it work properly. These methods include:
Attention-based fusion: Uses attention mechanisms (queries, keys, and values) to control which parts of each modality the model should focus on. The model dynamically weighs and selects the most relevant cross-modal features.
Here, each modality is encoded separately, a connector or adapter (MLP, attention module, Q-Former) maps the features into a shared space, and then a Transformer or LLM fuses them, commonly with multi-head attention.
Some of the models that use this type of fusion: CLIP, ViLT, Qwen-VL, VILA, BLIP.

Image Credit: Multimodal Alignment and Fusion: A Survey
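The query/key/value mechanics of attention-based fusion can be sketched in a few lines. This is a simplified illustration under stated assumptions: it uses a single head, skips the learned query, key, and value projections that real models apply, and assumes both modalities were already mapped into a shared 2-dimensional space. The toy vectors and the name `cross_attention` are made up for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query (here a text token) takes a weighted average of the values
    (here image patches), weighted by scaled query-key similarity."""
    scale = math.sqrt(len(keys[0]))
    fused, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        fused.append(mixed)
        all_weights.append(weights)
    return fused, all_weights

# Toy inputs, already in a shared space (assumption).
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_attention(text_tokens, image_patches, image_patches)
```

Each row of `weights` sums to 1 and shows how strongly one text token attends to each image patch. The first text token puts most of its weight on the first patch, the one most similar to it.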
Graphical fusion: Represents multimodal data as graphs.
Many choose graphs mainly because they capture structured relationships between modalities:
Nodes = data elements (words, pixels, samples)
Edges = relationships (spatial, temporal, semantic)
Traditional fusion methods often used linear combinations, but modern approaches also apply non-linear graph operators that can model much deeper and multi-stage cross-modal interactions, create modality-invariant embeddings and work well in cases where some data is missing.
Prime example: Alzheimer’s disease diagnosis, which integrates MRI and PET scans using heterogeneous graphs to capture complex brain connectivity. Another example is recommendation systems, where graphs combine text, images, and user interactions to improve predictions.
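As a rough illustration of the nodes-and-edges idea, the sketch below runs one round of the simplest possible graph operator, mean aggregation over neighbours. It is the linear baseline, not one of the non-linear operators mentioned above, which would add learned weights and activations. The node names, feature values, and the assumption that all features already share one dimension are hypothetical.

```python
def message_passing(node_feats, edges):
    """One round of mean aggregation: each node averages its own features
    with those of its neighbours, regardless of modality."""
    neighbours = {n: [n] for n in node_feats}
    for a, b in edges:  # edges are treated as undirected
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = {}
    for n, nbrs in neighbours.items():
        dim = len(node_feats[n])
        updated[n] = [sum(node_feats[m][d] for m in nbrs) / len(nbrs)
                      for d in range(dim)]
    return updated

# Hypothetical nodes from two modalities, with features of equal size.
node_feats = {
    "mri_region_1": [1.0, 0.0],
    "pet_region_1": [0.0, 1.0],
    "pet_region_2": [0.0, 3.0],
}
edges = [("mri_region_1", "pet_region_1"), ("pet_region_1", "pet_region_2")]
fused = message_passing(node_feats, edges)
```

After one round, the MRI node's representation already contains information from the PET node it is connected to. Stacking more rounds spreads information across longer paths in the graph.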
Kernel-based fusion: These methods also work well with non-linear relationships between modalities, and here is why.
A kernel, in machine learning, is a function that measures similarity between two data points. The idea of kernel-based fusion is not to compare data in its original space, but to map it into a higher-dimensional feature space via a kernel, where relationships become easier to model. In general, kernel fusion improves semantic matching between images, text, and metadata.
Where can it be used? Kernel methods with optimized bandwidth handle noise better than classical fusion techniques, so they are used in audio–visual voice activity detection and in Kernel Cross-Modal Factor Analysis (KCFA), which learns joint transformations to align and fuse audio and visual features. Kernel fusion also works well in drug discovery, where various chemical structures, protein sequences, and biological assays need to be integrated to improve accuracy.
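A minimal sketch of one common form of kernel fusion: compute one similarity kernel per modality and combine them with a weighted sum. The Gaussian (RBF) kernel, the bandwidth values, the equal weights, and the toy audio and visual features are all assumptions chosen for illustration. The combination itself is safe, because a non-negative weighted sum of valid kernels is again a valid kernel.

```python
import math

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) similarity: 1.0 for identical points, falling toward 0
    as the points move apart. The bandwidth controls how fast it falls."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def fused_kernel(sample_a, sample_b, bandwidths, weights):
    """Weighted sum of one RBF kernel per modality."""
    return sum(
        weights[m] * rbf_kernel(sample_a[m], sample_b[m], bandwidths[m])
        for m in sample_a
    )

# Two hypothetical clips, each with audio and visual features of different sizes.
clip_1 = {"audio": [0.1, 0.9], "visual": [1.0, 0.0, 0.5]}
clip_2 = {"audio": [0.2, 0.8], "visual": [0.9, 0.1, 0.4]}
bandwidths = {"audio": 0.5, "visual": 1.0}
weights = {"audio": 0.5, "visual": 0.5}

similarity = fused_kernel(clip_1, clip_2, bandwidths, weights)
```

Each modality keeps its own feature size and its own bandwidth, so the modalities never need to share a raw feature space. They only meet in the final similarity score, which a kernel method such as an SVM can consume directly.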
These are the key fundamental methods used to fuse modalities, but there are many more customized approaches that researchers develop for high-quality data fusion.
Our next point is about constructing the model that can handle multimodality.
How to build a multimodal model? The key architectures
Generally, most multimodal model architectures fall into three common categories. Let's examine them using image-text fusion as an example for clarity.

Image Credit: Multimodal Alignment and Fusion: A Survey
Two-tower (e.g., CLIP, SigLIP models) – the simplest architecture, where images and text are processed separately, then combined using simple operations. Since the two modalities barely interact, the fusion is shallow and relatively ineffective.
Two-leg – includes a fusion network to combine image and text features/embeddings.
One-tower – the most advanced design, which uses a shared encoder (or an integrated encoder-decoder) to process both modalities together from the start. This naturally projects images and text into a shared representation space. When the relationships between modalities are already well understood, this approach becomes both simple and efficient. Many multimodal models – such as LLaVA, Qwen's vision-language model, and BLIP-2 – fit this architecture.
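To show how little the modalities interact in the two-tower case, here is a sketch of its fusion step. It assumes the two encoders have already produced embeddings of the same size (the vectors below are made up) and uses cosine similarity as the "simple operation" that combines them.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def two_tower_scores(image_embeddings, text_embeddings):
    """Cosine similarity between every image and every text embedding.
    The towers never see each other's tokens; they meet only at this dot product."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in texts]
            for img in images]

# Hypothetical outputs of an image tower and a text tower.
image_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

scores = two_tower_scores(image_embeddings, text_embeddings)
best_caption = [row.index(max(row)) for row in scores]
```

All cross-modal interaction is compressed into one number per image-text pair. That is why this design suits retrieval and matching, and also why its fusion is described as shallow.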
Now we'll explore an advanced multimodal model with an architecture that interestingly echoes the two-leg design. Meet Mixture of States →
What’s new in multimodal fusion? Mixture of States (MoS)
Recently, Meta AI and KAUST introduced a very interesting method that fuses modalities in diffusion models via a token-wise router. It is called Mixture of States (MoS). The key word here is “states” because in MoS, the fusion unit is the state, not the layer or the attention pattern. MoS directly mixes the hidden states from all layers of the encoder, or the understanding tower, as the researchers call it. But why is this something new?
The researchers looked into the roots of the problem and highlighted the core mismatch between text and vision data in diffusion models. Text-to-image diffusion models work by gradually turning noise into an image over many small steps. The model’s visual features change as noise disappears, but the text encoder only gives one fixed embedding of the prompt. So the visual side is dynamic, while the text is usually static.
Several multimodal fusion methods are commonly used to work around this problem, such as:
Cross-attention adds extra attention blocks that project text embeddings, extracted from the final-layer hidden state, into key-value vectors. Text and image tokens interact, but only through those added projection layers. Cross-attention focuses on a single-layer, projection-limited representation of the text.
Self-attention fusion puts text and image tokens into one long sequence and runs joint self-attention over everything. It allows deeper, bidirectional mixing, but becomes very expensive because attention scales quadratically with sequence length. It is also based on final-layer text embeddings.
Mixture-of-Transformers (MoT) connects the text and image models layer by layer by sharing attention modules block-by-block. This gives a good structure but forces both parts – text and visual – to have the same symmetric architecture (which is quite unrealistic): the text and image transformers must have the same hidden size and matching layer counts.

Image Credit: Mixture of States original paper
But the issue is still there: text itself remains static. Meta AI and KAUST researchers made these observations:
Diffusion changes over time, so text guidance should adapt accordingly.
The optimal layer isn't always the final one – useful information is distributed across layers.
Different words need different layers – not all tokens should share the same representation.
MoS offers a way to address these issues.
How does MoS work?
MoS is inspired by a broader trend in AI – dynamic computation, where the model adapts its internal routing depending on the input. Similar methods include Mixture-of-Experts (MoE), where tokens pick which sub-network to use; Mixture-of-Depths (MoD), which decides whether tokens take deeper or shallower computation paths; and Mixture-of-Recursions (MoR), which lets the model reuse layers more times when needed.
MoS borrows ideas from these dynamic networks but applies them between models. The unit being mixed is the state vector at each layer. It adds a learnable router that picks useful text features as the image develops (at each denoising step), and decides how to combine and adapt them based on the current level of noise and the current visual features. This gives MoS access to all the text model’s layers:
early-layer syntactic info
mid-layer relational info
high-layer semantic info
token- and layer-specific signals across the entire encoder
MoS uses a dual-tower architecture:
Understanding tower (U) – in other words, an encoder – processes the context: text for text-to-image generation, or text + image for image editing.
Generation tower (G) – the diffusion transformer that actually generates the image.

Image Credit: Mixture of States original paper
These two towers can have different depths (a different number of layers).
The idea is U extracts contextual features, G uses these features to guide the visual generation, and the key to MoS – a learnable router (R) – governs collaboration between the two towers. It controls which hidden states flow from U to G, at each layer and each timestep.
The router works token by token:
It takes three inputs:
timestep, giving time-dependent information, encoded using sinusoidal embeddings
context embedding, processed by U
the current state of the image (noisy, latent state)
All inputs are projected into the same hidden size and treated as token sequences so that the router can process them.
The output: For each context token, the router predicts a token-wise routing matrix, showing how strongly U’s layer i should influence G’s layer j.
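The sinusoidal timestep encoding listed among the router inputs can be sketched as follows. The source only states that sinusoidal embeddings are used. The specific layout here (sines followed by cosines, frequencies spaced geometrically up to a `max_period` of 10000) follows the common Transformer and diffusion convention and is an assumption, not a detail confirmed for MoS.

```python
import math

def sinusoidal_embedding(timestep, dim, max_period=10000.0):
    """Encode a scalar timestep as `dim` sine/cosine features at
    geometrically spaced frequencies. `dim` is assumed to be even."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return ([math.sin(timestep * f) for f in freqs] +
            [math.cos(timestep * f) for f in freqs])

early_step = sinusoidal_embedding(timestep=10, dim=8)
late_step = sinusoidal_embedding(timestep=900, dim=8)
```

Different timesteps produce different vectors of the same fixed size, which is what lets a router condition on how far denoising has progressed.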

Image Credit: Mixture of States original paper
The router in MoS has a lightweight design, where all input embeddings go through the flow “tokenized → normalized → concatenated into a sequence”. Two transformer blocks with bidirectional self-attention pick up the in-context meaning, and then the router outputs the routing logits for each token.
Softmax normalizes the logits, and the router uses sparse top-k routing to select the best-fitting layers from U.
Then it sends them to the right layers of G.
An ε-greedy exploration strategy improves stability and helps the router learn better routing over time. With probability ε, it chooses k random layers (exploration), and with probability 1−ε, it uses top-k (exploitation).
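The selection step described above (softmax, sparse top-k, ε-greedy) can be sketched for a single context token and a single generation layer. This is an illustrative reading, not the authors' implementation. It assumes the softmax runs over the understanding tower's layers, that the selected weights are used as they are without renormalization, and that the hidden states are already projected to a common width. The logits and state vectors are made-up numbers, and the function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_layers(logits, k, epsilon, rng):
    """Pick k understanding-tower layers for one token and one generation layer."""
    probs = softmax(logits)
    indices = range(len(probs))
    if rng.random() < epsilon:
        chosen = rng.sample(indices, k)                       # exploration
    else:
        ranked = sorted(indices, key=lambda i: probs[i], reverse=True)
        chosen = ranked[:k]                                   # exploitation (top-k)
    return {i: probs[i] for i in chosen}

def mix_states(hidden_states, routing):
    """Weighted sum of the selected layers' hidden states for one token."""
    dim = len(hidden_states[0])
    return [sum(w * hidden_states[i][d] for i, w in routing.items())
            for d in range(dim)]

rng = random.Random(0)
u_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # one token, 4 U layers
logits = [0.1, 2.0, 0.3, 1.5]                                 # router output (made up)

routing = select_layers(logits, k=2, epsilon=0.0, rng=rng)    # pure top-k
mixed = mix_states(u_states, routing)
explored = select_layers(logits, k=2, epsilon=1.0, rng=rng)   # always random
```

With `epsilon=0.0` the router always picks the two highest-scoring layers (indices 1 and 3 here). With `epsilon=1.0` it always samples two layers at random, which is the exploration branch used during training.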
MoS follows the idea of a unified system but for training it chooses a simpler multi-stage path instead of training everything together:
U stays frozen. Freezing the pretrained text/image encoder saves a lot of compute.
Only multimodal parts, G and R, are trained.
This way, MoS avoids common problems of joint training, like imbalanced data, conflicting objectives, and other bottlenecks. As for the general strategy, MoS is trained end-to-end using rectified flow matching – the model just learns the straight-line “velocity” from random noise to the real image instead of predicting noise.
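The straight-line "velocity" idea behind rectified flow matching can be sketched as a training-example builder. The time convention (t = 0 is pure noise, t = 1 is the real image), the mean-squared-error loss, and the three-number "image" are assumptions made for illustration. Papers differ on the direction of time, and the source does not specify it.

```python
import random

def flow_matching_example(image, rng):
    """Build one training example: a point on the straight noise-to-image
    path and the velocity the model should predict at that point."""
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    t = rng.random()                                   # time in [0, 1)
    x_t = [(1.0 - t) * n + t * x for n, x in zip(noise, image)]
    velocity = [x - n for n, x in zip(noise, image)]   # constant along the path
    return x_t, t, velocity

def mse(predicted, target):
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)

rng = random.Random(0)
image = [0.5, -0.2, 0.8]            # stand-in for image latents
x_t, t, velocity = flow_matching_example(image, rng)

# Following the true velocity for the remaining time lands exactly on the image.
reached = [x + (1.0 - t) * v for x, v in zip(x_t, velocity)]
loss_if_perfect = mse(velocity, velocity)
```

During training, a model would receive `x_t`, `t`, and the conditioning signal, and the loss would compare its output with `velocity`. Because the target does not depend on `t`, the path from noise to image is a straight line.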
Implementation and performance boost
There are currently two variants of MoS:
MoS-S (small) uses PLM-8B (8B parameters) as the understanding tower, a 3B generation tower, and a ~100M-parameter router.
MoS-L (large) uses InternVL-14B as the understanding tower, a 5B generation tower, and also a ~100M-parameter router.
The researchers tested MoS on two tasks:
MoS-Image – text-to-image generation. MoS generates images by letting each generation block receive a sparse, weighted mixture of hidden states from the understanding tower.
MoS-Edit – instruction-based image editing. It simply extends the input to the understanding tower to include a reference image.

Image Credit: MoS-Edit, Mixture of States original paper
The results speak for themselves:
Both MoS-S and MoS-L achieve state-of-the-art performance, even though their generation towers have only 3B–5B parameters. They match or beat models that are 4× larger, such as the 12–20B-parameter SANA, Flux, Bagel, and even Qwen-Image. For example, on GenEval, MoS-L scores 0.90, the best overall, versus 0.87 for the 20B Qwen-Image.

Image Credit: Mixture of States original paper
MoS outperforms cross-attention, self-attention, and MoT baselines across all metrics.

Image Credit: Mixture of States original paper
In image editing, MoS-L also gets the best results among open models – 4.33 on ImgEdit and 7.86 on GEdit.
Using reasoning-based captions (self-CoT) boosts WISE scores: MoS-L: 0.54 → 0.65 and MoS-S: 0.47 → 0.55.
Dynamic, token-specific, timestep-aware routing also gives clear gains in FID, CLIP, GenEval, and DPG.
And one more important thing: the router is tiny relative to the towers and adds almost no overhead, only 0.008s per iteration, and MoS has lower end-to-end latency than Qwen-Image and Bagel.

Image Credit: Mixture of States original paper
Here we arrive at the conclusion about the effectiveness of this new approach from Meta AI and KAUST.
Advantages
MoS makes fusion dynamic (it changes with the denoising step), token-specific, and flexible (any text layer can connect to any visual layer) via a learnable and adaptive router.
It uses full hidden states, not just key–value pairs as popular attention mechanisms do. Each layer’s entire representation can be reused.
Compared to MoT, MoS removes rigid architectural symmetry needs. Since MoS works well with asymmetric architectures, text and image models don’t need to match layer-by-layer.
The big encoder runs once, so MoS is more efficient.
It is more powerful. In image generation and editing, it performs as well as or better than models 4× larger (the performance results are proof of this).
However, there are still open questions that introduce limitations for MoS, even compared to MoT.
Not without limitations
MoS currently supports only one-way routing – from understanding tower → generation tower. There is no bidirectional interaction like in MoT, and no early-fusion support.
The model only uses supervised fine-tuning (SFT) after pretraining. MoS outputs are not optimized to match human preferences with CoT-based multimodal alignment, GRPO-style optimization, or RLHF techniques.
Further optimizations such as quantization, distillation, and feature caching have not been explored yet.
Like other diffusion transformers, MoS struggles with tiny details.
And probably, the main one – how and why the router chooses certain layers remains a “black box.”
Overall, it’s a fresh look at how to scale multimodal diffusion.
Conclusion: Why is multimodal fusion important?
Each modality brings something unique for AI models:
images → rich visual detail
text → high-level descriptions
audio → tone, emotion, timing
sensors → depth, structure, geometry
Multimodal fusion is the key to leveraging the strengths of each modality while compensating for the weaknesses of any single one. It's what enables models to capture the world from different perspectives – and it's essential for embodied AI, where data flows from physical interactions as well.
Mastering the fundamentals creates the foundation for building the most advanced systems. Mixture of States (MoS) is a prime example: draw inspiration from various Mixture-of-… methods, add a fresh counterpoint to existing fusion approaches, and you get smart routing for efficient image-text fusion.
The next frontier? Methods for seamless multimodal integration – merging audio, video, and sensor signals into one coherent system with adaptive, scalable fusion that works across any modality combination.
Sources and further reading
Resources from Turing Post

