Topic 1: What is Mixture-of-Experts (MoE)?
We discuss the origins of MoE, why it is better than a single neural network, the Sparsely-Gated MoE, and the sudden hype. Enjoy the collection of helpful links.
Introduction
We've received great feedback on our detailed explorations of Retrieval-Augmented Generation (RAG), the Transformer architecture, and multimodal models in the FMOps series. To keep you updated on the latest AI developments, we're launching a new series called “AI 101.” These subjects often turn into buzzwords with little clear explanation. No worries: we will provide straightforward insights into these complex areas in plain English and help you learn how to implement them.
In this edition, we focus on the Mixture-of-Experts (MoE) model – a fascinating framework that is reshaping how we build and understand scalable AI systems. Several top models utilize MoE, including Mistral's Mixtral, Databricks' DBRX, AI21 Labs' Jamba, xAI's Grok-1, and Snowflake's Arctic. Want to learn more? Dive into the history, the original idea, and the breakthrough innovations of the MoE architecture with us.
In today’s episode, we will cover:
History – where it comes from and the initial architecture
But why is such modular architecture better than using one neural network?
MoE and Deep Learning (MoE + Conditional Computation) - what was the key innovation?
MoEs + Transformers and sudden hype
Conclusion
Bonus: Relevant resources to continue learning about MoEs
Where it comes from and the initial architecture
The concept of Mixture-of-Experts dates back to the 1988 Connectionist Summer School in Pittsburgh, where two researchers, Robert Jacobs and Geoffrey Hinton, introduced the idea of training multiple models, referred to as "experts," each on a specific subset of the training data. These subsets would be created by dividing the original dataset by subtask, so that each model would specialize in a dedicated subtask. A network known as the "gating" network would then determine which expert to use for each case.
The source identification (SID) network architecture from “The Meta-Pi Network: Building Distributed Knowledge Representations for Robust Multisource Pattern Recognition”. The modules are experts, and their outputs are combined by the "SID combinational superstructure," a gating network.
The idea was further explored in early research papers, but those didn't fully realize the efficiency originally envisioned by Jacobs and Hinton. In 1991, Jacobs and Hinton, along with Michael Jordan from MIT and Steven Nowlan from the University of Toronto, proposed a refinement in the paper “Adaptive Mixtures of Local Experts”, which is considered the originator of the modern MoE architecture.
They proposed an error function that fostered competition among the experts, promoting true specialization. This function, labeled the “stochastic one-out-of-n selector” in the diagram, selectively activates a single expert for each input case instead of using a linear combination (weighted sum) of the experts. It prevents the direct coupling of the experts' responses, allowing each to specialize in a distinct subtask. This competitive mechanism makes each expert "local," specializing in its designated subtask as originally outlined in the paper.
A visual representation of the original MoE architecture proposed in the “Adaptive Mixtures of Local Experts” paper. As you can see, the final step is a stochastic selector that decides which expert will be used in each case, in contrast to the previous papers, which proposed using a weighted combination of experts.
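To make this concrete, here is a minimal sketch of the idea in PyTorch: a gating network scores the experts, and each input is handled by a single expert. This is our own toy illustration under simplifying assumptions, using a deterministic argmax instead of the paper's stochastic selector; none of the class or variable names come from the original work.

```python
# A minimal, illustrative MoE: a gating network scores the experts and each
# input is routed to exactly one of them. The 1991 paper used a stochastic
# selector; we use a deterministic argmax here for simplicity.
# All names are our own, not from the paper.
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    def __init__(self, dim_in, dim_out, num_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim_in, dim_out) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim_in, num_experts)   # the "gating" network

    def forward(self, x):                             # x: (batch, dim_in)
        gate_probs = torch.softmax(self.gate(x), dim=-1)
        chosen = gate_probs.argmax(dim=-1)            # one expert per input
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)
        return expert_outs[torch.arange(x.size(0)), chosen]

moe = SimpleMoE(dim_in=16, dim_out=8, num_experts=4)
y = moe(torch.randn(32, 16))                          # -> shape (32, 8)
```

For readability, this toy version still runs every expert on every input; evaluating only the selected experts is exactly what the conditional-computation work discussed below makes practical.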
But why is such modular architecture better than using one neural network?
In 1991, Jacobs, Jordan, and Barto listed these advantages of modular architectures:
Function decomposition: Modular systems break down complex tasks into simpler parts, each handled by a specialized sub-network, enabling faster and more focused learning.
Generalization: Networks can be designed to closely match the tasks they perform, improving generalization from training data to new scenarios.
Interpretability: Modular architectures generate task-specific representations that are easier to understand and analyze.
Efficiency: The distribution of computational tasks across networks allows the efficient handling of complex, high-dimensional spaces.
Localized Processing: Modular systems optimize computational efficiency and mimic biological processing advantages.
MoE and Deep Learning (MoE + Conditional Computation) – what was the key innovation?
In 2017, the authors of “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” (among them Noam Shazeer, Quoc Le, Geoffrey Hinton, and Jeff Dean) applied the MoE concept to the NLP setting, achieving greater than 1000x improvements in model capacity* with only minor losses in computational efficiency.
*Model capacity in machine learning refers to the ability of a model to capture various patterns and complexities in the data. It's about how much information a model can learn from the training data.
When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy.
MoE layer from “Outrageously Large Neural Networks”
The key innovation of their work is the Sparsely-Gated Mixture-of-Experts (MoE) layer. To briefly walk you through the importance of this development, we need to focus on the concept of conditional computation. It was proposed in 2013 in two papers as a way to increase model capacity without a proportional increase in computational costs:
“Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation” by Yoshua Bengio, Nicholas Leonard, and Aaron Courville
“Low-Rank Approximations for Conditional Feedforward Computation in Deep Neural Networks” by Andrew Davis and Itamar Arel
However, the concept remained largely theoretical until the publication of the paper we're discussing, “Outrageously Large Neural Networks”, which used MoE as the key to realizing the promise of conditional computation.
If you want a highly technical review of the concept, we recommend reading this article from Hugging Face engineers. Here, we list the innovations of the paper compared to the original MoE:
Sparsity in gating (conditional computation): The sparsely-gated MoE uses a gating mechanism that activates only a small subset of the experts, so only a few experts need to be evaluated per input, making the model far more computationally efficient (see the code sketch after this list).
Scale of experts: While traditional MoEs were used with a relatively small number of experts, the sparsely-gated MoE scales this concept to thousands or even tens of thousands of experts, allowing for an unprecedented scale of model capacity without a linear increase in computational demands.
Integration with deep learning architectures: The sparsely-gated MoE is specifically designed to be integrated with modern deep learning architectures. It is often embedded within larger neural networks, for example, between layers of a deep LSTM network.
Efficiency and Scalability: The design of the sparsely-gated MoE includes specific innovations to address computational efficiency and scalability, such as managing batch sizes and distributing computation across multiple GPUs, both crucial for handling networks at this scale.
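As a rough sketch of the first two points, here is what sparse top-k gating can look like: the gate scores all experts, but only the k highest-scoring ones are evaluated per token, and their outputs are combined with weights renormalized over the top-k. This omits the paper's noisy gating and load-balancing auxiliary loss, and all names below are our own illustration, not the paper's.

```python
# A rough sketch of sparse top-k gating in the spirit of the Sparsely-Gated
# MoE layer: only the k highest-scoring experts are evaluated per token.
# Noisy gating and the load-balancing loss are omitted for brevity.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim, num_experts, k=2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x):                               # x: (tokens, dim)
        logits = self.gate(x)
        topk_vals, topk_idx = logits.topk(self.k, dim=-1)
        weights = torch.softmax(topk_vals, dim=-1)      # renormalize over the top-k only
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e in idx.unique().tolist():             # run each chosen expert once
                mask = idx == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out
```

The key property is that the cost per token depends on k, not on the total number of experts, which is what lets the expert count grow to the thousands without a proportional increase in computation.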
MoEs + Transformers and sudden hype
“Outrageously Large Neural Networks” was pivotal in demonstrating the effectiveness of MoE layers in scaling model capacity through sparsity. Google's experiments with scaling Transformers using MoE (all of which included Noam Shazeer) pushed the boundaries further:
With GShard in 2020, they scaled a model beyond 600B parameters.
With the Switch Transformer in 2021, they reached 1.6T parameters, showcasing the potential of MoE for handling extremely large-scale tasks.
With GLaM in 2022, they created a family of language models, the largest of which has 1.2T parameters, approximately 7x more than GPT-3. It reportedly consumed only 1/3 of the energy used to train GPT-3 and requires half the compute FLOPs for inference, while still achieving better overall zero-, one-, and few-shot performance across 29 NLP tasks.
MoE can be integrated into the Transformer architecture, typically at the feed-forward layers as shown in the diagram of the Switch Transformer encoder block
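Below is a hedged sketch of how such an MoE layer can sit inside a Transformer block in place of the usual dense feed-forward sub-layer, with top-1 (Switch-style) routing. It is illustrative only; the module and argument names are ours and are not taken from GShard, Switch Transformer, or GLaM.

```python
# A hedged sketch of MoE inside a Transformer block: the MoE layer replaces
# the dense feed-forward sub-layer, and each token is routed to a single
# expert (top-1, Switch-style). Names are ours, not from any cited codebase.
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, num_experts):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):                               # x: (batch, seq, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)                                # MoE replaces the dense FFN here
        flat = h.reshape(-1, h.size(-1))                 # route token by token
        probs = torch.softmax(self.gate(flat), dim=-1)
        top_p, top_e = probs.max(dim=-1)                 # top-1: one expert per token
        out = torch.zeros_like(flat)
        for e in top_e.unique().tolist():
            mask = top_e == e
            out[mask] = top_p[mask].unsqueeze(-1) * self.experts[e](flat[mask])
        return x + out.reshape_as(h)
```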
For some time, the MoE buzz was drowned out by other developments in the generative AI field. But it remained an object of study for researchers while the public was fascinated by other shiny trinkets, as always happens.
That was the case until June 2023, when George Hotz, security hacker and entrepreneur, leaked that GPT-4 is not a monolithic model but a mixture of 8 experts, each with 220B parameters. Soumith Chintala, co-founder of PyTorch at Meta, and Mikhail Parakhin, Microsoft's Bing AI lead, also wrote about it on Twitter.
Whether this was a hint for Mistral AI or just a coincidence, we don't know. But about half a year later, they created a resurgence of interest in the Mixture-of-Experts (MoE) architecture by boldly releasing Mixtral 8x7B at the end of 2023, making the model available through a magnet link, a distribution method commonly used in peer-to-peer (P2P) file-sharing networks.
This release sparked a trend, with other companies introducing their own MoE-based models, such as Databricks' DBRX, AI21 Labs' Jamba, xAI's Grok-1, and Snowflake's Arctic.
Notably, all these models are open, reflecting another emerging, though not universal, tendency among for-profit companies.
Conclusion
The concept of Mixture-of-Experts has given us many fascinating models, and we believe there are more to come. If you want to dive into the technical details and learn about the improvements still to be made in the MoE space, we recommend the resources collected below.
Thank you for reading, please feel free to share this article with your friends 🤍
Bonus: Resources
Great reads:
Mixture of Experts Explained by Hugging Face
Mixture of Experts: How an Ensemble of AI Models Act as One by Deepgram
Tutorials:
makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch on Hugging Face blog
Integrate Mixture-of-Experts Into Your Model by Colossal-AI
MoE implementation in Tensor2Tensor (T2T), an open-source system for training deep learning models in TensorFlow
Research papers:
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity