Topic 1: What is Mixture-of-Experts (MoE)?

We discuss the origins of MoE, why it is better than a single neural network, Sparsely-Gated MoE, and the sudden hype. Enjoy the collection of helpful links.

Introduction

We've received great feedback on our detailed explorations of Retrieval-Augmented Generation (RAG), the Transformer architecture, and multimodal models in the FMOps series. To keep you updated on the latest AI developments, we're launching a new series called “AI 101.” These subjects often turn into buzzwords with little clear explanation. No worries: we will provide straightforward insights into these complex areas in plain English and help you learn how to implement them.

In this edition, we focus on the Mixture-of-Experts (MoE) model – a fascinating framework that is reshaping how we build and understand scalable AI systems. Several top models use MoE, including Mistral AI's Mixtral, Databricks' DBRX, AI21 Labs' Jamba, xAI's Grok-1, and Snowflake's Arctic. Want to learn more? Dive deep into the history, the original ideas, and the breakthrough innovations of the MoE architecture with us.

In today’s episode, we will cover:

  • History – where it comes from and the initial architecture

  • But why is such a modular architecture better than using a single neural network?

  • MoE and Deep Learning (MoE + Conditional Computation) – what was the key innovation?

  • MoEs + Transformers and sudden hype

  • Conclusion

  • Bonus: Relevant resources to continue learning about MoEs

Where it comes from and the initial architecture

The concept of Mixture-of-Experts dates back to the 1988 Connectionist Summer School in Pittsburgh, where two researchers, Robert Jacobs and Geoffrey Hinton, introduced the idea of training multiple models, referred to as "experts," each on a specific subset of the training data. These subsets would be created by dividing the original dataset by subtask, so each model would specialize in a dedicated subtask. During inference, a "gating" network would determine which expert to use for each case.

The source identification (SID) network architecture from “The Meta-Pi Network: Building Distributed Knowledge Representations for Robust Multisource Pattern Recognition”. The modules are the experts, and their outputs are combined by the "SID combinational superstructure", a gating network.
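
To make the gating idea concrete, here is a minimal sketch in PyTorch. It is not code from any of the papers mentioned; the names (TinyMoE, the layer sizes, the number of experts) are made up for illustration. A small gating network scores the experts for each input, and the highest-scoring expert produces the output.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Illustrative mixture of experts: a gate picks one expert per input."""

    def __init__(self, in_dim=8, out_dim=4, num_experts=3):
        super().__init__()
        # one small "expert" network per subtask (toy linear experts here)
        self.experts = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(num_experts)])
        # the gating network scores the experts for each input
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):
        probs = torch.softmax(self.gate(x), dim=-1)        # (batch, num_experts)
        winner = probs.argmax(dim=-1)                      # hard routing: one expert per example
        outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, out_dim)
        return outputs[torch.arange(x.shape[0]), winner]   # keep only the chosen expert's output

moe = TinyMoE()
print(moe(torch.randn(2, 8)).shape)  # torch.Size([2, 4])
```

In practice, each expert only ever influences the inputs routed to it, which is what lets it specialize on its slice of the data.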

The idea was explored further in early research papers, but those did not fully realize the efficiency Jacobs and Hinton had originally envisioned. In 1991, Jacobs and Hinton, along with Michael Jordan from MIT and Steven Nowlan from the University of Toronto, proposed a refinement in the “Adaptive Mixtures of Local Experts” paper, widely considered the origin of the modern MoE architecture.

They proposed an error function that fosters competition among the experts and promotes true specialization. Combined with the “stochastic one-out-of-n selector” shown on the diagram, it activates a single expert for each input case instead of using a linear combination (weighted sum) of all experts. Because the experts' responses are no longer directly coupled, each module becomes a "local" expert that specializes in its designated subtask, as originally outlined in the paper.

A visual representation of the original MoE architecture proposed in the “Adaptive Mixtures of Local Experts” paper. As you can see, the final step is a stochastic selector that decides which expert is used for each case, in contrast to earlier papers, which used a weighted combination of experts.
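
To see why decoupling matters, here is a minimal sketch in PyTorch contrasting the two error functions. The function names and toy shapes are ours, not the paper's code: the first blends the experts before comparing with the target (so their gradients are coupled), while the second, following the refinement described in the paper, compares each expert to the target on its own and lets the gate weight the per-expert errors.

```python
import torch

def coupled_error(target, expert_outputs, gate_probs):
    # earlier formulation: blend the experts first, then compare to the target
    #   E = || d - sum_i p_i * o_i ||^2   (experts are coupled through the sum)
    blended = (gate_probs.unsqueeze(-1) * expert_outputs).sum(dim=0)
    return ((target - blended) ** 2).sum()

def decoupled_error(target, expert_outputs, gate_probs):
    # refinement from the 1991 paper: compare each expert to the target separately,
    # and let the gate weight the per-expert errors
    #   E = sum_i p_i * || d - o_i ||^2
    per_expert = ((target - expert_outputs) ** 2).sum(dim=-1)
    return (gate_probs * per_expert).sum()

# toy example: 3 experts, a 4-dimensional target
expert_outputs = torch.randn(3, 4)
gate_probs = torch.softmax(torch.randn(3), dim=0)
target = torch.randn(4)
print(coupled_error(target, expert_outputs, gate_probs))
print(decoupled_error(target, expert_outputs, gate_probs))
```

With the decoupled error, each expert is pushed toward the target on the cases where the gate favors it, rather than toward whatever residual is left after the other experts' contributions, which is what drives the competition and specialization described above.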

But why is such a modular architecture better than using a single neural network?

The rest of this article, with detailed explanations and the best library of relevant resources, is available to our Premium users only →

Thank you for reading! Share this article with three friends and get a 1-month subscription free! 🤍
