Topic 2: What is Mamba?

We discuss sequence modeling, the drawbacks of transformers, and what Mamba brings to the table

Introduction

The Mamba architecture, introduced in December 2023, has become a significant development in sequence modeling. We briefly discussed it previously in our weekly digest, asking, "What is Mamba and can it beat Transformers?"

This article, part of our AI 101 series, explores Mamba's origins and the issues it addresses, and goes deeper into why it might be a better alternative to transformer models. With its efficient scaling of context length and linear computational costs, Mamba offers a promising solution for handling large-scale data challenges in AI.

In today’s episode, we will cover:

  • Sequence modeling, transformers, and foundation models

  • What’s wrong with transformers?

  • Here come SSMs and Mamba

  • Why is Mamba getting attention?

  • Conclusion

  • Bonus: Mamba-related research papers worth reading

Sequence modeling, transformers, and foundation models

First, let's set the context for foundation models, transformers, and related concepts. They all belong to sequence modeling, an area of research built around sequential data, that is, data with some inherent order: text, images, speech, audio, time series, genomics, and so on.

In the 1990s and 2000s, sequence modeling advanced significantly due to the rise of neural network architectures and increased computing power. Recurrent neural networks (RNNs), including Long Short-Term Memory (LSTM) networks, became the standard. These models are excellent at handling variable-length sequences and capturing long-term dependencies. However, because they process data sequentially, it is difficult to parallelize operations within a single training example.
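To make that last point concrete, here is a minimal, illustrative sketch (not from the article) of a vanilla RNN step loop in PyTorch. The names, dimensions, and random weights are all assumptions for illustration; the point is that each step depends on the previous hidden state, so the loop cannot be parallelized across time.

```python
import torch

# Illustrative sketch: a vanilla RNN cell walks the sequence one step at a time,
# because h_t depends on h_{t-1}. Dimensions and weights are arbitrary assumptions.
seq_len, d_in, d_hidden = 16, 8, 32
x = torch.randn(seq_len, d_in)        # one training example
W_x = torch.randn(d_in, d_hidden)
W_h = torch.randn(d_hidden, d_hidden)
h = torch.zeros(d_hidden)

outputs = []
for t in range(seq_len):              # inherently sequential: step t needs h from step t-1
    h = torch.tanh(x[t] @ W_x + h @ W_h)
    outputs.append(h)
outputs = torch.stack(outputs)        # (seq_len, d_hidden)
```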

In 2017, the introduction of the transformer architecture marked a significant change in sequence data processing. Transformers use attention mechanisms, eliminating the need for the recurrence or convolution used in earlier models. This design enhances parallelization during training, making transformers more efficient and scalable for handling large datasets.

The self-attention mechanism in transformers computes the relevance of all other parts of the input sequence to a particular part. This is done for all parts of the sequence simultaneously, which is inherently parallelizable. Each attention head can independently calculate the attention scores for all positions in the input sequence, so these operations run concurrently. This differs drastically from RNNs, where outputs are computed one after another, and it results in markedly faster training times and the ability to handle larger datasets more efficiently.
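For contrast with the RNN loop above, here is a minimal, illustrative sketch (again, not from the article) of single-head scaled dot-product self-attention. All names and sizes are assumptions; the key point is that the scores for every pair of positions come out of one matrix product, with no loop over timesteps.

```python
import torch

# Illustrative sketch: self-attention scores every pair of positions at once,
# so all positions are processed in parallel. Sizes and weights are arbitrary.
seq_len, d_model = 16, 32
x = torch.randn(seq_len, d_model)
W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / d_model ** 0.5     # (seq_len, seq_len): every position attends to every other
weights = torch.softmax(scores, dim=-1)
out = weights @ V                     # no per-timestep loop, unlike the RNN sketch above
```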

Okay, so transformers really became the backbone of modern foundation models. As ChatGPT would say, they really propelled their development! But transformers are not the only architecture in sequence modeling, and it turns out they have some problems with long context. So it's time to turn to our main topic: Mamba.

What’s wrong with transformers?

The rest of this article, with detailed explanations and a curated library of relevant resources, is available to our Premium users only →

Thank you for reading! Share this article with three friends and get a 1-month subscription free! 🤍
