FOD#46: What is Mamba and can it beat Transformers?

+ new research extending and improving the Mamba architecture and the best curated list of the freshest ML news and papers

Next Week in Turing Post:

  • Wednesday, FMOps series: FMOps Infrastructure Tools

  • Friday: a very interesting interview and list of useful resources about time series.

While everyone is focusing on the hot news about Microsoft's thinly disguised acqui-hire of Inflection AI and the shakeup at Stability AI, we'd like to concentrate on the exciting developments unfolding in the world of model architectures. For the hot news, check out our News from The Usual Suspects © section below, where we offer the best external coverage of the matter.

Now, let's talk about Mamba – a new architecture that rivals the famous Transformer-based models. Mamba's innovations address significant challenges in processing long sequences, a problem that has limited traditional models. 

So what is it? Mamba builds on state-space models (SSMs)*, specifically Structured State Space (S4) models, and brings them into a large language model (LLM) framework. This lets Mamba scale linearly with sequence length, a significant advance over the quadratic scaling of traditional Transformer-based models. Its streamlined architecture is built around selective SSM layers, which improve both efficiency and flexibility. As a result, Mamba processes extremely long sequences efficiently and outperforms earlier SSM-based models. It also benefits from hardware-aware optimizations that make the most of contemporary GPU architectures.

This means you can process much longer sequences without hitting memory or compute bottlenecks. Think about applications like genomic analysis, long-form content generation, and complex multi-modal data processing, all becoming more feasible with Mamba's power.

*State-space models are mathematical frameworks that describe a system's dynamics in terms of its state variables and observations, capturing the evolution and uncertainty of processes over time. SSMs are known for handling long sequences efficiently.
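
To make the footnote concrete, here is a minimal sketch of a discretized linear state-space recurrence in plain Python. It is not Mamba's implementation: it assumes a diagonal state matrix, a fixed step size `delta`, and a single input channel, and the function name `ssm_scan` is hypothetical. Mamba's selective layers also make parameters such as the step size depend on the input, which this sketch leaves out. The point is only that each new element updates a fixed-size state once, so the cost grows linearly with sequence length.

```python
import math


def ssm_scan(u, a, b, c, delta):
    """Run a diagonal linear state-space model over a 1-D input sequence.

    u:     input sequence (list of floats, length L)
    a:     diagonal of the state matrix (nonzero floats, negative for stability)
    b, c:  input and output vectors (same length as a)
    delta: fixed discretization step size (an assumption of this sketch)
    """
    n = len(a)
    # Zero-order-hold discretization of x'(t) = a*x(t) + b*u(t), per state dimension.
    a_bar = [math.exp(delta * a_i) for a_i in a]
    b_bar = [(ab_i - 1.0) / a_i * b_i for ab_i, a_i, b_i in zip(a_bar, a, b)]

    state = [0.0] * n
    outputs = []
    for u_k in u:  # one pass over the sequence: O(L * n) work, O(n) memory
        state = [a_bar[i] * state[i] + b_bar[i] * u_k for i in range(n)]
        outputs.append(sum(c[i] * state[i] for i in range(n)))
    return outputs


if __name__ == "__main__":
    ys = ssm_scan([1.0] * 5, a=[-1.0], b=[1.0], c=[1.0], delta=0.1)
    print(ys)
```

With a constant input of 1.0, `a = [-1.0]`, and `delta = 0.1`, the output after k steps follows 1 - exp(-0.1 k), the step response of the underlying continuous system.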

Mamba's ability to efficiently process long sequences while maintaining competitive performance has fueled research interest in adapting and extending the architecture for various domains. It seems Mamba’s architecture is getting more attention (Attention Is All You Need ;)). Last week, three papers showcased exciting developments.

The paper, EfficientVMamba: Atrous Selective Scan for Light Weight Visual Mamba, makes Mamba more suitable for deployment on resource-constrained devices by introducing an efficient 2D scanning method and a dual-pathway module for balanced global-local feature extraction. Results show a significant reduction in FLOPs while maintaining strong accuracy.

The paper, Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference, extends Mamba to be a multi-modal large language model capable of jointly reasoning over vision and language. Experiments demonstrate competitive performance on vision-language tasks with faster inference speeds compared to Transformer-based models.

The paper, SiMBA: Simplified Mamba-based Architecture for Vision and Multivariate Time series, presents a simplified Mamba-based architecture that addresses stability issues when scaling Mamba to larger sizes. The key innovation is EinFFT, a novel channel mixing technique that ensures stable optimization. SiMBA shows strong results on vision tasks and multivariate time series forecasting, closing the gap with state-of-the-art Transformers.

These three papers highlight the architectural flexibility and potential of the Mamba model, which is promising for future advancements in context window size and data type support. If you want to move beyond Transformers, that might be the way to go. You can find the Mamba repository here: https://github.com/state-spaces/mamba

News from The Usual Suspects ©

Microsoft is hungry

  • Microsoft basically owns OpenAI (“Indeed, as the November 2023 drama was unfolding, Microsoft’s CEO boasted that it would not matter “[i]f OpenAI disappeared tomorrow.” He explained that “[w]e have all the IP rights and all the capability.” “We have the people, we have the compute, we have the data, we have everything.” “We are below them, above them, around them.” – from Elon Musk’s lawsuit against OpenAI).

  • In February, Microsoft invested in Mistral and brought its newest AI model, Mistral Large, to Azure.

  • Last week, Microsoft basically acqui-hired Inflection AI’s team, including two of its co-founders, Mustafa Suleyman and Karén Simonyan. Eric Newcomer points to Microsoft’s creativity in non-acquisition strategies: partnering with and investing in companies like Inflection and OpenAI without formally acquiring them, which sidesteps antitrust reviews. Stratechery explains why many aspects of Inflection and its rapid ascent to unicorn status were odd from the very beginning. Soma Somasegar from Madrona VC thinks that Microsoft is partnering with Inflection.ai to innovate AI for consumers, aiming to transform its presence in the consumer market.

Stability AI evokes many jokes about instability

  • This January, we published a Stability AI profile with the subtitle “Investigating the Thin Line Between Genuine Innovation and Strategic Exaggeration in AI”. On March 23, after too many exaggerations, Emad Mostaque resigned from his position as CEO of Stability AI. According to him, he wants to concentrate on decentralized AI. What exactly that means, no one knows.


Hugging Face

  • announces the release of Common Corpus, the largest public domain dataset designed for training LLMs. Encompassing 500 billion words across multiple languages including English, French, Dutch, Spanish, German, and Italian, it's a multilingual collection sourced from diverse cultural heritage initiatives. This release aims to demonstrate the feasibility of developing open LLMs using copyright-free materials, supported by an international collaboration with organizations dedicated to open science in AI.

Sam Altman is open-sourcing his Orb

  • The Worldcoin Foundation has made the Orb’s software open-source (the Orb is a device that scans irises to create unique digital IDs), aiming to enhance privacy and security in proving humanness online. The software, available under MIT/Apache 2.0 licenses on GitHub, supports World ID verification by processing images locally and ensuring data privacy through secure transfers.

The freshest research papers, categorized for your convenience

Our top

Larimar: Large Language Models with Episodic Memory Control

Researchers from IBM AI Research developed Larimar, a novel architecture for enhancing LLMs with a distributed episodic memory, inspired by the brain's learning mechanisms. Larimar allows dynamic, one-shot knowledge updates without extensive re-training, offering significant speed improvements (4-10x faster, depending on the LLM used) and flexibility. It also introduces mechanisms for selective fact forgetting and input context length generalization, demonstrating its effectiveness across various benchmarks. This approach presents a significant advancement in efficiently updating knowledge within LLMs →read the paper

RewardBench: Evaluating Reward Models for Language Modeling

Researchers from the Allen Institute for Artificial Intelligence, University of Washington, and Berkman Klein Center at Harvard Law introduce RewardBench, a benchmark for evaluating reward models (RMs) essential in aligning pre-trained models to human preferences through reinforcement learning from human feedback (RLHF). Despite the crucial role of RMs, their evaluation has been scarce. RewardBench aims to fill this gap by providing a comprehensive dataset and codebase for RM evaluation, covering chat, reasoning, and safety aspects to assess RM performance on structured and out-of-distribution queries. The findings reveal insights into the propensities, reasoning limitations, and instruction-following shortcomings of various RMs, contributing to a deeper understanding of the RLHF process →read the paper

Moirai: A Time Series Foundation Model for Universal Forecasting

Researchers from Salesforce AI Research developed Moirai, a groundbreaking time series foundation model designed for universal forecasting. Moirai uniquely addresses the challenges of diverse time series forecasting by constructing a large-scale dataset called LOTSA, spanning 27 billion observations across nine domains. It introduces multiple patch size projection layers, an any-variate attention mechanism, and a mixture distribution modeling approach, enabling it to perform zero-shot forecasting across different domains, frequencies, and variables with competitive or superior accuracy compared to specialized models →read the paper

Enhanced Model Training and Adaptation Techniques

  • RAFT: Adapting Language Model to Domain Specific RAG introduces a method to improve models' ability to utilize domain-specific knowledge by training them to discern and utilize relevant documents, enhancing domain-specific question-answering capabilities. Read the paper.

  • Evolutionary Optimization of Model Merging Recipes explores leveraging evolutionary algorithms to merge diverse open-source models efficiently, creating powerful models without extra data or resources, such as a Japanese LLM with math reasoning abilities. Read the paper.

  • PERL: Parameter Efficient Reinforcement Learning from Human Feedback offers a computationally efficient method to align LLMs with human preferences using Low-Rank Adaptation, making reinforcement learning from human feedback more accessible. Read the paper.

  • DiPaCo: Distributed Path Composition proposes a novel architecture for distributed machine learning, enhancing large-scale model training by minimizing communication between devices, promising for robust and efficient learning on heterogeneous networks. Read the paper.

  • LLAMAFACTORY: Unified Efficient Fine-Tuning of 100+ Language Models introduces a user-friendly platform for efficiently fine-tuning a wide range of LLMs, streamlining the adaptation to various tasks with minimal resources. Read the paper.

  • Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models showcases a method that improves LLMs' performance on agent tasks without compromising general abilities, addressing data distribution shifts and learning speed variations. Read the paper.
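
The Low-Rank Adaptation mentioned in the PERL item can be illustrated with a short sketch. This is the general LoRA idea, not PERL's training code: a frozen weight matrix `w` is left untouched, and only two small matrices `a` (rank × input size) and `b` (output size × rank) are trained, with their product added to the frozen layer's output. The function names and the `alpha / rank` scaling convention used here are assumptions for illustration.

```python
def matvec(m, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in m]


def lora_forward(x, w, a, b, alpha):
    """Compute y = W x + (alpha / r) * B (A x) for one input vector.

    w: frozen base weights, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    rank = len(a)
    base = matvec(w, x)               # frozen path
    update = matvec(b, matvec(a, x))  # low-rank trainable path
    scale = alpha / rank
    return [base_i + scale * upd_i for base_i, upd_i in zip(base, update)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 1.0]]    # rank 1
    b = [[0.5], [0.0]]
    print(lora_forward([1.0, 2.0], w, a, b, alpha=2.0))
```

Because only `a` and `b` are updated, the number of trainable parameters is r × (d_in + d_out) instead of d_in × d_out, which is where the compute and memory savings come from.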

Advancements in Multimodal Understanding and Interaction

  • Uni-SMART: Universal Science Multimodal Analysis and Research Transformer significantly enhances the analysis of scientific literature by understanding its multimodal content, outperforming text-focused LLMs in interpreting complex scientific data. Read the paper.

  • Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs develops a method to boost VLMs' reasoning capabilities by transferring skills from LLMs, enhancing the understanding and processing of complex charts without the need for OCR systems. Read the paper.

  • MATHVERSE: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? evaluates MLLMs' ability to understand visual diagrams in math problems, highlighting current limitations and guiding future developments in visual math problem-solving. Read the paper.

Novel Approaches in Language and Video Processing

  • VideoAgent: Long-form Video Understanding with Large Language Model as Agent introduces an innovative method for long-form video understanding, using an LLM as an agent to iteratively compile key information, mimicking human cognitive processes. Read the paper.

  • Recurrent Drafter for Fast Speculative Decoding in Large Language Models presents a novel decoding method to improve LLM inference speed without significant trade-offs, combining draft models and unified strategies for efficient generation. Read the paper.

  • Mora: Enabling Generalist Video Generation via A Multi-Agent Framework advances generalist video generation with a multi-agent framework, indicating a promising direction for future research despite challenges in dynamic object movement tasks. Read the paper.

  • Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition offers a novel video generation model that significantly reduces computational costs by decomposing videos into content and motion representations, improving generation quality and efficiency. Read the paper.

Innovations in Text Mining and Data Processing

  • TnT-LLM: Text Mining at Scale with Large Language Models develops a framework for automating text mining, generating label taxonomies, and classifying texts with minimal manual intervention, enhancing efficiency and accuracy at scale. Read the paper.

  • LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression introduces a data distillation approach for efficient prompt compression, retaining essential details for improved processing speed and accuracy across tasks. Read the paper.

Addressing Model Limitations and Safety

  • When Do We Not Need Larger Vision Models? challenges the default strategy of increasing model size. It introduces a technique that runs a smaller model across multiple image scales and achieves superior performance with fewer parameters, suggesting an efficient alternative to scaling up. Read the paper.

  • Reverse Training to Nurse the Reversal Curse addresses LLMs' difficulty with understanding reversed factual statements by proposing a training strategy that includes both original and reversed textual data, effectively mitigating the "Reversal Curse." Read the paper.

  • Evaluating Frontier Models for Dangerous Capabilities introduces a comprehensive evaluation program for assessing AI systems' "dangerous capabilities," focusing on areas critical to AI safety and security, to understand and mitigate potential risks. Read the paper.

  • Recourse for Reclamation: Chatting with Generative Language Models explores a dynamic thresholding mechanism for GLMs to enhance usability and control in toxicity filtering, offering more nuanced control over model interactions and aligning better with human values. Read the paper.
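
As a rough illustration of the idea in the Reverse Training item above (training on both original and reversed text), here is a minimal data-augmentation sketch. It assumes simple word-level reversal on whitespace-separated text; the helper names are hypothetical, and the paper's actual reversal scheme may differ.

```python
def reverse_words(text):
    """Return the text with its whitespace-separated words in reverse order."""
    return " ".join(reversed(text.split()))


def augment_with_reversals(examples):
    """Return a training list containing each example followed by its reversal."""
    augmented = []
    for text in examples:
        augmented.append(text)                 # original direction
        augmented.append(reverse_words(text))  # reversed direction
    return augmented


if __name__ == "__main__":
    for line in augment_with_reversals(["Paris is the capital of France"]):
        print(line)
```

The augmented set doubles in size, so a model trained on it sees each fact in both directions.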

Pioneering Applications

  • TacticAI: An AI Assistant for Football Tactics leverages geometric deep learning to revolutionize football coaching strategies, generating tactical recommendations for corner kicks and showcasing the potential to significantly transform sports analytics. Read more about TacticAI.

Become a Premium subscriber and expense this subscription through your company. Join hundreds of forward-thinking professionals. Please also send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. 🤍 Thank you for reading
