
Mamba: Building More Powerful and Efficient Foundation Models without Transformers

12/24/23

Editorial team at Bits with Brains

Foundation models have emerged as an effective paradigm for building general-purpose AI capabilities that can be adapted to many tasks.

These large deep learning models are typically pretrained on vast amounts of data in an unsupervised manner before being fine-tuned for downstream applications.


The dominant foundation model architecture powering most applications today is the Transformer, whose self-attention mechanism enables rich context modeling across entire input sequences.


While highly effective, the Transformer's attention comes with a computational cost: self-attention requires time and memory that grow quadratically with the input sequence length, which limits how well it handles very long sequences. This has spurred research into more computationally efficient alternatives to attention that can still capture long-range dependencies.
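To make that scaling concern concrete, here is a minimal NumPy sketch of scaled dot-product attention (an illustration, not any particular library's implementation). The intermediate score matrix has shape (n, n), so both compute and memory grow quadratically with sequence length n.

```python
import numpy as np

def attention(q, k, v):
    # q, k, v: (n, d) for a sequence of length n.
    # The score matrix is (n, n), so time and memory grow as O(n^2).
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                      # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ v                                 # (n, d)

n, d = 4096, 64
q = k = v = np.random.randn(n, d).astype(np.float32)
out = attention(q, k, v)   # the (n, n) weights alone use n*n*4 bytes, roughly 64 MiB here
```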

One promising class of models is structured state space models (SSMs). SSMs map input sequences to output sequences using recurrence or convolution operations that scale linearly with sequence length. However, SSMs have faced their own limitations, primarily an inability to selectively filter irrelevant parts of input sequences when modeling discrete, information-dense data like text.
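In recurrent form, a discretized linear state space model carries a fixed-size hidden state forward one step at a time, so computation grows linearly with sequence length. The following is a minimal sketch of that generic recurrence, with illustrative parameter shapes rather than the specific S4/Mamba parameterization:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    # x: (L,) scalar input sequence; A: (N, N); B: (N,); C: (N,).
    # The hidden state h has a fixed size N, so the loop costs O(L) time
    # and O(N) memory for the state, versus O(L^2) for self-attention.
    h = np.zeros(A.shape[0])
    y = np.empty_like(x)
    for t in range(len(x)):
        h = A @ h + B * x[t]   # state update (recurrence)
        y[t] = C @ h           # readout
    return y

L, N = 1000, 8
A = 0.9 * np.eye(N)            # arbitrary stable transition matrix
B = np.ones(N)
C = np.ones(N) / N
y = ssm_scan(np.random.randn(L), A, B, C)
```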


Researchers Albert Gu (Carnegie Mellon University) and Tri Dao (Princeton University) recently introduced a technique they call "selective state spaces" to address these weaknesses. The core assertion is that allowing SSMs to selectively focus on relevant parts of input sequences, based on their content, can unlock their full modeling power while retaining linear scalability.
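A rough sketch of the selection idea follows, with simplified shapes and illustrative parameter names rather than the paper's exact formulation: the step size and the B and C matrices are computed from the current input, so the model can effectively skip some tokens (small step, the state barely changes) and store others (large step, the state is overwritten).

```python
import numpy as np

def selective_ssm(x, A, W_B, W_C, W_dt):
    # x: (L, D) inputs; A: (D, N) fixed negative decay parameters;
    # W_B, W_C: (D, N) projections giving input-dependent B_t and C_t;
    # W_dt: (D,) projection giving an input-dependent step size
    # (a single step size shared across channels, for simplicity).
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                        # one length-N state per channel
    y = np.zeros((L, D))
    for t in range(L):
        dt = np.log1p(np.exp(x[t] @ W_dt))      # softplus -> positive step size
        B_t = x[t] @ W_B                        # (N,) input-dependent
        C_t = x[t] @ W_C                        # (N,) input-dependent
        A_bar = np.exp(dt * A)                  # (D, N) discretized transition
        h = A_bar * h + dt * np.outer(x[t], B_t)  # write current input into state
        y[t] = h @ C_t                          # read the state out per channel
    return y

# Toy usage with random parameters (illustrative only).
L, D, N = 256, 16, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((L, D))
A = -np.abs(rng.standard_normal((D, N)))        # negative => decaying state
y = selective_ssm(x, A,
                  0.1 * rng.standard_normal((D, N)),
                  0.1 * rng.standard_normal((D, N)),
                  0.1 * rng.standard_normal(D))
```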


Gu and Dao incorporated these selective SSMs into a simplified neural network architecture called Mamba. When evaluated on synthetic tasks that require selection, such as copying randomly spaced tokens, and on real-data benchmarks in language, audio, and genomics modeling, Mamba exceeded prior SSMs and even matched strong Transformer-based systems, all while retaining linear-time complexity.
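The authors have also released code (the state-spaces/mamba repository, installable as the mamba-ssm Python package). A minimal usage sketch is below, assuming that package and a CUDA GPU are available; the argument names follow the repository's README at the time of writing and may differ across versions.

```python
# Assumes: pip install mamba-ssm (and its causal-conv1d dependency), plus a CUDA GPU.
import torch
from mamba_ssm import Mamba

batch, length, dim = 2, 1024, 256
x = torch.randn(batch, length, dim, device="cuda")

block = Mamba(
    d_model=dim,   # embedding dimension
    d_state=16,    # SSM state size N
    d_conv=4,      # width of the local convolution
    expand=2,      # block expansion factor
).to("cuda")

y = block(x)       # output has the same shape as the input: (batch, length, dim)
assert y.shape == x.shape
```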


The researchers' empirical results demonstrate that Mamba can serve as a computationally efficient and scalable alternative to attention for building foundation models.


For technology leaders, this work offers an encouraging data point that attention may not be the only viable approach. It points to options for optimizing large AI systems for speed and memory usage without sacrificing quality, core concerns when deploying deep learning to diverse real-world applications at scale.


Sources:

https://arxiv.org/abs/2312.00752

