Transformers are the backbone powering models such as BERT, the GPT series, and ViT. However, their attention mechanism has quadratic complexity with respect to sequence length, which makes long sequences challenging to handle. To tackle this, various token mixers with linear complexity have been developed.
Recently, RNN-based models have gained attention for their efficient training and inference on long sequences, and they have shown promise as backbones for large language models.
Inspired by these capabilities, researchers have explored using Mamba in visual recognition tasks, leading to models like Vision Mamba, VMamba, LocalMamba, and PlainMamba. Despite this, experiments reveal that state space model (SSM)-based models for vision underperform compared to state-of-the-art convolutional and attention-based models.
This recent paper does not focus on designing new visual Mamba models. Instead, it investigates a critical research question: Is Mamba necessary for visual recognition tasks?
Mamba is a deep learning architecture developed by researchers from Carnegie Mellon University and Princeton University, designed to address the limitations of transformer models, especially for long sequences. It uses the Structured State Space sequence (S4) model, combining strengths from continuous-time, recurrent, and convolutional models to efficiently handle long dependencies and irregularly sampled data.
Recently, researchers have adapted Mamba for computer vision tasks, similar to how Vision Transformers (ViT) are used. Vision Mamba (ViM) improves efficiency by utilizing a bidirectional state space model (SSM), addressing the high computational demands of traditional Transformers, especially for high-resolution images.
Mamba enhances the S4 model by introducing a unique selection mechanism that adapts parameters based on input, allowing it to focus on relevant information within sequences. This time-varying framework improves computational efficiency.
Mamba also employs a hardware-aware algorithm for efficient computation on modern hardware like GPUs, optimizing performance and memory usage. The architecture integrates SSM design with MLP blocks, making it suitable for various data types, including language, audio, and genomics.
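To make the selection mechanism more concrete, here is a minimal, illustrative sketch of a selective SSM recurrence in PyTorch. The shapes, projection names, and initialization are simplifying assumptions, and this is a plain Python loop rather than Mamba's optimized hardware-aware scan.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n = 8, 16                       # channel dim and state size (illustrative values)
A = -torch.rand(d, n)              # negative state matrix, as in S4/Mamba-style SSMs
B_proj = nn.Linear(d, n)           # input-dependent B (the "selection" part)
C_proj = nn.Linear(d, n)           # input-dependent C
dt_proj = nn.Linear(d, d)          # input-dependent step size

def selective_ssm_step(x, h):
    """One recurrent step over a single token x of shape (d,)."""
    dt = F.softplus(dt_proj(x))                              # per-channel step size, (d,)
    A_bar = torch.exp(dt.unsqueeze(-1) * A)                  # discretized state matrix, (d, n)
    B, C = B_proj(x), C_proj(x)                              # both (n,), chosen per input token
    h = A_bar * h + dt.unsqueeze(-1) * B * x.unsqueeze(-1)   # fixed-size memory update, (d, n)
    return (h * C).sum(-1), h                                # output token (d,) and new state

h = torch.zeros(d, n)
for x in torch.randn(10, d):       # scan over a toy sequence of 10 tokens
    y, h = selective_ssm_step(x, h)
```

Because the state `h` stays the same size no matter how long the sequence is, each step costs a constant amount of compute and memory, which is what makes this recurrence attractive for long sequences.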
Before we start working with the model, we will clone the repo and install a few necessary packages:
!pip install timm==0.6.11
!git clone https://github.com/yuweihao/MambaOut.git
!pip install gradio
Additionally, we have added a link that you can use to access the notebook, which runs these steps and performs inference with MambaOut. Next, change into the cloned repository:
%cd MambaOut
The cell below will help you run the gradio web app.
!python gradio_demo/app.py
The illustration below explains the mechanism of causal attention and RNN-like models from a memory perspective, where _xi_ represents the input token at the i-th step.
(a) Causal Attention: Stores all previous tokens’ keys (k) and values (v) as memory. The memory is updated by continually adding the current token’s key and value, making it lossless. However, the computational complexity of integrating old memory with current tokens increases as the sequence lengthens. Thus, attention works well with short sequences but struggles with longer ones.
(b) RNN-like Models: Compress previous tokens into a fixed-size hidden state (h) that serves as memory. This fixed size means RNN memory is inherently lossy and can’t match the lossless memory capacity of attention models. Nevertheless, RNN-like models excel in processing long sequences, as the complexity of merging old memory with current input remains constant, regardless of sequence length.
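To illustrate this memory trade-off, here is a toy sketch in PyTorch: the attention branch keeps a growing key/value cache (projections are omitted for brevity), while the RNN branch compresses everything into one fixed-size hidden state. All shapes and weights are illustrative assumptions, not MambaOut code.

```python
import torch

d, T = 64, 1000
tokens = torch.randn(T, d)

# (a) Causal attention: memory is a growing, lossless key/value cache.
k_cache, v_cache = [], []
for x in tokens:
    k_cache.append(x); v_cache.append(x)               # memory grows with sequence length
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    attn = torch.softmax(x @ K.T / d ** 0.5, dim=-1)   # per-step cost grows with len(k_cache)
    y = attn @ V

# (b) RNN-like model: memory is a fixed-size, lossy hidden state.
W_h, W_x = 0.01 * torch.randn(d, d), 0.01 * torch.randn(d, d)
h = torch.zeros(d)
for x in tokens:
    h = torch.tanh(W_h @ h + W_x @ x)                  # constant cost per step, constant memory
    y = h
```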
Mamba is particularly well-suited for tasks that require causal token mixing due to its recurrent properties. Specifically, Mamba excels in tasks with the following characteristics:
1. The task involves processing long sequences.
2. The task requires causal token mixing.
The next question that arises is: do visual recognition tasks really involve very long sequences?
For image classification on ImageNet, the typical input image size is 224x224, resulting in 196 tokens with a patch size of 16x16. This number is much smaller than the thresholds for long-sequence tasks, so ImageNet classification is not considered a long-sequence task.
For object detection and instance segmentation on COCO, the typical image size is 800x1280, and for semantic segmentation on ADE20K (a widely used semantic segmentation dataset with 150 semantic categories, 20,000 training images, and 2,000 validation images), the image size is 512x2048. In both cases, the number of tokens is around 4,000 with a patch size of 16x16. Since 4,000 tokens exceed the threshold for small sequences and are close to the base threshold, both COCO detection and ADE20K segmentation are considered long-sequence tasks.
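The arithmetic behind these token counts is straightforward; the small helper below simply divides each spatial dimension by the patch size:

```python
# Token counts for the image sizes quoted above, with a 16x16 patch size.
def num_tokens(height, width, patch=16):
    return (height // patch) * (width // patch)

print(num_tokens(224, 224))     # ImageNet classification: 14 * 14 = 196 tokens
print(num_tokens(800, 1280))    # COCO detection: 50 * 80 = 4000 tokens
print(num_tokens(512, 2048))    # ADE20K segmentation: 32 * 128 = 4096 tokens
```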
Overall framework of MambaOut
Fig. (a) shows the overall framework of MambaOut for visual recognition: MambaOut follows a hierarchical architecture similar to ResNet. It consists of four stages, each with a different channel dimension, denoted as _Di_. This hierarchical structure allows the model to process visual information at multiple levels of abstraction, enhancing its ability to recognize complex patterns in images.
(b) Architecture of the Gated CNN Block: The Gated CNN block is a component within the MambaOut framework. It differs from the Mamba block in that it does not include the State Space Model (SSM). While both blocks use convolutional neural networks (CNNs) with gating mechanisms to regulate information flow, the absence of SSM in the Gated CNN block means it does not have the same capacity for handling long sequences and temporal dependencies as the Mamba block, which incorporates SSM for these purposes.
The primary difference between the Gated CNN and the Mamba block lies in the presence of the State Space Model (SSM).
In MambaOut, a depthwise convolution with a 7x7 kernel size is used as the token mixer of the Gated CNN, similar to ConvNeXt. Like ResNet, MambaOut is built on a four-stage framework by stacking Gated CNN blocks at each stage, as illustrated in the figure above.
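Below is a simplified sketch of such a Gated CNN block (channels-last input, gate and value branches, 7x7 depthwise convolution as the token mixer). It is written from the description above rather than copied from the repository, so the exact channel splits, expansion ratio, and layer names may differ from the official model code.

```python
import torch
import torch.nn as nn

class GatedCNNBlock(nn.Module):
    def __init__(self, dim, expansion=8/3, kernel_size=7):
        super().__init__()
        hidden = int(dim * expansion)
        self.norm = nn.LayerNorm(dim)
        self.fc_in = nn.Linear(dim, 2 * hidden)               # gate and value branches
        self.conv = nn.Conv2d(hidden, hidden, kernel_size,
                              padding=kernel_size // 2,
                              groups=hidden)                   # 7x7 depthwise token mixer
        self.act = nn.GELU()
        self.fc_out = nn.Linear(hidden, dim)

    def forward(self, x):                                      # x: (B, H, W, C), channels-last
        shortcut = x
        g, v = self.fc_in(self.norm(x)).chunk(2, dim=-1)
        v = self.conv(v.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)  # mix tokens spatially
        return shortcut + self.fc_out(self.act(g) * v)         # gating, projection, residual

x = torch.randn(1, 14, 14, 96)                                 # toy channels-last feature map
print(GatedCNNBlock(96)(x).shape)                              # torch.Size([1, 14, 14, 96])
```

Note that nothing here is recurrent: removing the SSM leaves a purely convolutional, gated block, which is exactly what MambaOut uses to test whether the SSM itself is needed.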
Before we move further, here are the hypotheses regarding the necessity of introducing Mamba for visual recognition.
Hypothesis 1: It is not necessary to introduce SSM for image classification on ImageNet, as this task does not meet Characteristic 1 or Characteristic 2.
Hypothesis 2: It is still worthwhile to further explore the potential of SSM for visual detection and segmentation, since these tasks align with Characteristic 1, despite not fulfilling Characteristic 2.
The Mamba mechanism is best suited for tasks with long sequences and autoregressive characteristics. Mamba therefore shows potential for visual detection and segmentation tasks, which do align with the long-sequence characteristic. The MambaOut models surpass all visual Mamba models on ImageNet, yet still lag behind state-of-the-art visual Mamba models on detection and segmentation.
However, due to computational resource limitations, this paper focuses on verifying the Mamba concept for visual tasks. Future research could further explore Mamba and RNN concepts, as well as the integration of RNN and Transformer for large language models (LLMs) and large multimodal models (LMMs), potentially leading to new advancements in these areas.
Thanks for learning with the DigitalOcean Community. Check out our offerings for compute, storage, networking, and managed databases.