Memory-MambaNav: Enhancing Object-Goal Navigation through Integration of Spatial-Temporal Scanning with State Space Models

Leyuan Sun
Yusuke Yoshiyasu


The proposed Memory-MambaNav reduces GPU memory by up to 67.5% compared with a Transformer-based memory-accumulation approach when computing over 32 historical frames.
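For intuition only, here is a minimal back-of-envelope sketch (not the paper's measurement setup) of why attention-based memory grows quadratically with the number of stored frames while an SSM-style scan grows linearly; the token counts and dimensions below (tokens_per_frame, heads, d_model) are hypothetical placeholders.

# Rough activation-memory comparison: attention vs. linear scan.
# All sizes below are illustrative guesses, not the paper's configuration.

def attention_mem_bytes(T, tokens_per_frame=49, heads=8, bytes_per=4):
    n = T * tokens_per_frame
    return heads * n * n * bytes_per   # stores an n x n attention map per head

def scan_mem_bytes(T, tokens_per_frame=49, d_model=256, bytes_per=4):
    n = T * tokens_per_frame
    return n * d_model * bytes_per     # grows linearly with sequence length

for T in (8, 16, 32):                  # number of historical frames
    print(T, attention_mem_bytes(T) // 2**20, scan_mem_bytes(T) // 2**20)  # MiB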

Abstract

Object-goal navigation (ObjectNav) involves locating a specific target object instance in an unknown environment from a textual command combined with semantic understanding. This requires the embodied agent to have advanced spatial and temporal feature comprehension. While earlier approaches made progress in spatial modeling, they fell short in effectively utilizing episodic temporal memory (e.g., keeping track of explored and unexplored space), because long-horizon memory knowledge is resource-intensive in both storage and training. To address this issue, this paper introduces the Memory-MambaNav model, which employs multiple Mamba-based layers for refined spatial-temporal modeling. Leveraging the Mamba architecture, known for its global receptive field and linear complexity, Memory-MambaNav can efficiently extract and process memory knowledge from accumulated historical observations. To enhance spatial modeling, we introduce the Memory Spatial Difference State Space Model (MSD-SSM) to address the limitations of previous CNN- and Transformer-based models in field of view and computational cost. For temporal modeling, the proposed Memory Temporal Serialization SSM (MTS-SSM) leverages Mamba's temporal scanning capabilities, enhancing the model's temporal understanding and its interaction with bi-temporal features. We also integrate egocentric obstacle-awareness embeddings and memory-based fine-grained rewards into our end-to-end policy training, which improve obstacle understanding and accelerate convergence by fully exploiting memory knowledge. Our experiments on the AI2-THOR dataset confirm the benefits and superior performance of the proposed Memory-MambaNav, demonstrating Mamba's potential in ObjectNav, particularly on long-horizon trajectories.
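To make the temporal-scanning idea concrete, the following is a minimal PyTorch sketch of a linear-time state-space scan over accumulated memory embeddings. It is a simplified stand-in, not the paper's actual MTS-SSM: the class name SimpleMemoryScan and all dimensions are hypothetical, and the real module additionally involves selective (input-dependent) parameters and bi-temporal interaction.

import torch
import torch.nn as nn

class SimpleMemoryScan(nn.Module):
    # Toy linear state-space scan over T historical frame embeddings.
    # A hypothetical stand-in for an MTS-SSM-style temporal module.
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.log_decay = nn.Parameter(torch.full((d_state,), -1.0))  # learnable per-channel decay
        self.inp = nn.Linear(d_model, d_state, bias=False)           # input (B) projection
        self.out = nn.Linear(d_state, d_model, bias=False)           # readout (C) projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_model) -- one embedding per remembered frame
        h = x.new_zeros(x.size(0), self.log_decay.numel())
        decay = torch.exp(self.log_decay)       # contraction factor per step
        ys = []
        for t in range(x.size(1)):              # O(T) scan instead of O(T^2) attention
            h = decay * h + self.inp(x[:, t])   # update fixed-size recurrent state
            ys.append(self.out(h))              # read out fused memory feature
        return torch.stack(ys, dim=1)           # (batch, T, d_model)

# Example: fuse 32 historical frames for a batch of 2 agents.
mem = torch.randn(2, 32, 256)
fused = SimpleMemoryScan(d_model=256)(mem)      # -> shape (2, 32, 256)

Because the recurrent state h is fixed-size, per-episode memory grows linearly with the number of stored frames, which is what enables the GPU-memory savings highlighted above.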


Demonstrations

VTNet, ICLR 2021

IOM, ACM MM 2023

MT, ICLR 2023

Ours, M8-MambaNav

Ours, M24-MambaNav

Kitchen
Bedroom
Living room

Trajectory comparisons

Encoder spatial and decoder temporal attention visualizations

M8-MambaNav, successfully finished in 7 steps.

Top: spatial attention; bottom: temporal attention (the latest observation at the bottom).


Temporal attention in a long trajectory

M24-MambaNav, successfully finished in 50 steps.

  • The latest observations usually receive the highest attention.
  • Frames with few detected objects (e.g., mostly a wall) receive less attention.
  • Once a target-related area has been seen, it receives high attention and is remembered.
  • Occasionally, high attention falls on blank memory grids.


Paper Submission

Leyuan Sun and Yusuke Yoshiyasu
Memory-MambaNav: Enhancing Object-Goal Navigation through Integration of Spatial-Temporal Scanning with State Space Models
Submitted to Elsevier, Image and Vision Computing
2024 Impact Factor: 4.2
(Special issue: Embodied Artificial Intelligence for Robotic Vision and Navigation)



Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.