HippoMM is a multimodal memory system designed for long audiovisual event understanding, drawing inspiration from the biological processes of hippocampal memory formation and retrieval. It integrates visual, audio, and text modalities to create, consolidate, and query memories in a way that mirrors human cognitive functions.
The system aims to address the challenge of comprehending extended audiovisual experiences by implementing computational mechanisms analogous to hippocampal pattern separation, pattern completion, memory consolidation, and cross-modal associative retrieval.
Active Maintainers: Yueqian Lin, Yudong Liu
HippoMM implements a biologically-inspired multimodal memory system with four main processing stages:
- **Temporal Pattern Separation**: Segments continuous audiovisual streams into meaningful units based on visual changes and audio silence detection. This mimics how the hippocampus separates experiences into distinct episodes.
- **Perceptual Encoding**: Transforms raw inputs into rich multimodal representations using cross-modal features from video frames and audio segments.
- **Memory Consolidation**: Filters redundant segments through similarity-based analysis, preserving only distinct memory events to optimize storage (see the sketch after this list).
- **Semantic Replay**: Generates abstract summaries of consolidated memories using Vision-Language Models, transforming detailed perceptual information into semantic representations called `ThetaEvents`.
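The consolidation step can be pictured as a simple similarity filter. The sketch below is illustrative only (it is not HippoMM's actual implementation) and assumes each segment is represented by a single fused embedding vector; segments whose cosine similarity to the last retained segment exceeds a threshold are dropped.

```python
import numpy as np

def consolidate(segment_embeddings, threshold=0.9):
    """Illustrative similarity-based consolidation (not HippoMM's actual code).

    segment_embeddings: list of 1-D numpy arrays, one fused embedding per segment.
    A segment is kept only if it is sufficiently different from the last kept one.
    """
    kept, last = [], None
    for idx, emb in enumerate(segment_embeddings):
        if last is None:
            kept.append(idx)
            last = emb
            continue
        cos = float(np.dot(emb, last) / (np.linalg.norm(emb) * np.linalg.norm(last) + 1e-8))
        if cos < threshold:  # distinct enough -> treated as a new memory event
            kept.append(idx)
            last = emb
    return kept

# Example: four "segment embeddings", the second being a near-duplicate of the first
rng = np.random.default_rng(0)
a = rng.normal(size=128)
segments = [a, a + 0.01 * rng.normal(size=128), rng.normal(size=128), rng.normal(size=128)]
print(consolidate(segments))  # [0, 2, 3] -- the near-duplicate is filtered out
```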
The system handles queries through a dual-pathway memory retrieval process:
- **Fast Retrieval**: Quickly checks semantic summaries for the requested information
- **Detailed Recall**: Performs deeper analysis when fast retrieval is not sufficient, including feature search, semantic search, and temporal window localization (see the sketch after this list)
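The fallback logic can be sketched as follows. This is a toy illustration, not HippoMM's API: `ThetaEvent`, `fast_retrieve`, and `detailed_recall` are stand-ins, and the keyword-overlap scoring exists only to make the example self-contained.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ThetaEvent:
    summary: str          # semantic summary produced during Semantic Replay
    detailed_notes: str   # stand-in for the richer perceptual record

@dataclass
class Answer:
    text: str
    confidence: float

def fast_retrieve(question: str, events: List[ThetaEvent]) -> Optional[Answer]:
    """Fast path: answer from semantic summaries only (toy keyword-overlap scoring)."""
    q_words = set(question.lower().split())
    best, best_score = None, 0.0
    for ev in events:
        score = len(q_words & set(ev.summary.lower().split())) / max(len(q_words), 1)
        if score > best_score:
            best, best_score = ev, score
    return Answer(best.summary, best_score) if best else None

def detailed_recall(question: str, events: List[ThetaEvent]) -> Answer:
    """Slow path: stand-in for feature search, semantic search, and temporal localization."""
    notes = " ".join(ev.detailed_notes for ev in events)
    return Answer(f"(detailed answer derived from: {notes[:80]}...)", 1.0)

def answer_query(question: str, events: List[ThetaEvent], threshold: float = 0.5) -> str:
    """Dual-pathway retrieval: try the fast path, fall back to detailed recall."""
    fast = fast_retrieve(question, events)
    if fast is not None and fast.confidence >= threshold:
        return fast.text
    return detailed_recall(question, events).text
```

In HippoMM itself, the fast path consults the `ThetaEvents` summaries, while the slow path runs feature search, semantic search, and temporal window localization over the stored memories.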
- Install ImageBind (Prerequisite):

  ```bash
  mkdir -p pretrained
  cd pretrained
  git clone https://github.com/facebookresearch/ImageBind.git
  cd ImageBind
  pip install .
  cd ..
  ```

  Download the pretrained model:

  ```bash
  mkdir -p imagebind
  wget https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth -O imagebind/imagebind_huge.pth
  cd ../..
  ```
- Clone the repository:

  ```bash
  git clone https://github.com/linyueqian/HippoMM.git
  cd HippoMM
  ```
- Install dependencies. It is recommended to use a virtual environment (e.g., `conda` or `venv`):

  ```bash
  # Using conda (example)
  # conda create -n hippomm python=3.9
  # conda activate hippomm
  pip install -r requirements.txt
  ```
- Configure the system:
  - Copy the default configuration:

    ```bash
    cp config/default_config.yaml config/config.yaml
    ```

  - Edit `config/config.yaml` and update it with (a quick sanity-check sketch follows these steps):
    - Paths to your foundation models (ImageBind, Whisper, QwenVL).
    - API keys and base URLs for any external services (e.g., OpenAI API for reasoning or VLM endpoints).
    - Desired storage directories (`storage.base_dir`, etc.).
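After editing, a small script like the one below can confirm that the keys mentioned above are present and that the storage directory resolves. It is only a convenience sketch: the key paths checked are the ones this README names, so adjust them to whatever `config/default_config.yaml` actually defines.

```python
# Illustrative sanity check for config/config.yaml (not part of HippoMM itself).
from pathlib import Path
import yaml  # pip install pyyaml

cfg = yaml.safe_load(Path("config/config.yaml").read_text())

def get(d, dotted):
    """Fetch a nested key like 'api.qwen.base_url', returning None if it is missing."""
    for part in dotted.split("."):
        if not isinstance(d, dict) or part not in d:
            return None
        d = d[part]
    return d

for key in ("storage.base_dir", "api.qwen.base_url", "api.frame_processing.base_urls"):
    value = get(cfg, key)
    print(f"{key}: {'MISSING' if value is None else value}")

base_dir = get(cfg, "storage.base_dir")
if base_dir and not Path(base_dir).exists():
    print(f"Note: storage.base_dir does not exist yet: {base_dir}")
```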
HippoMM requires several model serving endpoints to be running for full functionality. These can be set up using either SGLang or vLLM:
The system uses Qwen2.5-VL-7B-Instruct for visual-language processing. You can serve it using either:
- Using vLLM (Recommended):

  ```bash
  python -m vllm.entrypoints.api_server \
      --model pretrained/Qwen/Qwen2.5-VL-7B-Instruct \
      --host localhost \
      --port 8000
  ```

- Using SGLang:

  ```bash
  sglang serve \
      --model pretrained/Qwen/Qwen2.5-VL-7B-Instruct \
      --port 8000
  ```
Update `api.qwen.base_url` in your config file to point to the serving endpoint (default: `http://localhost:8000/v1`).
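Because the default `base_url` ends in `/v1`, the endpoint is presumably OpenAI-compatible; if so, a quick smoke test with the `openai` client might look like this (the model name and prompt are illustrative and must match what your server registered):

```python
# Quick smoke test against a locally served Qwen2.5-VL endpoint.
# Assumes an OpenAI-compatible server at api.qwen.base_url.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local servers typically ignore the key

response = client.chat.completions.create(
    model="pretrained/Qwen/Qwen2.5-VL-7B-Instruct",  # adjust to the served model name
    messages=[{"role": "user", "content": "Reply with 'ok' if you can read this."}],
    max_tokens=8,
)
print(response.choices[0].message.content)
```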
The frame processing service handles visual analysis tasks. Set it up similarly:
```bash
python -m vllm.entrypoints.api_server \
    --model pretrained/Qwen/Qwen2.5-VL-7B-Instruct \
    --host localhost \
    --port 8001
```
Update `api.frame_processing.base_urls` in your config file accordingly.
Note: Make sure to have enough GPU memory to serve these models.
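Since `base_urls` is a list, frame processing can presumably be spread across several endpoints. A simple health check over each configured URL, assuming the entries already include the `/v1` prefix and point at OpenAI-compatible servers that expose `/models`, might look like:

```python
# Illustrative health check for the frame-processing endpoints listed in the config.
from pathlib import Path
import requests  # pip install requests
import yaml      # pip install pyyaml

cfg = yaml.safe_load(Path("config/config.yaml").read_text())
base_urls = cfg["api"]["frame_processing"]["base_urls"]

for url in base_urls:
    try:
        r = requests.get(f"{url.rstrip('/')}/models", timeout=5)
        print(f"{url}: HTTP {r.status_code}")
    except requests.RequestException as exc:
        print(f"{url}: unreachable ({exc})")
```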
- Process videos/audios in a folder (an end-to-end scripting sketch follows this list):

  ```bash
  python -m hippomm.core.batch_process --path /path/to/videos/or/audios --memory_store /path/to/memory_store
  ```

- Query the memory system:

  ```bash
  python -m hippomm.core.ask_question --question "What happened in the video?" --memory_store /path/to/memory_store
  ```

- List available `ThetaEvents`:

  ```bash
  python -m hippomm.core.ask_question --list --memory_store /path/to/memory_store
  ```
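For scripting, the same CLI entry points can be chained from Python with `subprocess`; the sketch below simply wraps the commands shown above (the paths are placeholders):

```python
# End-to-end example: build a memory store from a folder of videos, then query it.
# This only wraps the CLI commands documented above; paths are placeholders.
import subprocess

VIDEO_DIR = "/path/to/videos/or/audios"
MEMORY_STORE = "/path/to/memory_store"

# 1. Process the folder into the memory store.
subprocess.run(
    ["python", "-m", "hippomm.core.batch_process",
     "--path", VIDEO_DIR, "--memory_store", MEMORY_STORE],
    check=True,
)

# 2. Ask a question against the consolidated memories.
result = subprocess.run(
    ["python", "-m", "hippomm.core.ask_question",
     "--question", "What happened in the video?", "--memory_store", MEMORY_STORE],
    check=True, capture_output=True, text=True,
)
print(result.stdout)
```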
If you find this work useful, please cite it as follows:
```bibtex
@misc{lin2025hippommhippocampalinspiredmultimodalmemory,
      title={HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding},
      author={Yueqian Lin and Qinsi Wang and Hancheng Ye and Yuzhe Fu and Hai "Helen" Li and Yiran Chen},
      year={2025},
      eprint={2504.10739},
      archivePrefix={arXiv},
      primaryClass={cs.MM},
      url={https://arxiv.org/abs/2504.10739},
}
```