Project Website: https://hippomultimodalmemory.github.io/
HippoVlog is a novel benchmark dataset designed for evaluating Multimodal Memory and Reasoning (MMR) systems. It consists of 25 long-form daily vlogs (682 minutes total) with naturalistic audiovisual content and 1,000 validated multiple-choice question-answer pairs.
HippoVlog addresses key limitations in existing multimodal understanding benchmarks by providing:
- Long-form continuous videos with naturalistic content
- Ground truth answers for objective evaluation
- Rich audio-visual integration challenges
- Temporal reasoning requirements
The dataset includes four types of memory-focused questions:
- Cross-modal binding ($T_{V \times A}$): Linking visual/auditory cues
- Auditory-focused retrieval ($T_A$): Recalling audio details
- Visual-focused retrieval ($T_V$): Extracting visual details
- Semantic/Temporal reasoning ($T_S$): Integrating information over time
The dataset consists of:
- Video files (25 long-form vlogs); download instructions below.
- questions.jsonl: Contains 1,000 validated multiple-choice questions with:
  - Question text
  - Four candidate answers
  - Correct answer
  - Explanation
  - Video ID and timestamp
  - Category
Requirements:
- Python 3.8+
- yt-dlp
- A cookies file for YouTube access
- Download the videos from YouTube:
    chmod +x download.sh
    ./download.sh
Note: You'll need to provide a cookies.txt file for YouTube access. Ensure the download.sh script itself does not contain identifying information.
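The contents of download.sh are not reproduced here, so the following is only a rough sketch of how a single video could be fetched with yt-dlp from Python, assuming a Netscape-format cookies.txt in the working directory. The function names and the `videos` output directory are hypothetical; `--cookies` and `-o` are standard yt-dlp options.

```python
import subprocess


def build_ytdlp_command(video_id, cookies_path="cookies.txt", out_dir="videos"):
    """Build a yt-dlp command line for one YouTube video ID.

    The defaults are illustrative assumptions, not values taken from download.sh.
    """
    return [
        "yt-dlp",
        "--cookies", cookies_path,          # cookies file used for YouTube access
        "-o", f"{out_dir}/%(id)s.%(ext)s",  # name each file after its video ID
        f"https://www.youtube.com/watch?v={video_id}",
    ]


def download(video_id, **kwargs):
    """Run yt-dlp for one video; raises CalledProcessError if it fails."""
    subprocess.run(build_ytdlp_command(video_id, **kwargs), check=True)
```

For example, `download("HLDPA3FTUJ4")` would attempt to fetch the video referenced in the sample question entry. Prefer the provided download.sh for the full set of 25 videos.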
The questions.jsonl file contains entries in the following format:
{
  "question_text": "When the main character mentions feeling 'weightless after I've had a drink,' what is visually happening in the scene?",
  "options": {
    "A": "The main character is walking down a busy street wearing a denim jacket with a backpack.",
    "B": "The main character is sitting in a park enjoying a croissant.",
    "C": "The main character is browsing in a clothing store.",
    "D": "The main character is sitting at a ramen restaurant."
  },
  "correct_answer": "A",
  "explanation": "...",
  "video_id": "HLDPA3FTUJ4",
  "category": "audio_visual"
}
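As a minimal sketch of how entries in this format might be consumed, the snippet below reads the file line by line and computes accuracy per category. It assumes each line holds one JSON object with the `correct_answer` and `category` fields shown above. The `predictions` list (one option letter per question, in file order) and both function names are hypothetical and not part of the dataset.

```python
import json
from collections import defaultdict


def load_questions(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    questions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                questions.append(json.loads(line))
    return questions


def accuracy_by_category(questions, predictions):
    """Return {category: fraction of questions answered correctly}.

    `predictions` is an assumed list of option letters ("A".."D"),
    aligned with `questions` by position.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q, pred in zip(questions, predictions):
        total[q["category"]] += 1
        if pred == q["correct_answer"]:
            correct[q["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}
```

A typical call would be `accuracy_by_category(load_questions("questions.jsonl"), predictions)`, where `predictions` comes from the system under evaluation.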