HippoVlog: A Long-form Vlog Dataset for Multimodal Memory and Reasoning

Project Website: https://hippomultimodalmemory.github.io/

HippoVlog is a novel benchmark dataset designed for evaluating Multimodal Memory and Reasoning (MMR) systems. It consists of 25 long-form daily vlogs (682 minutes total) with naturalistic audiovisual content and 1,000 validated multiple-choice question-answer pairs.

Dataset Overview

HippoVlog addresses key limitations in existing multimodal understanding benchmarks by providing:

Long-form continuous videos with naturalistic content
Ground truth answers for objective evaluation
Rich audio-visual integration challenges
Temporal reasoning requirements

Question Categories

The dataset includes four types of memory-focused questions:

Cross-modal binding ($T_{V \times A}$): Linking visual/auditory cues
Auditory-focused retrieval ($T_A$): Recalling audio details
Visual-focused retrieval ($T_V$): Extracting visual details
Semantic/Temporal reasoning ($T_S$): Integrating information over time
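
Because results on a benchmark like this are usually broken down by question type, a per-category accuracy helper can be useful. The following is a minimal sketch rather than an official evaluation script. It assumes each question record carries the `category` and `correct_answer` fields shown in the Data Format example, and that predictions are option letters in the same order as the questions. The function name `accuracy_by_category` is hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(questions, predictions):
    """Return {category: accuracy} for predictions aligned with questions.

    questions   -- list of dicts with "category" and "correct_answer" keys
    predictions -- list of predicted option letters, one per question
    """
    if len(questions) != len(predictions):
        raise ValueError("questions and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for question, predicted in zip(questions, predictions):
        category = question["category"]
        total[category] += 1
        if predicted == question["correct_answer"]:
            correct[category] += 1
    # Every category in `total` has at least one question, so no division by zero.
    return {category: correct[category] / total[category] for category in total}
```

The helper does not hard-code category labels, so it works with whatever strings appear in the `category` field.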

Dataset Structure

The dataset consists of:

Video files (25 long-form vlogs) - Download instructions below.
questions.jsonl: Contains 1,000 validated multiple-choice questions with:
- Question text
- Four candidate answers
- Correct answer
- Explanation
- Video ID and timestamp
- Category

Getting Started

Prerequisites

Python 3.8+
yt-dlp
A cookies file for YouTube access

Downloading the videos

Download the videos from YouTube:
```
chmod +x download.sh
./download.sh
```

Note: You'll need to provide a cookies.txt file for YouTube access. Ensure the download.sh script itself does not contain identifying information.
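
For readers who prefer to call yt-dlp from Python, the sketch below builds an equivalent command line. It is not a reproduction of download.sh, whose contents are not shown here. It assumes videos are addressed by standard YouTube watch URLs built from the `video_id` field, and it uses yt-dlp's `--cookies` and `-o` options. The function names and the `videos` output directory are hypothetical.

```python
import subprocess


def build_download_command(video_ids, cookies_path="cookies.txt", output_dir="videos"):
    """Build a yt-dlp command line for the given YouTube video IDs."""
    urls = [f"https://www.youtube.com/watch?v={video_id}" for video_id in video_ids]
    return [
        "yt-dlp",
        "--cookies", cookies_path,             # cookies file exported from a browser
        "-o", f"{output_dir}/%(id)s.%(ext)s",  # name each file after its video ID
        *urls,
    ]


def download_videos(video_ids, cookies_path="cookies.txt", output_dir="videos"):
    """Run yt-dlp and raise CalledProcessError if it exits with an error."""
    command = build_download_command(video_ids, cookies_path, output_dir)
    subprocess.run(command, check=True)
```

Calling `download_videos(["HLDPA3FTUJ4"])` would fetch the video referenced in the sample question, provided yt-dlp is installed and cookies.txt is present.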

Data Format

The questions.jsonl file contains entries in the following format:

{
    "question_text": "When the main character mentions feeling 'weightless after I've had a drink,' what is visually happening in the scene?",
    "options": {
        "A": "The main character is walking down a busy street wearing a denim jacket with a backpack.",
        "B": "The main character is sitting in a park enjoying a croissant.",
        "C": "The main character is browsing in a clothing store.",
        "D": "The main character is sitting at a ramen restaurant."
    },
    "correct_answer": "A",
    "explanation": "...",
    "video_id": "HLDPA3FTUJ4",
    "category": "audio_visual"
}
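
A minimal loading sketch, assuming the file follows the usual JSON Lines convention of one JSON object per line and that every record has the fields shown in the example above. The key used for the timestamp is not shown in that example, so it is not checked here. The names `parse_questions`, `load_questions`, and `REQUIRED_FIELDS` are hypothetical.

```python
import json

# Fields visible in the example record. The timestamp key is not shown there,
# so it is deliberately left out of this check.
REQUIRED_FIELDS = (
    "question_text",
    "options",
    "correct_answer",
    "explanation",
    "video_id",
    "category",
)


def parse_questions(lines):
    """Parse JSONL lines into a list of question dicts, skipping blank lines."""
    questions = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if record["correct_answer"] not in record["options"]:
            raise ValueError(f"line {line_no}: correct_answer is not one of the options")
        questions.append(record)
    return questions


def load_questions(path="questions.jsonl"):
    """Read and validate the questions file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_questions(handle)
```

Keeping the parsing step separate from file access makes it easy to validate a handful of lines in memory before processing the full file.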

