Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanced Video Large Language Model
This repository contains the official implementation of our ACM MM 2024 work. More details can be found in our paper. [PDF]
In real-world recon-videos such as surveillance and drone reconnaissance videos, the explicit cues commonly used for sentiment analysis (language, acoustics, and facial expressions) are often missing. Yet these videos are frequently rich in anomalous sentiments (e.g., criminal tendencies), which makes implicit scene information (e.g., actions and object relations) essential for identifying such sentiments quickly and precisely. Motivated by this, this paper proposes a new chat-paradigm Implicit anomalous sentiment Discovering and grounding (IasDig) task, which aims to interactively and rapidly discover and ground anomalous sentiments in recon-videos by leveraging implicit scene information (i.e., actions and object relations). Furthermore, this paper argues that the IasDig task faces two key challenges: scene modeling and scene balancing. To this end, this paper proposes a new Scene-enhanced Video Large Language Model named Hawkeye, i.e., acting like a raptor (e.g., a hawk) to discover and locate its prey, for the IasDig task. Specifically, the approach designs a graph-structured scene modeling module and a balanced heterogeneous MoE module to address these two challenges, respectively. Extensive experimental results on our constructed scene-sparsity and scene-density IasDig datasets demonstrate the clear advantage of Hawkeye over advanced Video-LLM baselines on IasDig, especially on false negative rates. This justifies the importance of scene information for identifying implicit anomalous sentiments and the practicality of Hawkeye for real-world applications.
You can set up the environment by running `conda env create -f environment.yml`.
- Prepare the TSL-300 dataset.
- Prepare the UCF-Crime dataset.
- Split the videos into frames, then extract action features with HigherHRNet and object-relation features with RelTR (a minimal frame-splitting sketch is given after the directory tree below).
- Place the features inside the `dataset` folder.
- Please ensure the data structure is as below.
├── dataset
│   ├── vid_split
│   │   ├── 1_Ekman6_disgust_3
│   │   │   ├── 1.mp4
│   │   │   ├── 2.mp4
│   │   │   └── ...
│   │   └── Abuse028_x264
│   │       ├── 1.mp4
│   │       ├── 2.mp4
│   │       └── ...
│   ├── pose_feat
│   │   ├── 1_Ekman6_disgust_3
│   │   │   ├── frame_1.npy
│   │   │   ├── frame_2.npy
│   │   │   └── ...
│   │   └── Abuse028_x264
│   │       ├── frame_1.npy
│   │       ├── frame_2.npy
│   │       └── ...
│   └── rel_feat
│       ├── 1_Ekman6_disgust_3
│       │   ├── frame_1.npy
│       │   ├── frame_2.npy
│       │   └── ...
│       └── Abuse028_x264
│           ├── frame_1.npy
│           ├── frame_2.npy
│           └── ...
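Below is a minimal sketch of how the per-frame feature files in the tree above could be produced. The `extract_features` placeholder and its 256-dimensional output are assumptions, not the repo's actual pipeline; in practice the features should come from HigherHRNet (actions) and RelTR (object relations).

```python
# Minimal sketch: split a video into frames and save one feature file per frame,
# matching the dataset/pose_feat (or dataset/rel_feat) layout above.
# NOTE: extract_features is a placeholder for HigherHRNet / RelTR inference,
# and the 256-dim output is an assumption.
import os
import cv2
import numpy as np

def extract_features(frame):
    # Placeholder for HigherHRNet (action) or RelTR (object-relation) inference.
    return np.zeros(256, dtype=np.float32)

def prepare_video(video_path, feat_root, video_name):
    out_dir = os.path.join(feat_root, video_name)
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        np.save(os.path.join(out_dir, f"frame_{idx}.npy"), extract_features(frame))
        idx += 1
    cap.release()

# Example (paths are illustrative):
# prepare_video("raw_videos/Abuse028_x264.mp4", "dataset/pose_feat", "Abuse028_x264")
```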
- Download the pretrained Vicuna-v1.5 model from Hugging Face and place it in the `lmsys` folder.
- Download the pretrained LanguageBind model from LanguageBind and place it in the `LanguageBind` folder (a scripted-download sketch is shown below).
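If you prefer scripted downloads, a minimal sketch using `huggingface_hub` is shown below. The exact repository ids (Vicuna size and LanguageBind variant) are assumptions; downloading manually from the model pages works equally well.

```python
# Minimal sketch: fetch the pretrained checkpoints with huggingface_hub.
# The repo ids below are assumptions -- pick the Vicuna size and LanguageBind
# variant used in your setup.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="lmsys/vicuna-7b-v1.5", local_dir="lmsys/vicuna-7b-v1.5")
snapshot_download(repo_id="LanguageBind/LanguageBind_Video_merge",
                  local_dir="LanguageBind/LanguageBind_Video_merge")
```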
$ bash scripts/v1_5/finetune_lora_a100.sh
After training, the checkpoint will be saved in the `output_folder` directory.
| Dataset | FNRs | F2 | mAP@0.1 | mAP@0.2 | mAP@0.3 | URL |
|---|---|---|---|---|---|---|
| TSL | 35.82 | 38.09 | 35.24 | 21.21 | 14.71 | Google drive |
| UCF-Crime | 45.66 | 45.03 | 34.41 | 19.22 | 12.1 | Google drive |
You can evaluate the model by running the command below.
python3 eval.py
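For reference, the sketch below illustrates the segment-level bookkeeping behind metrics such as FNR and mAP@tIoU: temporal IoU matching between predicted and ground-truth segments. It is only an illustration under assumed interval representations; `eval.py` in this repository produces the official numbers.

```python
# Illustrative only: temporal IoU between (start, end) intervals, and the
# fraction of ground-truth segments missed at a given tIoU threshold.
def temporal_iou(pred, gt):
    # pred, gt: (start, end) in seconds or frame indices.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def false_negative_rate(preds, gts, thresh=0.1):
    # A ground-truth segment counts as a false negative if no prediction
    # overlaps it with tIoU >= thresh.
    missed = sum(1 for gt in gts
                 if all(temporal_iou(p, gt) < thresh for p in preds))
    return missed / len(gts) if gts else 0.0

# Example: false_negative_rate([(0, 5), (12, 20)], [(1, 4), (30, 35)], 0.1) -> 0.5
```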
If you find this work useful, please consider citing it.
@inproceedings{zhao2024hawkeye,
title={Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanced Video Large Language Model},
author={Zhao, Jianing and Wang, Jingjing and Jin, Yujie and Luo, Jiamin and Zhou, Guodong},
booktitle={Proceedings of {ACM MM} 2024},
year={2024}
}