The video-based cross-modal auxiliary network (VCAN) consists of two components: the Audio Feature Map Module (AFMM), which enhances multi-scale acoustic sentiment representation; and the Cross-Modal Selection Module (CMSM), which aims to improve the interactivity of audiovisual modalities and eliminate redundant computations.
Code for the TCSVT paper "Video-based Cross-modal Auxiliary Network for Multimodal Sentiment Analysis".
These instructions provide you with a core copy of VCAN that you can adapt and run on your local machine for development and testing purposes. The model runs on Windows using PyCharm Community Edition. See the deployment notes below for how to deploy the project on a live system.
To install the software, download requirements.txt and install the required packages. Some of the most important packages are listed below:
--emd==0.4.0
--librosa==0.8.1
--scikit-learn==0.24.2
...
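Since the versions above are pinned, it can help to confirm they are installed before running the model. Below is a minimal sketch that checks the three packages listed above (the full list is in requirements.txt), assuming Python 3.8+ for `importlib.metadata`.

```python
# check_requirements.py -- confirm the key pinned packages are installed
from importlib.metadata import version, PackageNotFoundError

EXPECTED = {"emd": "0.4.0", "librosa": "0.8.1", "scikit-learn": "0.24.2"}

for package, expected in EXPECTED.items():
    try:
        installed = version(package)
        note = "OK" if installed == expected else f"expected {expected}"
        print(f"{package}: {installed} ({note})")
    except PackageNotFoundError:
        print(f"{package}: not installed")
```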
We only provide the core code of the model; the data pre-processing and classification network code is not included. If you need the corresponding demos, please refer to this repository.
As shown in Fig. 1 of the paper, the output of the AFMM forms part of the input of the CMSM, so after installing the packages above you must run the AFMM and the CMSM in order (a minimal sketch of this workflow follows the list):
- Pre-process the data, i.e., adjust the data types, data structures, and corresponding labels.
- Run the AFMM to generate three Mel-spectrograms (three kinds of audio feature sequences), which serve as the acoustic feature input of the CMSM.
- Run the CMSM to output the filtered video keyframes.
- Classify the video keyframes (images) using purpose-designed classifiers or a cooperative classifier group.
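The sketch below illustrates the AFMM-to-CMSM workflow. The Mel-spectrogram extraction uses standard librosa calls; the analysis scales and the `CrossModalSelection` entry point are placeholders for illustration, not the exact values or API of this repository.

```python
# pipeline_sketch.py -- illustrative only; module names and window sizes
# are placeholders, not the exact configuration used in the paper.
import librosa
import numpy as np

def multi_scale_mel(wav_path, n_fft_list=(512, 1024, 2048), n_mels=64):
    """Step 2 (AFMM side): three Mel-spectrograms at different analysis scales."""
    y, sr = librosa.load(wav_path, sr=None)
    specs = []
    for n_fft in n_fft_list:  # illustrative scales; the paper's values may differ
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=n_fft // 4, n_mels=n_mels)
        specs.append(librosa.power_to_db(mel, ref=np.max))
    return specs

# Step 3 (CMSM side): feed the acoustic features together with the video frames
# and keep only the selected keyframes. `CrossModalSelection` is a placeholder name.
# keyframes = CrossModalSelection()(frames, multi_scale_mel("sample.wav"))
# Step 4: pass `keyframes` to your classifier or cooperative classifier group.
```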
If you use the related datasets as benchmark datasets in your papers, please declare the corresponding citations.
- We provide a simple demo of the basic raw data pre-processing so that users can easily adjust the data structure.
- The cooperative classifier group mentioned in our paper consists of different CNN models; users can retrieve the source code from the corresponding papers or refer to this URL (see the sketch below).
- If you have any questions about using VCAN, please contact me by email (chenrongfei@shu.edu.cn).
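As a rough illustration of the cooperative classifier group mentioned above, the sketch below performs a simple late fusion over several torchvision CNNs: each member scores the selected keyframes and the class probabilities are averaged. The backbone choices, the three sentiment classes, and the averaging rule are assumptions for illustration; the models actually used should be taken from the corresponding papers.

```python
# cooperative_group_sketch.py -- illustrative late-fusion ensemble over keyframes;
# backbones, class count, and fusion rule are assumptions, not the paper's setup.
import torch
import torchvision.models as models

def cooperative_predict(keyframes):
    """keyframes: float tensor of shape (N, 3, 224, 224), already normalized."""
    # In practice each member would be fine-tuned on the sentiment classes and
    # its trained weights loaded here; random initialization is used in this sketch.
    members = [models.resnet18(num_classes=3),
               models.vgg16(num_classes=3),
               models.densenet121(num_classes=3)]
    probs = []
    with torch.no_grad():
        for net in members:
            net.eval()
            probs.append(torch.softmax(net(keyframes), dim=1))
    # simple late fusion: average the class probabilities over all group members
    return torch.stack(probs).mean(dim=0).argmax(dim=1)
```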
If you find the dataset and code we have collated helpful for your research, please cite:
@article{chen2022video,
  title={Video-based Cross-modal Auxiliary Network for Multimodal Sentiment Analysis},
  author={Chen, Rongfei and Zhou, Wenju and Li, Yang and Zhou, Huiyu},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  year={2022},
  publisher={IEEE}
}