A novel Token-disentangling Mutual Transformer (TMT) for multimodal emotion recognition. TMT disentangles inter-modality emotion-consistency features from intra-modality emotion-heterogeneity features and mutually fuses them into more comprehensive multimodal emotion representations through two primary modules: multimodal emotion Token disentanglement and the Token mutual Transformer. The Models folder contains our overall TMT model code, and the subNets folder contains the Transformer structures used in it.
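For intuition, below is a minimal, hypothetical PyTorch sketch of the two-module idea; the token layout, layer sizes, and fusion details are illustrative assumptions, not the released implementation in the Models folder.

```python
import torch
import torch.nn as nn

class TMTSketch(nn.Module):
    """Illustrative sketch: learnable consistency/heterogeneity tokens are
    disentangled from modality features, then fused by a second Transformer."""

    def __init__(self, dim=128, n_heads=4, n_layers=2):
        super().__init__()
        # One shared consistency token and one heterogeneity token per
        # modality (text, audio, vision) -- an assumption for illustration.
        self.consistency_token = nn.Parameter(torch.randn(1, 1, dim))
        self.heterogeneity_tokens = nn.Parameter(torch.randn(1, 3, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.disentangle = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mutual = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, 1)  # regression score, as in MOSI/MOSEI

    def forward(self, text, audio, vision):
        # text/audio/vision: (batch, seq_len, dim) modality features.
        b = text.size(0)
        tokens = torch.cat([self.consistency_token.expand(b, -1, -1),
                            self.heterogeneity_tokens.expand(b, -1, -1),
                            text, audio, vision], dim=1)
        # Token disentanglement: learnable tokens attend to all modalities.
        tokens = self.disentangle(tokens)
        # Token mutual Transformer: fuse the four disentangled tokens.
        fused = self.mutual(tokens[:, :4])
        return self.head(fused.mean(dim=1))  # (batch, 1)

if __name__ == "__main__":
    model = TMTSketch()
    t, a, v = (torch.randn(2, 50, 128) for _ in range(3))
    print(model(t, a, v).shape)  # torch.Size([2, 1])
```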
- Train, test and compare in a unified framework.
- Supports our TMT model.
- Supports 3 datasets: MOSI, MOSEI, and CH-SIMS.
- Easy to use, provides Python APIs and command-line tools.
- Experiment with fully customized multimodal features extracted by the MMSA-FET toolkit.
Note: From version 2.0, we packaged the project and uploaded it to PyPI in the hope of making it easier to use. If you don't like the new structure, you can always switch back to the `v_1.0` branch.
At present, only the 6E3D model and the test code have been uploaded; the training code will be released after the paper is published.
- Run the provided test scripts from the command line:

```bash
python test.py
python test_acc5.py
```
- For more detailed usage, please contact ygh2@cug.edu.cn.
- Clone this repo and install requirements.

```bash
$ git clone https://github.com/cug-ygh/TMT
```
TMT currently supports the MOSI, MOSEI, and CH-SIMS datasets. Use the following links to download the raw videos, feature files, and label files. You don't need to download the raw videos if you're not planning to run end-to-end tasks.
- BaiduYun Disk (code: `mfet`)
- Google Drive
SHA-256 checksums for the feature files:

- `MOSI/Processed/unaligned_50.pkl`: `78e0f8b5ef8ff71558e7307848fc1fa929ecb078203f565ab22b9daab2e02524`
- `MOSI/Processed/aligned_50.pkl`: `d3994fd25681f9c7ad6e9c6596a6fe9b4beb85ff7d478ba978b124139002e5f9`
- `MOSEI/Processed/unaligned_50.pkl`: `ad8b23d50557045e7d47959ce6c5b955d8d983f2979c7d9b7b9226f6dd6fec1f`
- `MOSEI/Processed/aligned_50.pkl`: `45eccfb748a87c80ecab9bfac29582e7b1466bf6605ff29d3b338a75120bf791`
- `SIMS/Processed/unaligned_39.pkl`: `c9e20c13ec0454d98bb9c1e520e490c75146bfa2dfeeea78d84de047dbdd442f`
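To confirm a download is intact, you can compare its digest with the values above; here is a minimal Python snippet (the file path is only an example of where the data may be stored):

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (adjust the path to your download location):
print(sha256_of("MOSI/Processed/unaligned_50.pkl"))
```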
TMT uses feature files that are organized as follows:
```python
{
    "train": {
        "raw_text": [],              # raw text
        "audio": [],                 # audio feature
        "vision": [],                # video feature
        "id": [],                    # [video_id$_$clip_id, ..., ...]
        "text": [],                  # bert feature
        "text_bert": [],             # word ids for bert
        "audio_lengths": [],         # audio feature length (over time) for every sample
        "vision_lengths": [],        # same as audio_lengths
        "annotations": [],           # strings
        "classification_labels": [], # Negative(0), Neutral(1), Positive(2). Deprecated in v_2.0
        "regression_labels": []      # Negative(<0), Neutral(0), Positive(>0)
    },
    "valid": {***},                  # same as "train"
    "test": {***},                   # same as "train"
}
```
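A quick way to inspect one of these feature files is to unpickle it and print the keys and array shapes; the path below is illustrative, and the shapes follow the layout described above:

```python
import pickle

# Illustrative path -- point this at the feature file you downloaded.
with open("MOSI/Processed/unaligned_50.pkl", "rb") as f:
    data = pickle.load(f)

train = data["train"]
print(sorted(train.keys()))
print(len(train["raw_text"]), "training samples")
# Modality features are typically arrays of shape (num_samples, seq_len, feat_dim).
for key in ("text", "audio", "vision"):
    print(key, getattr(train[key], "shape", type(train[key])))
```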
Please cite our paper if you find our work useful for your research:
```bibtex
@article{YIN2024108348,
  title   = {Token-disentangling Mutual Transformer for multimodal emotion recognition},
  journal = {Engineering Applications of Artificial Intelligence},
  volume  = {133},
  pages   = {108348},
  year    = {2024},
  issn    = {0952-1976},
  doi     = {10.1016/j.engappai.2024.108348},
  url     = {https://www.sciencedirect.com/science/article/pii/S0952197624005062},
}
```