Abstract
Research on events currently focuses mainly on event extraction, a fine-grained classification task that aims to extract trigger words and arguments from text. Although some researchers have improved event extraction by additionally constructing external image datasets, these images do not come from the original source of the text and therefore cannot be used to detect real-time events. To detect events in multimodal social media data, we propose a new multimodal approach that uses text-image pairs for event classification. Our model employs the vision-language pre-trained model CLIP to obtain visual and textual features, and builds a Transformer encoder as a fusion module that enables interaction between the modalities, yielding a strong multimodal joint representation. Experimental results show that the proposed model outperforms several state-of-the-art baselines.
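To make the described architecture concrete, the following is a minimal sketch of a CLIP-plus-Transformer-fusion classifier of the kind the abstract outlines. It assumes the Hugging Face transformers CLIP implementation and PyTorch; the checkpoint, hyperparameters, and names such as MultimodalEventClassifier are illustrative stand-ins, not the authors' released code.

```python
# A minimal sketch, assuming the Hugging Face `transformers` CLIP backbone.
# Class name, hyperparameters, and `num_event_classes` are hypothetical.
import torch
import torch.nn as nn
from transformers import CLIPModel

class MultimodalEventClassifier(nn.Module):
    def __init__(self, num_event_classes: int, d_model: int = 512,
                 num_fusion_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # CLIP provides aligned text and image embeddings (512-d for this checkpoint).
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads,
                                           batch_first=True)
        # A Transformer encoder fuses the two modalities via self-attention.
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_fusion_layers)
        self.classifier = nn.Linear(d_model, num_event_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        text_feat = self.clip.get_text_features(input_ids=input_ids,
                                                attention_mask=attention_mask)
        image_feat = self.clip.get_image_features(pixel_values=pixel_values)
        # Treat each modality's embedding as one token of a two-token sequence,
        # then pool the fused tokens into a joint representation.
        tokens = torch.stack([text_feat, image_feat], dim=1)  # (B, 2, d_model)
        fused = self.fusion(tokens).mean(dim=1)
        return self.classifier(fused)
```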
Acknowledgements
The authors would like to thank the three anonymous reviewers for their comments on this paper. This research was supported by the National Natural Science Foundation of China (Nos. 62276177 and 61836007) and by a project funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wu, H., Li, P., Wang, Z. (2024). Multimodal Event Classification in Social Media. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1967. Springer, Singapore. https://doi.org/10.1007/978-981-99-8178-6_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8177-9
Online ISBN: 978-981-99-8178-6
eBook Packages: Computer Science, Computer Science (R0)