Abstract
Multimodal Recommendation (MR) exploits multimodal features of items (e.g., visual or textual features) to provide personalized recommendations for users. Recently, scholars have integrated Graph Convolutional Networks (GCNs) into MR to model complicated multimodal relationships, but two significant challenges remain: (1) Most MR methods fail to consider the correlations between different modalities, which degrades modal alignment and results in poor performance on MR tasks. (2) Most MR methods leverage multimodal features to enhance item representation learning; however, the connection between multimodal features and user representations remains largely unexplored. To this end, we propose a novel yet effective Cross-modal Attention-enhanced graph convolution network for user-specific Multimodal Recommendation, named CAMR. Specifically, we design a cross-modal attention mechanism to mine the cross-modal correlations. In addition, we devise a modality-aware user feature learning method that uses rich item information to learn user feature representations. Experimental results on four real-world datasets demonstrate the superiority of CAMR compared with several state-of-the-art methods. The code of this work is available at https://github.com/ZZY-GraphMiningLab/CAMR
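To make the cross-modal attention idea concrete, the following is a minimal sketch of one plausible realization in PyTorch: the item features of one modality attend to those of the other modality, and the two enhanced views are fused into a single item representation. The layer names, dimensions, and the averaging fusion are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a cross-modal attention block for item features.
# All names, shapes, and the fusion step are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Let one modality attend to the other to capture cross-modal correlations."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats:   (num_items, dim), features of one modality (e.g., visual)
        # context_feats: (num_items, dim), features of the other modality (e.g., textual)
        q = self.q_proj(query_feats)
        k = self.k_proj(context_feats)
        v = self.v_proj(context_feats)
        attn = F.softmax(q @ k.t() * self.scale, dim=-1)  # (num_items, num_items)
        return attn @ v                                   # context-enhanced features


if __name__ == "__main__":
    num_items, dim = 128, 64
    visual = torch.randn(num_items, dim)   # e.g., projected image features
    textual = torch.randn(num_items, dim)  # e.g., projected sentence embeddings
    v2t, t2v = CrossModalAttention(dim), CrossModalAttention(dim)
    # Each modality is enhanced with information from the other, then fused.
    visual_enhanced = visual + v2t(visual, textual)
    textual_enhanced = textual + t2v(textual, visual)
    fused_item_repr = (visual_enhanced + textual_enhanced) / 2
    print(fused_item_repr.shape)  # torch.Size([128, 64])
```

In such a design, the fused item representations could then feed a GCN over the user-item graph, so that user embeddings aggregate modality-aware item information, which is the spirit of the modality-aware user feature learning described above.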
Data Availability
The data that support the findings of this study are openly available from the Amazon review dataset at http://jmcauley.ucsd.edu/data/amazon/links.html.
Acknowledgements
This research is supported by the National Natural Science Foundation of China (Grant No. 62472263, 62072288), the Taishan Scholar Program of Shandong Province, Shandong Youth Innovation Team, the Natural Science Foundation of Shandong Province (Grant No. ZR2024MF034, ZR2022MF268).
Author information
Contributions
Ruidong Wang: Conceptualization, Investigation, Methodology, Writing - original draft. Zhongying Zhao: Methodology, Writing - review & editing, Supervision, Funding acquisition. Chao Li: Writing - review & editing.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest related to this work.
Ethical and informed consent for data used
The datasets used in this experiment were made publicly available by the respective organizations/authors to advance research on Multimodal Recommendation. Thus, informed consent is not required to use them. References and citations to the relevant datasets are included in the manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, R., Li, C. & Zhao, Z. Towards user-specific multimodal recommendation via cross-modal attention-enhanced graph convolution network. Appl Intell 55, 2 (2025). https://doi.org/10.1007/s10489-024-06061-1