
Attribute-Guided Cross-Modal Interaction and Enhancement for Audio-Visual Matching

Published: 15 April 2024

Abstract

Audio-visual matching is an essential task that measures the correlation between audio clips and visual images. However, current methods rely solely on the joint embedding of global features from audio-clip and face-image pairs to learn semantic correlations. This approach overlooks high-confidence correlations and discrepancies in subtle local features, which are crucial for cross-modal matching. To address this issue, we propose a novel Attribute-guided Cross-modal Interaction and Enhancement Network (ACIENet), which employs multiple attributes to explore the associations of different key subtle local features. ACIENet contains two novel modules: the Attribute-guided Interaction (AGI) module and the Attribute-guided Enhancement (AGE) module. The AGI module uses global feature alignment similarity to guide cross-modal local feature interactions, which enhances cross-modal association features for the same identity and enlarges cross-modal distinctive features for different identities. The interactive features and the original features are then fused to ensure intra-class discriminability and inter-class correspondence. The AGE module captures subtle attribute-related features with an attribute-driven network, thereby enhancing discrimination at the attribute level; specifically, it strengthens the combined attribute-related features of gender and nationality. To prevent interference between multiple attribute features, we design the multi-attribute learning network as a parallel framework. Experiments conducted on a public benchmark dataset demonstrate the efficacy of ACIENet in different scenarios. Code and models are available at https://github.com/w1018979952/ACIENet.
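
For illustration only, the sketch below gives one plausible reading of the AGI idea described above: a global audio-face similarity score gates cross-modal attention over local features, and the gated interactive features are fused back into the original ones. This is not the authors' implementation (the official code is at the GitHub link above); the module name, tensor shapes, and the gating and fusion choices are all assumptions.

```python
# Hypothetical sketch of similarity-guided cross-modal interaction.
# Not the ACIENet code; names, shapes, and gating/fusion choices are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalInteraction(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_local, face_local, audio_global, face_global):
        # audio_local: (B, Na, D) local audio features; face_local: (B, Nf, D).
        # audio_global, face_global: (B, D) pooled global embeddings.
        # Global alignment similarity, squashed to (0, 1), gates how much
        # cross-modal information is injected.
        sim = torch.sigmoid(F.cosine_similarity(audio_global, face_global, dim=-1))
        gate = sim.view(-1, 1, 1)

        # Cross-attention: audio local features attend over face local features.
        q, k, v = self.query(audio_local), self.key(face_local), self.value(face_local)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        cross = attn @ v  # (B, Na, D)

        # Fuse the gated interactive features with the original audio features,
        # so the original representation's discriminability is preserved.
        return self.fuse(torch.cat([audio_local, gate * cross], dim=-1))
```

In the full model, a symmetric face-to-audio branch and parallel attribute branches (e.g. gender and nationality heads for the AGE module) would sit alongside this block; consult the paper and repository for the actual design.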


Published In

IEEE Transactions on Information Forensics and Security, Volume 19, 2024, 10342 pages

Publisher

IEEE Press

Qualifiers

  • Research-article
