DOI: 10.1145/3617695.3617728
Research article

An Attention-based Audio-visual Fusion Method for Short Video Classification

Published: 02 November 2023

Abstract

Short multimedia videos have become one of the most representative products of the new media era. Because short videos are highly time-sensitive, managing and classifying them efficiently is essential. Current approaches to short video classification mainly extract image features and classify on those alone; audio information, although valuable for classification, is often discarded or processed separately. To address this, we propose an attention-based audio-visual fusion method for short video classification. In this method, an attention module estimates how strongly the image and audio information each influence the classification result, and the two modalities are fused accordingly before classification. Experimental results on several short multimedia video datasets demonstrate that the proposed attention-based audio-visual fusion method is effective and significantly improves the classification accuracy of short videos.
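The abstract does not give the exact architecture, but the core idea — scoring each modality with an attention module and fusing the features by the resulting weights — can be sketched as follows. All names, dimensions, and the scalar-score projection vectors here are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(img_feat, aud_feat, w_img, w_aud):
    """Fuse image and audio feature vectors with scalar attention weights.

    Each modality is scored by a (learned, here fixed) projection vector;
    a softmax over the two scores gives the relative influence of each
    modality, and the fused feature is the attention-weighted sum.
    """
    scores = np.array([img_feat @ w_img, aud_feat @ w_aud])
    alpha = softmax(scores)  # per-modality weights, non-negative, sum to 1
    fused = alpha[0] * img_feat + alpha[1] * aud_feat
    return fused, alpha

# Toy example: 4-dimensional features with fixed random "learned" projections.
rng = np.random.default_rng(0)
img = rng.normal(size=4)
aud = rng.normal(size=4)
w_i = rng.normal(size=4)
w_a = rng.normal(size=4)
fused, alpha = attention_fusion(img, aud, w_i, w_a)
```

The fused vector would then feed a classification head; in a real model the projection vectors (or a small MLP in their place) are trained jointly with the classifier so the attention weights reflect each modality's usefulness for the label.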



    Published In

    BDIOT '23: Proceedings of the 2023 7th International Conference on Big Data and Internet of Things
    August 2023
    232 pages
    ISBN:9798400708015
    DOI:10.1145/3617695

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    BDIOT 2023

    Acceptance Rates

    Overall Acceptance Rate 75 of 136 submissions, 55%

