Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification

Wei Liu¹,², Xianglin Huang¹, Gang Cao¹, Jianglong Zhang³, Gege Song¹ & Lifang Yang¹

Abstract

With the large number of micro-videos available in social network applications, the venue category of a micro-video provides valuable information for location-oriented applications, personalized services, and similar uses. In this paper, we formulate micro-video venue classification as a multi-modal sequential modeling problem. Unlike existing approaches that use long short-term memory (LSTM) models to capture temporal patterns in micro-videos, we propose a multi-modal sequence model with gated fully convolutional blocks. Specifically, we first adopt three parallel gated fully convolutional blocks to extract spatiotemporal features from the visual, acoustic and textual modalities of micro-videos. Then, an additional gated fully convolutional block fuses the spatiotemporal features of the three modalities. Finally, a corresponding prototype is learned simultaneously to improve robustness compared with the softmax classification function. Extensive experiments on a real-world benchmark dataset demonstrate the effectiveness of our model in terms of both Micro-F and Macro-F scores.
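The pipeline outlined in the abstract (three parallel gated convolutional blocks, a fourth block for fusion, and a prototype-based decision) can be illustrated with a minimal pure-Python sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: the GLU-style gate (a convolution multiplied by the sigmoid of a second convolution), the kernel size of 3, channel-wise concatenation as the fusion input, mean pooling over time, nearest-prototype assignment by squared Euclidean distance, and all names and dimensions. Weights and prototypes are random rather than learned, so this is not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d(seq, weights, bias):
    """Zero-padded 1D convolution over time (odd kernel size assumed).

    seq: list of T feature vectors; weights: [k][c_in][c_out]; bias: [c_out].
    """
    k, c_in, c_out = len(weights), len(seq[0]), len(bias)
    pad = k // 2
    zero = [0.0] * c_in
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for t in range(len(seq)):
        row = list(bias)
        for j in range(k):
            x = padded[t + j]
            for i in range(c_in):
                for o in range(c_out):
                    row[o] += x[i] * weights[j][i][o]
        out.append(row)
    return out


def gated_conv_block(seq, params):
    """Assumed GLU-style gating: conv_a(x) * sigmoid(conv_g(x))."""
    a = conv1d(seq, params["w_a"], params["b_a"])
    g = conv1d(seq, params["w_g"], params["b_g"])
    return [[ai * sigmoid(gi) for ai, gi in zip(ra, rg)] for ra, rg in zip(a, g)]


def init_params(rng, k, c_in, c_out):
    """Random, untrained parameters (illustration only)."""
    def w():
        return [[[rng.uniform(-0.5, 0.5) for _ in range(c_out)]
                 for _ in range(c_in)] for _ in range(k)]
    return {"w_a": w(), "b_a": [0.0] * c_out, "w_g": w(), "b_g": [0.0] * c_out}


def classify(visual, acoustic, textual, blocks, fusion, prototypes):
    # One gated block per modality, applied in parallel.
    feats = [gated_conv_block(s, p)
             for s, p in zip((visual, acoustic, textual), blocks)]
    # Assumed fusion input: concatenate the three feature vectors per time step.
    fused_in = [v + a + t for v, a, t in zip(*feats)]
    fused = gated_conv_block(fused_in, fusion)
    # Assumed pooling: average over time to get one embedding per video.
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    # Assumed decision rule: nearest prototype by squared Euclidean distance.
    dists = [sum((p - q) ** 2 for p, q in zip(pooled, proto))
             for proto in prototypes]
    return min(range(len(dists)), key=dists.__getitem__), pooled


rng = random.Random(0)
T, hidden, n_classes = 6, 4, 3
dims = (5, 3, 2)  # hypothetical visual, acoustic, textual feature sizes
inputs = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(T)] for d in dims]
blocks = [init_params(rng, 3, d, hidden) for d in dims]
fusion = init_params(rng, 3, 3 * hidden, hidden)
prototypes = [[rng.gauss(0, 1) for _ in range(hidden)] for _ in range(n_classes)]
label, embedding = classify(*inputs, blocks, fusion, prototypes)
```

In a trained model the convolution weights and the prototypes would be optimized jointly; the sketch only shows how data would flow through such a structure.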

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61401408, 61772539), and the Fundamental Research Funds for the Central Universities (CUC2019B021).

Author information

Authors and Affiliations

School of Computer Science and Cybersecurity, Communication University of China, Beijing, China
Wei Liu, Xianglin Huang, Gang Cao, Gege Song & Lifang Yang
Nanyang Institute of Technology, Nanyang, China
Wei Liu
State Grid Fujian Information and Telecommunication Company, Fuzhou, China
Jianglong Zhang

Corresponding author

Correspondence to Xianglin Huang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, W., Huang, X., Cao, G. et al. Multi-modal sequence model with gated fully convolutional blocks for micro-video venue classification. Multimed Tools Appl 79, 6709–6726 (2020). https://doi.org/10.1007/s11042-019-08147-2

Received: 27 December 2018
Revised: 01 July 2019
Accepted: 02 September 2019
Published: 17 December 2019
Issue Date: March 2020
DOI: https://doi.org/10.1007/s11042-019-08147-2
