DOI: 10.1145/3123266.3123313
research-article

Enhancing Micro-video Understanding by Harnessing External Sounds

Published: 19 October 2017

Abstract

Unlike traditional long videos, micro-videos are much shorter and are usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video is recorded, for example, at a concert or on a beach. However, according to our statistics over two million micro-videos, only 1.22% of them were labeled with location information. For the large remainder of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual, and acoustic content), and more importantly, the quality of each modality varies greatly across micro-videos.
In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that although the acoustic signal complements the visual and textual signals well in reflecting a micro-video's venue, its quality is usually lower. As such, simply integrating acoustic features with visual and textual features leads only to suboptimal results, or even degrades the overall performance (cf. the barrel theory). To address this, we propose to compensate for the shortest board, the acoustic modality, by harnessing external sound knowledge. We develop a deep transfer model that jointly enhances the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that, by leveraging external acoustic knowledge, our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores.
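The abstract reports results in Micro-F1 and Macro-F1. As a reader aid, the following is a minimal sketch of how these two scores are commonly computed for single-label classification. It is not taken from the paper: the function names and the toy venue labels are hypothetical, and averaging Macro-F1 over the union of true and predicted labels is an assumption of this sketch.

```python
from collections import Counter

def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when all three counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    # Assumption: macro-average over every label seen in truth or prediction.
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    # Micro-F1 pools the counts over all classes before computing F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro-F1 computes F1 per class, then takes the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Toy example (hypothetical data): a popular venue and a rare one.
y_true = ["beach"] * 8 + ["concert"] * 2
y_pred = ["beach"] * 10
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.8 0.444
```

In this toy case the rare category is never predicted, so Macro-F1 drops well below Micro-F1. This is why reporting both is informative when some venue categories are unpopular.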

Published In

MM '17: Proceedings of the 25th ACM international conference on Multimedia
October 2017
2028 pages
ISBN:9781450349062
DOI:10.1145/3123266

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. deep neural network
  2. external sound knowledge
  3. micro-video categorization
  4. representation learning

Qualifiers

  • Research-article

Funding Sources

  • 1000 plan

Conference

MM '17: ACM Multimedia Conference
October 23 - 27, 2017
Mountain View, California, USA

Acceptance Rates

MM '17 Paper Acceptance Rate 189 of 684 submissions, 28%;
Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Cited By

  • (2024) An Underwater Multi-Label Classification Algorithm Based on a Bilayer Graph Convolution Learning Network with Constrained Codec. Electronics 13:16 (3134). DOI: 10.3390/electronics13163134. Online publication date: 7-Aug-2024
  • (2024) Multimodal Attentive Representation Learning for Micro-video Multi-label Classification. ACM Transactions on Multimedia Computing, Communications, and Applications 20:6 (1-23). DOI: 10.1145/3643888. Online publication date: 8-Mar-2024
  • (2024) Query-Oriented Micro-Video Summarization. IEEE Transactions on Pattern Analysis and Machine Intelligence 46:6 (4174-4187). DOI: 10.1109/TPAMI.2024.3355402. Online publication date: Jun-2024
  • (2024) SADCMF: Self-Attentive Deep Consistent Matrix Factorization for Micro-Video Multi-Label Classification. IEEE Transactions on Multimedia 26 (10331-10341). DOI: 10.1109/TMM.2024.3406196. Online publication date: 2024
  • (2024) Multimodal Progressive Modulation Network for Micro-Video Multi-Label Classification. IEEE Transactions on Multimedia 26 (10134-10144). DOI: 10.1109/TMM.2024.3405724. Online publication date: 2024
  • (2024) Dual-Domain Aligned Deep Hierarchical Matrix Factorization Method for Micro-Video Multi-Label Classification. IEEE Transactions on Multimedia 26 (2598-2607). DOI: 10.1109/TMM.2023.3301224. Online publication date: 2024
  • (2024) Enhancing Micro-Video Venue Recognition via Multi-Modal and Multi-Granularity Object Relations. IEEE Transactions on Circuits and Systems for Video Technology 34:7 (5440-5451). DOI: 10.1109/TCSVT.2023.3349202. Online publication date: Jul-2024
  • (2024) Deep Matrix Factorization With Complementary Semantic Aggregation for Micro-Video Multi-Label Classification. IEEE Signal Processing Letters 31 (1685-1689). DOI: 10.1109/LSP.2023.3340097. Online publication date: 2024
  • (2024) Multimodal semantic enhanced representation network for micro-video event detection. Knowledge-Based Systems 301 (112255). DOI: 10.1016/j.knosys.2024.112255. Online publication date: Oct-2024
  • (2024) Demsasa: micro-video scene classification based on denoising multi-shots association self-attention. Pattern Analysis and Applications 27:4. DOI: 10.1007/s10044-024-01378-6. Online publication date: 29-Nov-2024