research-article

Enhancing Micro-video Understanding by Harnessing External Sounds

Authors:

Qi TianAuthors Info & Claims

MM '17: Proceedings of the 25th ACM international conference on Multimedia

Pages 1192 - 1200

https://doi.org/10.1145/3123266.3123313

Published: 19 October 2017 Publication History

Get Access

Abstract

Different from traditional long videos, micro-videos are much shorter and usually recorded at a specific place with mobile devices. To better understand the semantics of a micro-video and facilitate downstream applications, it is crucial to estimate the venue where the micro-video is recorded, for example, in a concert or on a beach. However, according to our statistics over two million micro-videos, only $1.22%$ of them were labeled with location information. For the remaining large number of micro-videos without location information, we have to rely on their content to estimate their venue categories. This is a highly challenging task, as micro-videos are naturally multi-modal (with textual, visual and, acoustic content), and more importantly, the quality of each modality varies greatly for different micro-videos.

In this work, we focus on enhancing the acoustic modality for the venue category estimation task. This is motivated by our finding that although the acoustic signal can well complement the visual and textual signal in reflecting a micro-video's venue, its quality is usually relatively lower. As such, simply integrating acoustic features with visual and textual features only leads to suboptimal results, or even adversely degrades the overall performance (cf the barrel theory). To address this, we propose to compensate the shortest board --- the acoustic modality --- via harnessing the external sound knowledge. We develop a deep transfer model which can jointly enhance the concept-level representation of micro-videos and the venue category prediction. To alleviate the sparsity problem of unpopular categories, we further regularize the representation learning of micro-videos of the same venue category. Through extensive experiments on a real-world dataset, we show that our model significantly outperforms the state-of-the-art method in terms of both Micro-F1 and Macro-F1 scores by leveraging the external acoustic knowledge.

References

[1]

Khalid Ashraf, Benjamin Elizalde, Forrest Iandola, Matthew Moskewicz, Julia Bernd, Gerald Friedland, and Kurt Keutzer. 2015. Audio-based multimedia event detection with DNNs and sparse sampling ICMR. 611--614.

Abstract

References

Cited By

Index Terms

Recommendations

Towards Micro-video Understanding by Joint Sequential-Sparse Modeling

Multimodal Learning toward Micro-Video Understanding

Improving Micro-video Recommendation by Controlling Position Bias

Comments

Information

Published In

Sponsors

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Conference

Acceptance Rates

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

Login options

Full Access

View options

PDF

eReader

Share

Share this Publication link

Share on social media

Affiliations