VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events

Published: 03 November 2014
DOI: 10.1145/2647868.2654913

Abstract

This paper proposes a new video representation for few-example event recognition and translation. Unlike existing representations, which rely on either low-level features or pre-specified attributes, we propose to learn an embedding from videos and their descriptions. In our embedding, which we call VideoStory, correlated term labels are combined if their combination improves the video classifier's prediction. Our algorithm prevents the combination of correlated terms that are visually dissimilar by optimizing a joint objective balancing descriptiveness and predictability. The algorithm learns from textual descriptions of video content, which we obtain for free from the web by a simple spidering procedure. We use our VideoStory representation for few-example recognition of events on more than 65K challenging web videos from the NIST TRECVID event detection task and the Columbia Consumer Video collection. Our experiments establish that i) VideoStory outperforms an embedding without the joint objective and alternatives without any embedding, ii) the varying quality of input video descriptions from the web is compensated by harvesting more data, and iii) VideoStory sets a new state-of-the-art for few-example event recognition, outperforming very recent attribute and low-level motion encodings. Moreover, VideoStory translates a previously unseen video to its most likely description from visual content alone.
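The abstract describes the learning problem only at a high level. As a rough, hypothetical illustration of a joint objective balancing descriptiveness and predictability, the numpy sketch below learns an embedding S so that it both reconstructs term labels Y mined from descriptions (descriptiveness, Y ≈ SA) and remains predictable from visual features X (predictability, S ≈ XW), using alternating least squares. Every variable name, dimension, and the solver here is an assumption made for illustration; the paper's actual formulation and optimizer may differ.

```python
import numpy as np

# Illustrative sketch only: toy data standing in for videos and their web descriptions.
n, d, v, k = 200, 64, 50, 10      # videos, visual feature dims, vocabulary terms, embedding size
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))       # visual features, one row per video
Y = (rng.random((n, v)) < 0.1).astype(float)  # binary term labels mined from descriptions

lam = 1.0                         # trades off descriptiveness vs. predictability
S = rng.normal(size=(n, k))       # the learned story embedding
A = rng.normal(size=(k, v))       # decodes the embedding back into terms (descriptiveness)
W = rng.normal(size=(d, k))       # predicts the embedding from visual features (predictability)

for _ in range(50):               # alternating least-squares updates
    W = np.linalg.lstsq(X, S, rcond=None)[0]   # fit visual predictor to current embedding
    A = np.linalg.lstsq(S, Y, rcond=None)[0]   # fit term decoder to current embedding
    # Closed-form update for S from d/dS [||Y - SA||^2 + lam * ||S - XW||^2] = 0
    S = (Y @ A.T + lam * X @ W) @ np.linalg.inv(A @ A.T + lam * np.eye(k))

# An unseen video x maps to the embedding x @ W; decoding x @ W @ A ranks likely terms,
# which is the sense in which such a representation supports translation to a description.
loss = np.linalg.norm(Y - S @ A) ** 2 + lam * np.linalg.norm(S - X @ W) ** 2
print(f"joint objective after training: {loss:.1f}")
```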




    Information

    Published In

    MM '14: Proceedings of the 22nd ACM international conference on Multimedia
    November 2014
    1310 pages
ISBN: 9781450330633
DOI: 10.1145/2647868

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. multimedia event recognition
    2. representation learning
    3. semantic representation

    Qualifiers

    • Research-article

    Conference

MM '14: 2014 ACM Multimedia Conference
November 3-7, 2014
Orlando, Florida, USA

    Acceptance Rates

MM '14 Paper Acceptance Rate: 55 of 286 submissions, 19%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


    Cited By

• (2024) Cross-Modal Learning for Free-Text Video Search. Encyclopedia of Information Science and Technology, Sixth Edition, pp. 1-15. DOI: 10.4018/978-1-6684-7366-5.ch088. Online publication date: 1-Jul-2024.
• (2024) Improving Video Retrieval Performance with Query Expansion Using ChatGPT. Proceedings of the 2024 7th International Conference on Image and Graphics Processing, pp. 431-436. DOI: 10.1145/3647649.3647716. Online publication date: 19-Jan-2024.
• (2024) Towards Temporal Event Detection: A Dataset, Benchmarks and Challenges. IEEE Transactions on Multimedia, 26, pp. 1102-1113. DOI: 10.1109/TMM.2023.3276523. Online publication date: 2024.
• (2024) A Survey of Video Datasets for Grounded Event Understanding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 7314-7327. DOI: 10.1109/CVPRW63382.2024.00727. Online publication date: 17-Jun-2024.
• (2024) Fine-Grained Length Controllable Video Captioning With Ordinal Embeddings. IEEE Access, 12, pp. 189667-189688. DOI: 10.1109/ACCESS.2024.3506751. Online publication date: 2024.
• (2023) Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval. Computer Vision – ECCV 2022 Workshops, pp. 627-643. DOI: 10.1007/978-3-031-25069-9_40. Online publication date: 14-Feb-2023.
• (2021) Zero-shot Video Emotion Recognition via Multimodal Protagonist-aware Transformer Network. Proceedings of the 29th ACM International Conference on Multimedia, pp. 1074-1083. DOI: 10.1145/3474085.3475647. Online publication date: 17-Oct-2021.
• (2021) Dual Encoding for Video Retrieval by Text. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1-1. DOI: 10.1109/TPAMI.2021.3059295. Online publication date: 2021.
• (2020) A Knowledge-Driven Multimedia Retrieval System Based on Semantics and Deep Features. Future Internet, 12(11), 183. DOI: 10.3390/fi12110183. Online publication date: 28-Oct-2020.
• (2020) Incremental transfer learning for video annotation via grouped heterogeneous sources. IET Computer Vision, 14(1), pp. 26-35. DOI: 10.1049/iet-cvi.2018.5730. Online publication date: 20-Jan-2020.
