Abstract
A movie’s key moments stand out from the rest of the screenplay to grab the audience’s attention and make movie browsing efficient, but the lack of annotations makes existing approaches inapplicable to movie key-moment detection. To avoid human annotation, we leverage officially released trailers as weak supervision to learn a model that detects key moments in full-length movies. We introduce a novel ranking network that uses the Co-Attention between movies and trailers as guidance to generate training pairs, where moments highly correlated with the trailer are expected to score higher than uncorrelated moments. Additionally, we propose a Contrastive Attention module that enhances the feature representations so that the contrast between features of key and non-key moments is maximized. We construct the first movie-trailer dataset, and the proposed Co-Attention assisted ranking network shows superior performance even over a supervised approach (the term “supervised” refers to approaches with access to manual ground-truth annotations for training). The effectiveness of our Contrastive Attention module is also demonstrated by improvements over the state of the art on public benchmarks.
L. Wang—This work was done when Lezi Wang worked as an intern at Netflix.
R. Puri—This work was done when Rohit Puri was with Netflix.
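To make the abstract’s two components concrete, the sketch below illustrates them in simplified form: a Co-Attention-style scoring of movie moments against the trailer to mine training pairs, a self-attention feature enhancer standing in for the Contrastive Attention module, and a margin ranking loss over the mined pairs. This is a minimal sketch, assuming PyTorch; every name and hyper-parameter in it (`ContrastiveAttention`, `co_attention_scores`, `ranking_loss`, `margin`, `k`) is hypothetical rather than taken from the paper.

```python
# Minimal sketch, assuming PyTorch; names and hyper-parameters are hypothetical
# and do not reproduce the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAttention(nn.Module):
    """Simplified stand-in for the Contrastive Attention module: re-weights
    moment features with self-attention so key/non-key features can contrast."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_moments, dim) clip-level features from a video backbone
        affinity = self.query(feats) @ self.key(feats).t()
        attn = F.softmax(affinity / feats.size(-1) ** 0.5, dim=-1)
        return feats + attn @ feats  # residual, attention-enhanced features

def co_attention_scores(movie: torch.Tensor, trailer: torch.Tensor) -> torch.Tensor:
    """Score each movie moment by its strongest cosine similarity to any
    trailer moment; high scores serve as pseudo key-moment labels."""
    m = F.normalize(movie, dim=-1)
    t = F.normalize(trailer, dim=-1)
    return (m @ t.t()).max(dim=-1).values  # (num_movie_moments,)

def ranking_loss(scorer: nn.Module, feats: torch.Tensor, scores: torch.Tensor,
                 margin: float = 1.0, k: int = 8) -> torch.Tensor:
    """Margin ranking loss over pairs mined from co-attention: the k most
    trailer-correlated moments should outscore the k least correlated ones."""
    order = scores.argsort(descending=True)
    s_pos = scorer(feats[order[:k]]).squeeze(-1)   # pseudo key moments
    s_neg = scorer(feats[order[-k:]]).squeeze(-1)  # pseudo non-key moments
    return F.relu(margin - s_pos.unsqueeze(1) + s_neg.unsqueeze(0)).mean()
```

In the paper the co-attention is learned jointly with the ranking network rather than computed as fixed cosine similarities, and the scorer here could be as simple as `nn.Linear(dim, 1)`; the sketch fixes both for brevity.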
Notes
3. Our Co-Attention module is not applicable to the VHD task, since VHD contains no video pairs analogous to the trailer-movie pairs in MTMD.
About this paper
Cite this paper
Wang, L., Liu, D., Puri, R., Metaxas, D.N. (2020). Learning Trailer Moments in Full-Length Movies with Co-Contrastive Attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12363. Springer, Cham. https://doi.org/10.1007/978-3-030-58523-5_18
DOI: https://doi.org/10.1007/978-3-030-58523-5_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58522-8
Online ISBN: 978-3-030-58523-5
eBook Packages: Computer Science (R0)