Interp-SUM: Unsupervised Video Summarization with Piecewise Linear Interpolation
Figure 1. Overview: our goal is to select keyframes with high importance scores to summarize the video, based on a reinforcement learning method.
Figure 2. Interpolation-based video summarization framework.
Figure 3. Transformer network and CNN based Video Summarization Network (TCVSN).
Figure 4. Piecewise linear interpolation of the importance score candidate to the importance score.
Figure 5. Average results (%) for different importance score candidate sizes (20, 35, 50) interpolated to the importance score on the SumMe and TVSum datasets.
Figure 6. Visualized importance scores and sampled summary frames for the 'Air Force One' video in the SumMe dataset [13]. Gray bars are the ground-truth summary importance scores; red bars are the top 1/3 of the importance scores generated by different variants of our approach.
Abstract
1. Introduction
- We propose an unsupervised video summarization method with piecewise linear interpolation to mitigate the high-variance problem and to generate a natural sequence of summary frames. Interpolation reduces the size of the network's output layer and makes the network learn faster, because the network only needs to predict a short importance score candidate.
- We present a Transformer network and CNN based video summarization network (TCVSN) to generate high-quality importance scores.
- We develop a novelty reward that measures the novelty of the selected keyframes.
- We develop a modified reconstruction loss with random masking to promote the representativeness of the summary.
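The first contribution above rests on a simple mechanism: the network predicts only a short importance score candidate, which is then stretched to a per-frame score by piecewise linear interpolation. A minimal sketch with `numpy.interp` (the function name `interpolate_scores` is ours, not from the paper):

```python
import numpy as np

def interpolate_scores(candidate, num_frames):
    """Expand a short importance score candidate to a per-frame
    importance score via piecewise linear interpolation."""
    # Positions of the candidate points, spread evenly over the video.
    xp = np.linspace(0, num_frames - 1, num=len(candidate))
    x = np.arange(num_frames)           # one position per frame
    return np.interp(x, xp, candidate)  # piecewise linear in between

# e.g. a 5-point candidate expanded to a 9-frame importance score
scores = interpolate_scores([0.1, 0.9, 0.2, 0.7, 0.4], 9)
assert len(scores) == 9
```

Because neighboring frames receive scores that lie on a line between two candidate points, the resulting score sequence varies smoothly, which is what yields a natural sequence of summary frames.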
2. Background and Related Work
2.1. Video Summarization
2.2. Policy Gradient Method
3. Unsupervised Video Summarization with Piecewise Linear Interpolation
3.1. Generating Importance Score Candidate
3.2. Importance Score Interpolation
3.3. Training with Policy Gradient
3.4. Regularization and Reconstruction Loss
Algorithm 1. Training Video Summarization Network.
1: Input: frame-level features of the video
2: Output: TCVSN parameters θ
3: for number of iterations do
4:    x ← frame-level features of the video
5:    p ← TCVSN(x)      % generate the importance score candidate
6:    s ← piecewise linear interpolation of p
7:    a ~ Bernoulli(s)   % sample frame-selection actions from the score
8:    calculate the rewards and the loss using s and a
9:    update θ using the policy gradient method (loss minimization)
10: end for
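The training loop of Algorithm 1 can be sketched in numpy. This is a schematic toy, not the authors' implementation: a linear policy stands in for TCVSN, the reward function and the moving-average baseline are illustrative stand-ins, and the candidate/interpolation step is omitted for brevity. The REINFORCE gradient for a Bernoulli policy with sigmoid scores reduces to `feats.T @ (actions - scores)`:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(w, feats, reward_fn, baseline, lr=0.1):
    """One REINFORCE-style update: sample frame-selection actions
    from Bernoulli(score), score the summary, and move w along the
    policy gradient scaled by (reward - baseline)."""
    scores = sigmoid(feats @ w)        # per-frame importance score
    actions = rng.binomial(1, scores)  # a_t ~ Bernoulli(s_t)
    reward = reward_fn(actions)
    # gradient of the log-probability w.r.t. the logits is (a - s)
    grad_w = feats.T @ (actions - scores)
    return w + lr * (reward - baseline) * grad_w, reward

# toy data: 10 frames, 4-dim features; reward favors selecting ~30% of frames
feats = rng.normal(size=(10, 4))
reward_fn = lambda a: 1.0 - abs(a.mean() - 0.3)
w, baseline = np.zeros(4), 0.0
for _ in range(200):
    w, r = train_step(w, feats, reward_fn, baseline)
    baseline = 0.9 * baseline + 0.1 * r  # moving-average reward baseline
```

Subtracting a baseline from the reward does not change the expected gradient but reduces its variance, which is the same concern the piecewise interpolation and the UREX exploration address from other directions.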
3.5. Generating Video Summary
4. Experiments
4.1. Dataset
4.2. Evaluation Setup
4.3. Implementation Details
4.4. Quantitative Evaluation
4.5. Qualitative Evaluation
- Interp-SUM is our proposed method with no changes.
- Interp-SUM w/o UREX, which does not use the exploring under-appreciated rewards (UREX) method; this variant uses the policy gradient method proposed in [4] instead.
- Interp-SUM w/o Recon. Loss, which does not use the modified reconstruction loss that we presented.
- Interp-SUM w/o Interp., which does not use the piecewise linear interpolation method; the network is trained to predict the importance score directly.
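The qualitative comparison (Figure 6) visualizes each variant by highlighting the top 1/3 of its predicted importance scores. A minimal sketch of that selection step (the helper name `top_third_keyframes` is ours):

```python
import numpy as np

def top_third_keyframes(scores):
    """Return, in temporal order, the indices of the frames whose
    importance scores fall in the top third."""
    scores = np.asarray(scores)
    k = max(1, len(scores) // 3)
    # indices of the k highest-scoring frames, sorted back into time order
    return np.sort(np.argsort(scores)[-k:])

print(top_third_keyframes([0.2, 0.9, 0.1, 0.8, 0.3, 0.7]))  # -> [1 3]
```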
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Ejaz, N.; Mehmood, I.; Baik, S.W. Efficient visual attention based framework for extracting key frames from videos. J. Image Commun. 2013, 28, 34–44. [Google Scholar] [CrossRef]
- Mahasseni, B.; Lam, M.; Todorovic, S. Unsupervised Video Summarization with Adversarial LSTM Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 22–25 July 2017; pp. 202–211. [Google Scholar]
- Ji, J.; Xiong, K.; Pang, Y.; Li, X. Video Summarization with Attention-Based Encoder-Decoder Networks. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1709–1717. [Google Scholar] [CrossRef] [Green Version]
- Zhou, K.; Qiao, Y.; Xiang, T. Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward. AAAI Conf. Artif. Intell. 2018, 32, 7582–7589. [Google Scholar]
- Wu, C.; Rajeswaran, A.; Duan, Y.; Kumar, V.; Bayen, A.M.; Kakade, S.; Mordatch, I.; Abbeel, P. Variance Reduction for Policy Gradient with Action-Dependent Factorized Baselines. In Proceedings of the International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-augmented Transformer for Speech Recognition. arXiv 2020, arXiv:2005.08100. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
- Nachum, O.; Norouzi, M.; Schuurmans, D. Improving Policy Gradient by Exploring Under-Appreciated Rewards. arXiv 2016, arXiv:1611.09321. [Google Scholar]
- Zhang, K.; Chao, W.L.; Sha, F.; Grauman, K. Video Summarization with Long Short-term Memory. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Amsterdam, The Netherlands, 2016; pp. 766–782. [Google Scholar]
- Zhang, Y.; Kampffmeyer, M.; Zhao, X.; Tan, M. DTR-GAN: Dilated Temporal Relational Adversarial Network for Video Summarization. In Proceedings of the ACM Turing Celebration Conference (ACM TURC), Shanghai, China, 18 May 2018. [Google Scholar]
- Zhang, K.; Grauman, K.; Sha, F. Retrospective Encoders for Video Summarization. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Munich, Germany, 2018; pp. 391–408. [Google Scholar]
- Gygli, M.; Grabner, H.; Riemenschneider, H.; Gool, L.V. Creating summaries from user videos. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Zurich, Switzerland, 2014; pp. 505–520. [Google Scholar]
- Song, Y.; Vallmitjana, J.; Stent, A.; Jaimes, A. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5179–5187. [Google Scholar]
- Rochan, M.; Ye, L.; Wang, Y. Video Summarization Using Fully Convolutional Sequence Networks. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Munich, Germany, 2018; pp. 347–363. [Google Scholar]
- Yuan, L.; Tay, F.E.; Li, P.; Zhou, L.; Feng, F. Cycle-SUM: Cycle-consistent Adversarial LSTM Networks for Unsupervised Video Summarization. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 9143–9150. [Google Scholar]
- Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning (ICML), Beijing, China, 21–26 June 2014; pp. 387–395. [Google Scholar]
- Liu, T. Compare and Select: Video Summarization with Multi-Agent Reinforcement Learning. arXiv 2020, arXiv:2007.14552. [Google Scholar]
- Song, X.; Chen, K.; Lei, J.; Sun, L.; Wang, Z.; Xie, L.; Song, M. Category driven deep recurrent neural network for video summarization. In Proceedings of the 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA, USA, 11–15 June 2016. [Google Scholar]
- Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems (NIPS), Denver, CO, USA, 29 November–4 December 1999; pp. 1057–1063. [Google Scholar]
- Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 5–8 May 2015. [Google Scholar]
- Yu, Y. Towards Sample Efficient Reinforcement Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 5739–5743. [Google Scholar]
- Lehnert, L.; Laroche, R.; Seijen, H.V. On Value Function Representation of Long Horizon Problems. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; pp. 3457–3465. [Google Scholar]
- Yao, H.; Zhang, S.; Zhang, Y.; Li, J.; Tian, Q. Coarse-to-Fine Description for Fine-Grained Visual Categorization. IEEE Trans. Image Process. 2016, 25, 4858–4872. [Google Scholar] [CrossRef]
- Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
- Blu, T.; Thevenaz, P.; Unser, M. Linear Interpolation Revitalized. IEEE Trans. Image Process. 2004, 13, 710–719. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Uchida, S.; Sakoe, H. Piecewise Linear Two-Dimensional Warping. Syst. Comput. Jpn. 2001, 3, 534–537. [Google Scholar] [CrossRef]
- Norouzi, M.; Bengio, S.; Chen, Z.; Jaitly, N.; Schuster, M.; Wu, Y.; Schuurmans, D. Reward augmented maximum likelihood for neural structured prediction. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), Barcelona, Spain, 5–10 December 2016; pp. 1731–1739. [Google Scholar]
- Kaufman, D.; Levi, G.; Hassner, T.; Wolf, L. Temporal Tessellation: A Unified Approach for Video Analysis. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 94–104. [Google Scholar]
- Rochan, M.; Wang, Y. Video Summarization by Learning from Unpaired Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–21 June 2019; pp. 7902–7911. [Google Scholar]
- Apostolidis, E.; Metsai, A.I.; Adamantidou, E.; Mezaris, V.; Patras, I. A Stepwise, Label-based Approach for Improving the Adversarial Training in Unsupervised Video Summarization. In Proceedings of the 1st International Workshop on AI for Smart TV Content Production, Access and Delivery, Nice, France, 21 October 2019; pp. 17–25. [Google Scholar] [CrossRef] [Green Version]
- Apostolidis, E.; Adamantidou, E.; Metsai, A.; Mezaris, V.; Patras, I. Unsupervised Video Summarization via Attention-Driven Adversarial Learning. In Proceedings of the International Conference on Multimedia Modeling (MMM), Daejeon, Korea, 5–8 January 2020; pp. 492–504. [Google Scholar]
| Method | SumMe | TVSum |
|---|---|---|
| w/o Interpolation | 42.32 | 55.32 |
| w/o UREX | 42.46 | 57.42 |
| w/o Reconstruction loss | 41.52 | 49.86 |
| Ours (Interp-SUM) | 44.32 | 58.06 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yoon, U.-N.; Hong, M.-D.; Jo, G.-S. Interp-SUM: Unsupervised Video Summarization with Piecewise Linear Interpolation. Sensors 2021, 21, 4562. https://doi.org/10.3390/s21134562