
Jointly Learning the Attributes and Composition of Shots for Boundary Detection in Videos

Published: 01 January 2022

Abstract

In film making, the shot has a profound influence on how movie content is delivered and how audiences respond to it: different emotions and content can be conveyed through well-designed camera movements or shot editing. Accurate shot detection in untrimmed videos should therefore be regarded as the first and most fundamental step toward high-level understanding of long videos. Existing approaches address this problem using the visual differences and content transitions between consecutive frames, while ignoring intrinsic shot attributes, viz., camera movement, scale, and viewing angle, which essentially reveal how each shot is created. In this work, we propose a new learning framework (SCTSNet) for shot boundary detection that jointly recognizes the attributes and composition of shots in videos. To facilitate the analysis of shots and the evaluation of shot detection models, we collect a large-scale shot boundary dataset, MovieShots2, which contains 15K shots from 282 movie clips. It is richly annotated with the temporal boundaries between consecutive shots and with individual shot attributes, including camera movement, scale, and viewing angle, the three most distinctive shot attributes. Our experiments show that the joint learning framework significantly boosts boundary detection performance, surpassing previous scores by a large margin: SCTSNet improves shot boundary detection AP from 0.65 to 0.77, pushing the performance to a new level.
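The abstract does not spell out the SCTSNet architecture, so the following is only a minimal multi-task sketch of what joint learning of shot boundaries and shot attributes could look like: a shared clip encoder feeds one head that scores boundary probability between consecutive clips and three classification heads for the annotated attributes (camera movement, scale, viewing angle). All module names, feature dimensions, and class counts below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (NOT the authors' SCTSNet): joint shot-boundary detection
# and shot-attribute classification on top of a shared clip encoder.
# Feature dimensions and class counts are illustrative guesses.
import torch
import torch.nn as nn

class JointShotModel(nn.Module):
    def __init__(self, feat_dim=512, n_movement=4, n_scale=5, n_angle=3):
        super().__init__()
        # Shared per-clip encoder; a real system would use a video backbone
        # (e.g., a 3D CNN) instead of this toy MLP over pre-extracted features.
        self.encoder = nn.Sequential(
            nn.Linear(2048, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
        )
        # Boundary head: scores each pair of neighbouring clip features.
        self.boundary_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )
        # Attribute heads applied to each clip feature.
        self.movement_head = nn.Linear(feat_dim, n_movement)
        self.scale_head = nn.Linear(feat_dim, n_scale)
        self.angle_head = nn.Linear(feat_dim, n_angle)

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, 2048) pre-extracted clip features.
        h = self.encoder(clip_feats)                       # (B, T, feat_dim)
        pairs = torch.cat([h[:, :-1], h[:, 1:]], dim=-1)   # consecutive pairs
        boundary_logits = self.boundary_head(pairs).squeeze(-1)  # (B, T-1)
        return {
            "boundary": boundary_logits,
            "movement": self.movement_head(h),
            "scale": self.scale_head(h),
            "angle": self.angle_head(h),
        }

def joint_loss(outputs, targets, w_attr=0.5):
    """Boundary BCE plus attribute cross-entropies (weighting is a guess)."""
    bce = nn.functional.binary_cross_entropy_with_logits(
        outputs["boundary"], targets["boundary"].float())
    ce = sum(
        nn.functional.cross_entropy(
            outputs[k].flatten(0, 1), targets[k].flatten())
        for k in ("movement", "scale", "angle"))
    return bce + w_attr * ce
```

The intuition behind such a joint objective is that attribute supervision shapes the shared clip features that the boundary head relies on, which is consistent with the abstract's claim that jointly learning attributes and composition boosts boundary detection.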


Cited By

  • (2023) A Coarse-to-Fine Framework for Automatic Video Unscreen. IEEE Transactions on Multimedia, vol. 25, pp. 2723–2733. DOI: 10.1109/TMM.2022.3150177. Online publication date: 1 Jan 2023.



Published In

IEEE Transactions on Multimedia, Volume 24, 2022, 2475 pages

Publisher

IEEE Press
