Abstract
Automatic labelling of speakers is an essential task for speaker diarization in parliamentary debates, given the huge amount of video data to annotate. In this paper, we address the speaker diarization problem as a visual speaker re-identification problem, with special emphasis on the analysis of different shot types. We propose two approaches that make use of convolutional neural networks (CNNs) and biometric traits for keyframe extraction. The approaches are evaluated on challenging real-world datasets from the Canary Islands Parliament and compared with a similar approach that does not analyze the shot type. Results show that using a CNN for shot classification together with biometric traits improves re-identification performance by an average of 9.8%.
This work has been partially supported by the Spanish Government under the projects TIN2011-24598 and TIN2015-64395-R.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Marín-Reyes, P.A., Lorenzo-Navarro, J., Castrillón-Santana, M., Sánchez-Nielsen, E. (2016). Shot Classification and Keyframe Detection for Vision Based Speakers Diarization in Parliamentary Debates. In: Luaces, O., et al. Advances in Artificial Intelligence. CAEPIA 2016. Lecture Notes in Computer Science, vol 9868. Springer, Cham. https://doi.org/10.1007/978-3-319-44636-3_5
DOI: https://doi.org/10.1007/978-3-319-44636-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44635-6
Online ISBN: 978-3-319-44636-3
eBook Packages: Computer Science, Computer Science (R0)