
Audiovisual speech synthesis

Published: 01 February 2015

Abstract

  • Comprehensive overview of the various techniques for audiovisual speech synthesis.
  • Innovative categorization of the techniques based on multiple aspects.
  • Important future directions for the field of audiovisual speech synthesis.
  • Bundles information that was previously scattered across the scientific literature.

We live in a world where we interact with computer systems in countless everyday situations. In the ideal case, this interaction feels as familiar and as natural as the communication we experience with other humans. To this end, an ideal means of communication between a user and a computer system consists of audiovisual speech signals. Audiovisual text-to-speech technology allows a computer system to utter any spoken message to its users. Over the last decades, a wide range of techniques for performing audiovisual speech synthesis has been developed. This paper gives a comprehensive overview of these approaches, using a categorization of the systems based on multiple important aspects that determine the properties of the synthesized speech signals. The paper makes a clear distinction between the techniques used to model the virtual speaker and the techniques used to generate the appropriate speech gestures. In addition, the paper discusses the evaluation of audiovisual speech synthesizers, elaborates on the hardware requirements for performing visual speech synthesis, and describes important future directions that should stimulate the use of audiovisual speech synthesis technology in real-life applications.
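
To make the distinction between modelling the virtual speaker and generating the speech gestures concrete, the sketch below shows a toy phoneme-to-viseme front end of the kind many surveyed systems build on. It is a minimal illustrative example, not taken from this paper: the phoneme set, viseme classes, and uniform timing are hypothetical placeholders.

```python
# Illustrative sketch only (not from the surveyed paper): a toy front end that
# turns a phoneme sequence into a timed viseme track, i.e. the kind of "speech
# gesture" timeline an audiovisual TTS system hands to its virtual-speaker
# renderer. The phoneme set, viseme classes, and timings are hypothetical.

from dataclasses import dataclass

# Hypothetical many-to-one phoneme-to-viseme table (real systems use richer,
# often context-dependent mappings).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "s": "alveolar", "z": "alveolar", "t": "alveolar", "d": "alveolar",
    "a": "open_vowel", "o": "rounded_vowel", "u": "rounded_vowel",
    "i": "spread_vowel", "e": "spread_vowel",
}

@dataclass
class VisemeEvent:
    viseme: str      # mouth-shape class driving the virtual speaker
    start: float     # seconds
    duration: float  # seconds

def phonemes_to_visemes(phonemes, frame_duration=0.08):
    """Map a phoneme sequence to a timed viseme track (toy, uniform timing)."""
    events, t = [], 0.0
    for ph in phonemes:
        viseme = PHONEME_TO_VISEME.get(ph, "neutral")
        # Merge consecutive identical visemes so the renderer gets one gesture.
        if events and events[-1].viseme == viseme:
            events[-1].duration += frame_duration
        else:
            events.append(VisemeEvent(viseme, t, frame_duration))
        t += frame_duration
    return events

if __name__ == "__main__":
    for ev in phonemes_to_visemes(list("mama")):
        print(f"{ev.start:.2f}s  {ev.viseme}  ({ev.duration:.2f}s)")
```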

Published In

Speech Communication, Volume 66, Issue C
February 2015
243 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Publication History

Published: 01 February 2015

Author Tags

  1. Audiovisual speech synthesis
  2. Speech synthesis
  3. Visual speech synthesis

Qualifiers

  • Research-article

Cited By

  • (2024) KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding. Computer Vision – ECCV 2024, pp. 236-253. DOI: 10.1007/978-3-031-72992-8_14. Online publication date: 29-Sep-2024
  • (2024) Adapting Audiovisual Speech Synthesis to Estonian. Text, Speech, and Dialogue, pp. 13-23. DOI: 10.1007/978-3-031-70566-3_2. Online publication date: 9-Sep-2024
  • (2023) Speech-Driven 3D Face Animation with Composite and Regional Facial Movements. Proceedings of the 31st ACM International Conference on Multimedia, pp. 6822-6830. DOI: 10.1145/3581783.3611775. Online publication date: 26-Oct-2023
  • (2021) Designing and Deploying an Interaction Modality for Articulatory-Based Audiovisual Speech Synthesis. Speech and Computer, pp. 36-49. DOI: 10.1007/978-3-030-87802-3_4. Online publication date: 27-Sep-2021
  • (2020) Synthesising 3D Facial Motion from “In-the-Wild” Speech. 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 265-272. DOI: 10.1109/FG47880.2020.00100. Online publication date: 16-Nov-2020
  • (2020) MEAD: A Large-Scale Audio-Visual Dataset for Emotional Talking-Face Generation. Computer Vision – ECCV 2020, pp. 700-717. DOI: 10.1007/978-3-030-58589-1_42. Online publication date: 23-Aug-2020
  • (2018) Concatenative Articulatory Video Synthesis Using Real-Time MRI Data for Spoken Language Training. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4999-5003. DOI: 10.1109/ICASSP.2018.8462401. Online publication date: 15-Apr-2018
  • (2017) Synthesizing Obama. ACM Transactions on Graphics, 36(4), pp. 1-13. DOI: 10.1145/3072959.3073640. Online publication date: 20-Jul-2017
  • (2017) Video-realistic expressive audio-visual speech synthesis for the Greek language. Speech Communication, 95(C), pp. 137-152. DOI: 10.1016/j.specom.2017.08.011. Online publication date: 1-Dec-2017
  • (2016) From capturing to generating human behavior. Proceedings of the 20th Pan-Hellenic Conference on Informatics, pp. 1-6. DOI: 10.1145/3003733.3003814. Online publication date: 10-Nov-2016
  • Show More Cited By
