MM '22 Conference Proceedings · Research article · Open access
DOI: 10.1145/3503161.3547831

Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks

Published: 10 October 2022

Abstract

To support speech-driven interactive systems in diverse conversational scenarios, text-to-speech (TTS) synthesis needs to understand the conversational context and determine appropriate speaking styles for its synthesized speech. These speaking styles are influenced by the dependencies among the multi-modal information in the context at both the global scale (i.e., utterance level) and the local scale (i.e., word level). However, dependency modeling and speaking style inference at the local scale are largely missing from state-of-the-art TTS systems, resulting in the synthesis of incorrect or improper speaking styles. In this paper, to learn the dependencies in conversations at both global and local scales and to improve the synthesis of speaking styles, we propose a context modeling method that models the dependencies among the multi-modal information in the context with a multi-scale relational graph convolutional network (MSRGCN). The learned multi-modal context information at multiple scales is then used to infer the global and local speaking styles of the current utterance for speech synthesis. Experiments demonstrate the effectiveness of the proposed approach, and ablation studies reflect the contributions of modeling multi-modal information and multi-scale dependencies.
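For readers unfamiliar with the building block the abstract names, a relational graph convolution layer (R-GCN, Schlichtkrull et al., 2018) aggregates neighbor features with a separate weight matrix per relation type, which is how a multi-scale variant can treat, e.g., same-speaker versus cross-speaker edges differently. The sketch below is a minimal NumPy illustration of one such layer, not the authors' implementation; the function name `rgcn_layer` and its argument shapes are hypothetical.

```python
import numpy as np

def rgcn_layer(H, adj, W_rel, W_self):
    """One relational graph convolution layer (R-GCN) sketch.

    H      : (n, d_in) node features (e.g., word- or utterance-level embeddings)
    adj    : {relation name: (n, n) binary adjacency matrix}
    W_rel  : {relation name: (d_in, d_out) relation-specific weight matrix}
    W_self : (d_in, d_out) weight for each node's self-connection
    """
    out = H @ W_self                           # self-loop term: W_0 h_i
    for rel, A in adj.items():
        deg = A.sum(axis=1, keepdims=True)     # neighbor count c_{i,r} per node
        norm = A / np.maximum(deg, 1.0)        # mean-aggregate within this relation
        out = out + (norm @ H) @ W_rel[rel]    # sum_r sum_j (1/c_{i,r}) W_r h_j
    return np.maximum(out, 0.0)                # ReLU activation
```

Stacking such layers over graphs built at the utterance level and the word level is one plausible reading of the "multi-scale" design described in the abstract.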

Supplementary Material

MP4 File (mm2022-conversational-tts-presentation.mp4)
Presentation video




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 978-1-4503-9203-7
DOI: 10.1145/3503161
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. context modeling
  2. conversational speech synthesis
  3. multi-scale graph convolution network
  4. speaking style
  5. speech interaction system

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan
  • Shenzhen Key Laboratory of next generation interactive media innovative technology
  • National Natural Science Foundation of China
  • National Natural Science Foundation of China-Research Grant Council of Hong Kong

Conference

MM '22

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 328
  • Downloads (last 6 weeks): 45
Reflects downloads up to 06 Jan 2025.

Cited By

  • (2024) Generative Expressive Conversational Speech Synthesis. Proceedings of the 32nd ACM International Conference on Multimedia, 4187-4196. https://doi.org/10.1145/3664647.3681697 (28 Oct 2024)
  • (2024) Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis. Proceedings of the 32nd ACM International Conference on Multimedia, 2099-2107. https://doi.org/10.1145/3664647.3681348 (28 Oct 2024)
  • (2024) Inferring Agent Speaking Styles for Auditory-Visual User-Agent Conversation. 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 421-425. https://doi.org/10.1109/ISCSLP63861.2024.10800066 (7 Nov 2024)
  • (2024) FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis. 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 299-303. https://doi.org/10.1109/ISCSLP63861.2024.10800061 (7 Nov 2024)
  • (2024) Concss: Contrastive-based Context Comprehension for Dialogue-Appropriate Prosody in Conversational Speech Synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10706-10710. https://doi.org/10.1109/ICASSP48485.2024.10446506 (14 Apr 2024)
  • (2023) Human-computer interaction for virtual-real fusion. Journal of Image and Graphics, 28(6), 1513-1542. https://doi.org/10.11834/jig.230020 (2023)
  • (2023) Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation. Proceedings of the 31st ACM International Conference on Multimedia, 5966-5974. https://doi.org/10.1145/3581783.3613823 (26 Oct 2023)
  • (2023) CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis. Proceedings of the 31st ACM International Conference on Multimedia, 6081-6089. https://doi.org/10.1145/3581783.3612565 (27 Oct 2023)
  • (2023) Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing. IEEE/ACM Transactions on Audio, Speech and Language Processing, 32, 517-528. https://doi.org/10.1109/TASLP.2023.3331813 (10 Nov 2023)
