MM '22 Conference Proceedings · Research article · Open access
DOI: 10.1145/3503161.3547831

Inferring Speaking Styles from Multi-modal Conversational Context by Multi-scale Relational Graph Convolutional Networks

Published: 10 October 2022

Abstract

To support speech-driven interactive systems in diverse conversational scenarios, text-to-speech (TTS) synthesis needs to understand the conversational context and determine appropriate speaking styles for its synthesized speech. These speaking styles are influenced by the dependencies among the multi-modal information in the context at both the global scale (i.e., utterance level) and the local scale (i.e., word level). However, dependency modeling and speaking style inference at the local scale are largely missing from state-of-the-art TTS systems, resulting in the synthesis of incorrect or improper speaking styles. In this paper, to learn the dependencies in conversations at both global and local scales and to improve the synthesis of speaking styles, we propose a context modeling method that models the dependencies among the multi-modal information in the context with a multi-scale relational graph convolutional network (MSRGCN). The learned multi-modal context information at multiple scales is then used to infer the global and local speaking styles of the current utterance for speech synthesis. Experiments demonstrate the effectiveness of the proposed approach, and ablation studies reflect the contributions of modeling multi-modal information and multi-scale dependencies.
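For readers unfamiliar with the building block the abstract names, a relational graph convolution layer (R-GCN, Schlichtkrull et al., 2018) aggregates neighbor features with a separate weight matrix per relation type, which is how a multi-scale variant can treat, e.g., same-speaker versus cross-speaker edges differently. The sketch below is a minimal NumPy illustration of one such layer, not the authors' implementation; the function name `rgcn_layer` and its argument shapes are hypothetical.

```python
import numpy as np

def rgcn_layer(H, adj, W_rel, W_self):
    """One relational graph convolution layer (R-GCN) sketch.

    H      : (n, d_in) node features (e.g., word- or utterance-level embeddings)
    adj    : {relation name: (n, n) binary adjacency matrix}
    W_rel  : {relation name: (d_in, d_out) relation-specific weight matrix}
    W_self : (d_in, d_out) weight for each node's self-connection
    """
    out = H @ W_self                           # self-loop term: W_0 h_i
    for rel, A in adj.items():
        deg = A.sum(axis=1, keepdims=True)     # neighbor count c_{i,r} per node
        norm = A / np.maximum(deg, 1.0)        # mean-aggregate within this relation
        out = out + (norm @ H) @ W_rel[rel]    # sum_r sum_j (1/c_{i,r}) W_r h_j
    return np.maximum(out, 0.0)                # ReLU activation
```

Stacking such layers over graphs built at the utterance level and the word level is one plausible reading of the "multi-scale" design described in the abstract.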

Supplementary Material

MP4 File (mm2022-conversational-tts-presentation.mp4)
Presentation video




Published In

MM '22: Proceedings of the 30th ACM International Conference on Multimedia
October 2022
7537 pages
ISBN: 978-1-4503-9203-7
DOI: 10.1145/3503161
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. context modeling
  2. conversational speech synthesis
  3. multi-scale graph convolution network
  4. speaking style
  5. speech interaction system

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Plan
  • Shenzhen Key Laboratory of next generation interactive media innovative technology
  • National Natural Science Foundation of China
  • National Natural Science Foundation of China-Research Grant Council of Hong Kong

Conference

MM '22

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Downloads (last 12 months): 328
  • Downloads (last 6 weeks): 45
Reflects downloads up to 06 Jan 2025.

Cited By

  • (2024) Generative Expressive Conversational Speech Synthesis. Proceedings of the 32nd ACM International Conference on Multimedia, 4187-4196. https://doi.org/10.1145/3664647.3681697 (28 Oct 2024)
  • (2024) Contrastive Context-Speech Pretraining for Expressive Text-to-Speech Synthesis. Proceedings of the 32nd ACM International Conference on Multimedia, 2099-2107. https://doi.org/10.1145/3664647.3681348 (28 Oct 2024)
  • (2024) Inferring Agent Speaking Styles for Auditory-Visual User-Agent Conversation. 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 421-425. https://doi.org/10.1109/ISCSLP63861.2024.10800066 (7 Nov 2024)
  • (2024) FCTalker: Fine and Coarse Grained Context Modeling for Expressive Conversational Speech Synthesis. 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP), 299-303. https://doi.org/10.1109/ISCSLP63861.2024.10800061 (7 Nov 2024)
  • (2024) Concss: Contrastive-based Context Comprehension for Dialogue-Appropriate Prosody in Conversational Speech Synthesis. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10706-10710. https://doi.org/10.1109/ICASSP48485.2024.10446506 (14 Apr 2024)
  • (2023) Human-computer interaction for virtual-real fusion. Journal of Image and Graphics, 28(6), 1513-1542. https://doi.org/10.11834/jig.230020 (2023)
  • (2023) Emotionally Situated Text-to-Speech Synthesis in User-Agent Conversation. Proceedings of the 31st ACM International Conference on Multimedia, 5966-5974. https://doi.org/10.1145/3581783.3613823 (26 Oct 2023)
  • (2023) CMCU-CSS: Enhancing Naturalness via Commonsense-based Multi-modal Context Understanding in Conversational Speech Synthesis. Proceedings of the 31st ACM International Conference on Multimedia, 6081-6089. https://doi.org/10.1145/3581783.3612565 (27 Oct 2023)
  • (2023) Joint Multiscale Cross-Lingual Speaking Style Transfer With Bidirectional Attention Mechanism for Automatic Dubbing. IEEE/ACM Transactions on Audio, Speech and Language Processing, 32, 517-528. https://doi.org/10.1109/TASLP.2023.3331813 (10 Nov 2023)
