Diffusion-based co-speech gesture generation using joint text and audio representation

A Deichler, S Mehta, S Alexanderson… - Proceedings of the 25th …, 2023 - dl.acm.org
This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of learning a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries, indicating that our system is a promising approach for generating human-like, semantically meaningful co-speech gestures in embodied agents.
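The abstract does not give implementation details for the CSMP module, but contrastive joint-embedding pretraining of this kind (in the style of CLIP) typically optimizes a symmetric InfoNCE objective over paired clips from the two modalities. Below is a minimal PyTorch sketch of such a loss, assuming each modality has already been encoded into fixed-size vectors; the function name, temperature value, and encoder outputs are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def contrastive_speech_motion_loss(speech_emb: torch.Tensor,
                                   motion_emb: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired speech/motion embeddings.

    speech_emb, motion_emb: (batch, dim) outputs of the two encoders.
    Matched pairs lie on the diagonal of the similarity matrix; every other
    pair in the batch acts as a negative. (Illustrative sketch only.)
    """
    # L2-normalize so the dot product is cosine similarity.
    speech_emb = F.normalize(speech_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)

    logits = speech_emb @ motion_emb.t() / temperature  # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the speech-to-motion and motion-to-speech cross-entropy terms.
    loss_s2m = F.cross_entropy(logits, targets)
    loss_m2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2m + loss_m2s)
```

Under this setup, the pretrained embeddings (rather than raw audio or text features) would serve as the conditioning signal fed to the diffusion-based gesture synthesis model at each denoising step.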