Remdis is a platform for developing multimodal dialogue systems. This repository contains the English version of Remdis.
- A module-based dialogue system built on asynchronous processing
- Messaging with Incremental Units (IU) and incremental processing based on Incremental Modules (IM) (see the sketch after this list)
- Pseudo-sequential response generation through parallel execution and streaming generation using a Large Language Model (ChatGPT)
- Turn-taking using Voice Activity Projection (VAP)
- Multimodal interaction by collaborating with MMDAgent-EX
- Python implementation, cross-platform support (Windows/macOS/Ubuntu)
- Multimodal input/output
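To give a feel for the IU-based messaging, the sketch below publishes one IU-like JSON unit to RabbitMQ. It is purely illustrative: the exchange name, field names, and payload are assumptions, not the actual Remdis IU schema.

# Illustrative IU-style publish over RabbitMQ (NOT the actual Remdis schema;
# the exchange name and field names below are assumptions)
import json
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="asr", exchange_type="fanout")  # hypothetical exchange
iu = {"id": 1, "producer": "asr", "update_type": "ADD", "payload": "hello"}
ch.basic_publish(exchange="asr", routing_key="", body=json.dumps(iu))
conn.close()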
- 2025/1/24: Release
In the video, in addition to turn-taking, Audio VAP determines the timing of backchannels.
Note: When installing on a Windows environment, it is recommended to use Command Prompt instead of WSL due to compatibility issues with audio devices.
Remdis uses Docker for running RabbitMQ. If you use Audio VAP with GPU support, you need to install the CUDA Toolkit and cuDNN. It can also run without a GPU, but real-time performance may be slightly reduced.
For development and execution, we recommend working inside a Python virtual environment (e.g., conda or venv).
- Install Docker Desktop
- macOS
brew install --cask docker
- Ubuntu
- Download and install the latest deb package (see this page)
sudo apt-get install ./docker-desktop-<version>-<arch>.deb
- Windows
- Download and run the installer from Docker docs
- (Optional) Install CUDA Toolkit/cuDNN
- Follow the official documentation for installation on Windows/Ubuntu.
- For installation on Windows, you will also need to install Visual C++ by following these steps:
- Download and run the installer
- Check "Desktop development with C++" and proceed with the installation.
- Clone the repository
git clone https://github.com/remdis/remdis-en.git
- Install dependencies
cd remdis
# It is recommended to run within a virtual environment
# Example
conda create -n remdis python=3.11
conda activate remdis
# Install dependencies
pip3 install -r requirements.txt
- Obtain a Google Cloud Speech-to-Text API key in JSON format and specify its path in config/config.yaml
ASR:
  ...
  json_key: <enter your API key>
- Obtain an OpenAI API key and specify it in config/config.yaml
ChatGPT:
  api_key: <enter your API key>
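After editing config/config.yaml, a quick way to confirm both keys are picked up is to load the file with PyYAML. This snippet is only a convenience check (the key names match the excerpts above; PyYAML is assumed to be installed):

# Sanity-check config/config.yaml (illustrative; assumes PyYAML)
import yaml

with open("config/config.yaml") as f:
    config = yaml.safe_load(f)
print("ASR json_key set:", bool(config["ASR"]["json_key"]))
print("ChatGPT api_key set:", bool(config["ChatGPT"]["api_key"]))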
- Clone the repository
git clone https://github.com/ErikEkstedt/VAP.git
- Install dependencies
# Install pytorch, torchvision, and torchaudio
# Note: On macOS with GPU, the conv1d function in the latest PyTorch releases
# is incompatible with the VAP specifications, so downgrade to the version below.
# (Optional; only needed in that case)
pip3 install torch==2.1.0 torchaudio
# Install the main package
pip3 install -r requirements.txt
pip3 install -e .
# Additionally, install torchsummary (not included in the original requirements)
pip3 install torchsummary
# Unzip the model
# Note: This model is fine-tuned for Japanese conversation, but since the base model is trained on the Switchboard corpus, it should also work in English.
cd models/vap
unzip sw2japanese_public0.zip
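Before moving on, it may help to confirm that the pinned PyTorch version is active. This is a convenience check, not part of the VAP repository:

# Convenience version check (not part of the VAP repository)
import torch, torchaudio
print(torch.__version__, torchaudio.__version__)  # expect torch 2.1.0 if you downgraded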
- For OSes other than Windows, follow the installation instructions on the MMDAgent-EX official website (How To Build).
- For Windows, proceed to the next step as the binaries are included.
Note: Each IM must be executed in its own terminal. For example, to run 3 IMs, open three separate terminals, activate the virtual environment in each, and run one IM (Python program) per terminal.
- Run the RabbitMQ server
# If using Docker Desktop, ensure the application is started beforehand
docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3.12-management
- Activate the virtual environment
# Example
conda activate remdis
- Start 3 IMs
python tin.py
python dialogue.py
python tout.py
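If the IMs fail to start, first verify that the RabbitMQ broker launched in step 1 is reachable. Below is a minimal check, assuming the pika client (RabbitMQ's standard Python client) is installed:

# Minimal RabbitMQ connectivity check (illustrative, not part of Remdis;
# assumes the broker from the docker command above on localhost:5672)
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost", port=5672))
print("RabbitMQ reachable:", conn.is_open)
conn.close()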
- Run the RabbitMQ server
# If using Docker Desktop, ensure the application is started beforehand
docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3.12-management
- Activate the virtual environment
# Example
conda activate remdis
- Start 6 IMs (It is recommended to use a headset to avoid system utterances being recognized as input)
# Note: The models for speech synthesis are downloaded by the TTS module during the first run only.
python input.py
python audio_vap.py (or text_vap.py)
python asr.py
python dialogue.py
python tts.py
python output.py
- Run the RabbitMQ server
# If using Docker Desktop, ensure the application is started beforehand
docker run -it --rm --name rabbitmq -p 5672:5672 -p 15672:15672 rabbitmq:3.12-management
- Activate the virtual environment
# Example
conda activate remdis
- Start 5 IMs (It is recommended to use a headset to avoid system utterances being recognized as input)
python input.py
python audio_vap.py (or text_vap.py)
python asr.py
python dialogue.py
python tts.py
- Start MMDAgent-EX
- On Windows: Run
MMDAgent-EX/run.vbs
- On other OSes: Run MMDAgent-EX with the main.mdf file specified
cd MMDAgent-EX
/somewhere/MMDAgent-EX/Release/MMDAgent-EX main.mdf
- Run chk_mic_spk.py
# If you hear your own voice played back, the setup is correct
python input.py
python chk_mic_spk.py
python output.py
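If you hear nothing, the wrong input or output device may be selected. The snippet below lists the audio devices visible to Python; it is illustrative only and assumes the sounddevice package (pip3 install sounddevice), which is not necessarily the audio library Remdis itself uses:

# List audio devices (illustrative helper; assumes the sounddevice package)
import sounddevice as sd
print(sd.query_devices())   # all input/output devices with indices
print(sd.default.device)    # current default (input, output) device indices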
- Run draw_vap_result.py
# Example for spoken dialogue
python input.py
python audio_vap.py
python asr.py
python dialogue.py
python tts.py
python output.py
python draw_vap_result.py
- Run time_out.py
# Example for text dialogue
python tin.py
python dialogue.py
python tout.py
python time_out.py
The license for the original files in this repository, excluding the pre-trained VAP models under models/vap, is the Apache License 2.0. You may use them for both commercial and non-commercial purposes. For the license regarding the CG avatars included with MMDAgent-EX, please refer to MMDAgent-EX/asset/models/README.txt. In addition, please adhere to the licenses already applied to other files.
Copyright 2024 Ryuichiro Higashinaka, Koh Mitsuda, Yuya Chiba, Akinobu Lee
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
The Audio VAP models are trained using the following dialogue datasets:
- Switchboard-1 Release 2
- CALLHOME Japanese Speech
- Travel Agency Task Dialogue Corpus (Tabidachi)
- Speech Dialogue Corpus recorded by the Higashinaka Lab at Nagoya University
The pre-trained VAP models are available for academic research purposes only. Please refer to the usage terms of each corpus. The authors disclaim any liability for claims, damages, or other responsibilities arising from the use of these pre-trained models.
Remdis utilizes several external packages: Google Cloud Speech-to-Text API for speech recognition, TTS for speech synthesis, OpenAI API for dialogue generation, and VAP for turn-taking. Please refer to the respective licenses of these packages.
@inproceedings{remdis2024iwsds,
title={The Remdis toolkit: Building advanced real-time multimodal dialogue systems with incremental processing and large language models},
author={Chiba, Yuya and Mitsuda, Koh and Lee, Akinobu and Higashinaka, Ryuichiro},
booktitle={Proc. IWSDS},
pages={1--6},
year={2024},
}
This work was supported by JST Moonshot Goal 1, "Avatar-Symbiotic Society," JPMJMS2011.