This is a Sarawak Malay speech and text data for the purpose of speech technology research. The data was collected by Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak.

License

CC0-1.0, Unknown licenses found

Licenses found

CC0-1.0
LICENSE
Unknown
LICENSE.html
sarahjuan/sarawakmalay

Sarawak Malay

This is Sarawak Malay conversational speech data for speech technology research. The data is currently experimental and is being used to investigate speaker diarization. It was collected by the Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak.

The data consists of 38 conversations, each containing two speakers, that were transcribed using Transcriber (see the TextGrid folder). The conversations were recorded by different individuals using the microphones of mobile devices or laptops, so the data collectors submitted files in different formats. All audio was then standardized to mono, 16 kHz (16000 Hz) WAV format.
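
As a quick sanity check on the standardized audio, the sketch below reads a WAV header with Python's standard `wave` module and tests whether the file is mono at 16000 Hz. The helper names `describe_wav` and `is_standardized` are illustrative and not part of this repository, and the file path in the usage note is hypothetical.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) from a WAV header."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def is_standardized(path, expected_channels=1, expected_rate=16000):
    """Check that a file matches the mono, 16000 Hz format described above."""
    channels, sample_rate, _duration = describe_wav(path)
    return channels == expected_channels and sample_rate == expected_rate
```

For example, `is_standardized("wav/some_conversation.wav")` should return `True` for a file that follows the format described above; the path here is a placeholder.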

We provide the following files:

  • wav
  • rttm
  • textgrid
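
The rttm files can be read with a few lines of Python. The sketch below assumes they follow the common NIST RTTM layout, in which each `SPEAKER` line holds the file ID, channel, start time, duration, and speaker name as whitespace-separated fields; this layout has not been checked against the files in this repository. The function names and the sample records are illustrative only.

```python
from collections import defaultdict


def parse_rttm(lines):
    """Yield (file_id, start, duration, speaker) for each SPEAKER record.

    Assumes the common RTTM field order:
    type, file, channel, start, duration, ortho, stype, name, conf, slat.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines, comments, and other record types
        yield fields[1], float(fields[3]), float(fields[4]), fields[7]


def speaking_time(lines):
    """Sum the segment durations (in seconds) for each speaker label."""
    totals = defaultdict(float)
    for _file_id, _start, duration, speaker in parse_rttm(lines):
        totals[speaker] += duration
    return dict(totals)


# Made-up records for illustration; not taken from the corpus.
sample = [
    "SPEAKER conv01 1 0.00 2.50 <NA> <NA> spk1 <NA> <NA>",
    "SPEAKER conv01 1 2.50 1.25 <NA> <NA> spk2 <NA> <NA>",
    "SPEAKER conv01 1 3.75 0.50 <NA> <NA> spk1 <NA> <NA>",
]
print(speaking_time(sample))
```

To run it on a real file, pass an open file object, since iterating over it yields lines: `speaking_time(open(path))`.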

This data was used in a speaker diarization task to evaluate our speaker diarization models. Our work was presented at IALP 2023 in Singapore.

Cite our work when this data is used:

@INPROCEEDINGS{10337314,
author={Rahim, Mohd Zulhafiz and Juan, Sarah Samson and Mohamad, Fitri Suraya},
booktitle={2023 International Conference on Asian Language Processing (IALP)},
title={Improving Speaker Diarization for Low-Resourced Sarawak Malay Language Conversational Speech Corpus},
year={2023},
pages={228-233},
keywords={Training;Oral communication;Data models;Usability;Speech processing;Testing;Speaker diarization;x-vectors;clustering;low-resource;auto-labeling;pseudo-labeling;unsupervised},
doi={10.1109/IALP61005.2023.10337314}}

For further details:

Sarah Samson Juan sjsflora@unimas.my

Mohd Zulhafiz bin Rahim mzhafiz1999@gmail.com
