Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus

Carme Armentano-Oller, Montserrat Marimon, Marta Villegas

Abstract

Collecting voice resources for speech recognition systems is a multifaceted challenge involving legal, technical, and diversity considerations. It is nonetheless crucial for ensuring fair access to voice-driven technology across diverse linguistic backgrounds. We describe an ongoing effort to create an extensive, high-quality, publicly available voice dataset for the future development of speech technologies in Catalan through the Mozilla Common Voice crowd-sourcing platform. We detail the specific approaches used to address the challenges faced in recruiting contributors and in managing the collection, validation, and recording of sentences. This overview can serve as guidance for similar initiatives in other projects and linguistic contexts. The success of the project is evident in the latest corpus release, version 16.1, where Catalan ranks as the most prominent language in the corpus in both recorded and validated hours. This establishes Catalan as a language with substantial speech resources for language technology development and considerably raises its international visibility.

Anthology ID: 2024.lrec-main.193
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 2142–2148
URL: https://aclanthology.org/2024.lrec-main.193
Cite (ACL): Carme Armentano-Oller, Montserrat Marimon, and Marta Villegas. 2024. Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2142–2148, Torino, Italia. ELRA and ICCL.
Cite (Informal): Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus (Armentano-Oller et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.193.pdf
