@inproceedings{zevallos-etal-2022-introducing,
title = "Introducing {Q}u{BERT}: A Large Monolingual Corpus and {BERT} Model for {S}outhern {Q}uechua",
author = "Zevallos, Rodolfo and
Ortega, John and
Chen, William and
Castro, Richard and
Bel, N{\'u}ria and
Yoshikawa, Cesar and
Venturas, Renzo and
Aradiel, Hilario and
Melgarejo, Nelsi",
editor = "Cherry, Colin and
Fan, Angela and
Foster, George and
Haffari, Gholamreza (Reza) and
Khadivi, Shahram and
Peng, Nanyun (Violet) and
Ren, Xiang and
Shareghi, Ehsan and
Swayamdipta, Swabha",
booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
month = jul,
year = "2022",
address = "Hybrid",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.deeplo-1.1",
doi = "10.18653/v1/2022.deeplo-1.1",
pages = "1--13",
abstract = "The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained, BERT model called QuBERT which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT which is on par with other state-of-the-art multilingual models for natural language processing achieving between 71 and 74{\%} F1 score on NER and 84{--}87{\%} on POS tasks.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zevallos-etal-2022-introducing">
<titleInfo>
<title>Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua</title>
</titleInfo>
<name type="personal">
<namePart type="given">Rodolfo</namePart>
<namePart type="family">Zevallos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">John</namePart>
<namePart type="family">Ortega</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">William</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Richard</namePart>
<namePart type="family">Castro</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Núria</namePart>
<namePart type="family">Bel</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Cesar</namePart>
<namePart type="family">Yoshikawa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Renzo</namePart>
<namePart type="family">Venturas</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hilario</namePart>
<namePart type="family">Aradiel</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nelsi</namePart>
<namePart type="family">Melgarejo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2022-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Colin</namePart>
<namePart type="family">Cherry</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Angela</namePart>
<namePart type="family">Fan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">George</namePart>
<namePart type="family">Foster</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gholamreza</namePart>
<namePart type="given">(Reza)</namePart>
<namePart type="family">Haffari</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shahram</namePart>
<namePart type="family">Khadivi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nanyun</namePart>
<namePart type="given">(Violet)</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xiang</namePart>
<namePart type="family">Ren</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ehsan</namePart>
<namePart type="family">Shareghi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Swabha</namePart>
<namePart type="family">Swayamdipta</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Hybrid</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained, BERT model called QuBERT which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT which is on par with other state-of-the-art multilingual models for natural language processing achieving between 71 and 74% F1 score on NER and 84–87% on POS tasks.</abstract>
<identifier type="citekey">zevallos-etal-2022-introducing</identifier>
<identifier type="doi">10.18653/v1/2022.deeplo-1.1</identifier>
<location>
<url>https://aclanthology.org/2022.deeplo-1.1</url>
</location>
<part>
<date>2022-07</date>
<extent unit="page">
<start>1</start>
<end>13</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua
%A Zevallos, Rodolfo
%A Ortega, John
%A Chen, William
%A Castro, Richard
%A Bel, Núria
%A Yoshikawa, Cesar
%A Venturas, Renzo
%A Aradiel, Hilario
%A Melgarejo, Nelsi
%Y Cherry, Colin
%Y Fan, Angela
%Y Foster, George
%Y Haffari, Gholamreza (Reza)
%Y Khadivi, Shahram
%Y Peng, Nanyun (Violet)
%Y Ren, Xiang
%Y Shareghi, Ehsan
%Y Swayamdipta, Swabha
%S Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
%D 2022
%8 July
%I Association for Computational Linguistics
%C Hybrid
%F zevallos-etal-2022-introducing
%X The lack of resources for languages in the Americas has proven to be a problem for the creation of digital systems such as machine translation, search engines, chat bots, and more. The scarceness of digital resources for a language causes a higher impact on populations where the language is spoken by millions of people. We introduce the first official large combined corpus for deep learning of an indigenous South American low-resource language spoken by millions called Quechua. Specifically, our curated corpus is created from text gathered from the southern region of Peru where a dialect of Quechua is spoken that has not traditionally been used for digital systems as a target dialect in the past. In order to make our work repeatable by others, we also offer a public, pre-trained, BERT model called QuBERT which is the largest linguistic model ever trained for any Quechua type, not just the southern region dialect. We furthermore test our corpus and its corresponding BERT model on two major tasks: (1) named-entity recognition (NER) and (2) part-of-speech (POS) tagging by using state-of-the-art techniques where we achieve results comparable to other work on higher-resource languages. In this article, we describe the methodology, challenges, and results from the creation of QuBERT which is on par with other state-of-the-art multilingual models for natural language processing achieving between 71 and 74% F1 score on NER and 84–87% on POS tasks.
%R 10.18653/v1/2022.deeplo-1.1
%U https://aclanthology.org/2022.deeplo-1.1
%U https://doi.org/10.18653/v1/2022.deeplo-1.1
%P 1-13
Markdown (Informal)
[Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua](https://aclanthology.org/2022.deeplo-1.1) (Zevallos et al., DeepLo 2022)

ACL
Rodolfo Zevallos, John Ortega, William Chen, Richard Castro, Núria Bel, Cesar Yoshikawa, Renzo Venturas, Hilario Aradiel, and Nelsi Melgarejo. 2022. Introducing QuBERT: A Large Monolingual Corpus and BERT Model for Southern Quechua. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 1–13, Hybrid. Association for Computational Linguistics.
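Since the abstract states that the pretrained QuBERT model is public, a minimal sketch of querying it may be useful. The snippet below uses the Hugging Face transformers fill-mask pipeline, which works with any masked-language model; the model identifier is an assumption (the citation records above do not list a hub name), so substitute the authors' actual release before running.

```python
# Minimal sketch: probe a pretrained Quechua masked-language model.
# Assumption: the checkpoint is published on the Hugging Face Hub under
# the placeholder ID below; replace it with the authors' actual release.
from transformers import pipeline

MODEL_ID = "Llamacha/QuBERTa"  # placeholder/assumed model identifier

# The fill-mask pipeline loads the tokenizer and model together.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

# Use the tokenizer's own mask token, since it differs between
# BERT-style ([MASK]) and RoBERTa-style (<mask>) checkpoints.
sentence = f"Allinllachu, {fill_mask.tokenizer.mask_token}?"

# Print the top candidate completions with their scores.
for prediction in fill_mask(sentence, top_k=5):
    print(f"{prediction['token_str']!r}: {prediction['score']:.3f}")
```

Reading the mask token off the pipeline's tokenizer keeps the sketch agnostic to which architecture the released checkpoint actually uses; for the NER and POS results reported in the abstract, the model would instead be fine-tuned with a token-classification head.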