Natural Language Understanding (NLU) technology has improved significantly over the last few years, and multitask benchmarks such as GLUE are key to evaluating this improvement in a robust and general way. These benchmarks cover a wide and diverse set of NLU tasks that require some form of language understanding beyond the detection of superficial textual clues. However, they are costly to develop and language-dependent, and are therefore only available for a small number of languages.
We present BasqueGLUE, the first NLU benchmark for Basque, built from previously existing datasets and following criteria similar to those used for the construction of GLUE and SuperGLUE. BasqueGLUE is freely available under an open license.
Work published at LREC 2022: BasqueGLUE: A Natural Language Understanding Benchmark for Basque
NOW ALSO AVAILABLE AT HF🤗 Datasets!
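A minimal loading sketch, assuming the benchmark is published on the Hub under an ID like `orai-nlp/basqueGLUE` with per-task configurations (verify the exact identifiers on the Hub page):

```python
# Minimal sketch: load one BasqueGLUE task from the Hugging Face Hub.
# The dataset ID and config name below are assumptions; verify them on the Hub.
from datasets import load_dataset

bhtc = load_dataset("orai-nlp/basqueGLUE", "bhtc")  # topic classification
print(bhtc)              # DatasetDict with train/validation/test splits
print(bhtc["train"][0])  # one example record
```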
Dataset | Train | Val | Test | Task | Metric | Domain |
---|---|---|---|---|---|---|
NERCid | 51,539 | 12,936 | 35,855 | NERC | F1 | News |
NERCood | 64,475 | 14,945 | 14,462 | NERC | F1 | News, Wikipedia |
FMTODeu_intent | 3,418 | 1,904 | 1,087 | Intent classification | F1 | Dialog system |
FMTODeu_slot | 19,652 | 10,791 | 5,633 | Slot filling | F1 | Dialog system |
BHTCv2 | 8,585 | 1,857 | 1,854 | Topic classification | F1 | News |
BEC2016eu | 6,078 | 1,302 | 1,302 | Sentiment analysis | F1 | Twitter |
VaxxStance | 864 | 206 | 312 | Stance detection | MF1* | Twitter |
QNLIeu | 1,764 | 230 | 238 | QA/NLI | Acc | Wikipedia |
WiCeu | 408,559 | 600 | 1,400 | WSD | Acc | Wordnet |
EpecKorrefBin | 986 | 320 | 587 | Coreference resolution | Acc | News |
NERCid stands for NERC in-domain, and NERCood for NERC out-of-domain. Dataset sizes for the sequence labeling tasks (NERC and slot filling) are given in tokens. Acc refers to accuracy, and F1 to the micro-average F1-score. *The metric used for VaxxStance is the macro-average F1-score over two classes, FAVOR and AGAINST.
Note: The train file for WiCeu needs to be uncompressed.
NERC | |
---|---|
Tokens: | Helburuetako bat McLareni eta Ferrariri aurre egitea izango du taldeak . |
Labels: | O O B-ORG O B-ORG O O O O O O |
Translation: | One of the objectives the team will have is to confront McLaren and Ferrari. |
Intent Class (FMTODeu_intent) | |
---|---|
Text: | alarma ezarri gaurko 6:00etan |
Translation: | set the alarm today at 6:00am |
Intent: | alarm/set_alarm |
Slot Filling (FMTODeu_slot) | |
---|---|
Tokens: | Euria egingo du gaur ? |
Labels: | B-weather/attribute O O B-datetime O |
Translation: | Is it going to rain today? |
Topic Classification (BHTCv2) | |
---|---|
Text: | Gurasotasun baimena eta seme-alabak zaintzeko baimena lau hilabetera luzatzeko proposamena egitea onartu du Europako Batzordeak. Proposamenak aldaketa handia ekarriko luke Hego Euskal Herrian, lau asteetara luzatu berri baita baimen hori. |
Translation: | The European Commission has approved making a proposal to extend parental leave and childcare leave to four months. The proposal would mean a major change in the Southern Basque Country, as that leave has only recently been extended to four weeks. |
Topic: | Gizartea (society) |
Sentiment Analysis (BEC) | |
---|---|
Text: | Mezu txoro, patetiko eta lotsagarri hori ongi hartuko duenik badela uste du PSEk. |
Translation: | PSE thinks there are people who will respond positively to that crazy, pathetic and shameful message. |
Polarity: | Negative |
Stance Detection (VaxxStance) | |
---|---|
Text: | Gure nagusiak babestuko dituen txertoa martxan da. Zor genien. Gaur mundua apur bat hobeagoa da. #OsasunPublikoarenGaraipena #GureGaraipena |
Translation: | The vaccine that will protect our elderly people is on its way. We owed them. Today the world is a little bit better. #TheVictoryOfPublicHealthcare #OurVictory |
Stance: | FAVOR |
QNLI | |
---|---|
Question: | “Irrintziaren oihartzunak” dokumentalaz gain, zein beste lan egin ditu zinema arloan? |
Translation: | Aside from the documentary “Irrintziaren oihartzunak”, what other works has she done in the field of cinema? |
Sentence: | “Irrintziaren oihartzunak” du lehen filma zuzendari eta gidoilari gisa. |
Translation: | “Irrintziaren oihartzunak” is her first film as a director and scriptwriter. |
NLI: | not_entailment |
WiC | |
---|---|
Sentence1: | Asterix, zazpi [egunen] segida asmatu zuen galiarra . |
Translation: | Asterix, the Gaul who invented the week of seven [days]. |
Sentence2: | Etxeko landareek sasoi aktiboan tenperatura epelak behar dituzte : [egunez] 25 C ingurukoak . |
Translation: | House plants need warm temperatures during the active season: around 25 °C during the [day]. |
Same sense: | False |
Coreference (EpecKorrefBin) | |
---|---|
Text: | Birmoldaketan daudenen artean [Katalunia , Madril , Hego Euskal Herria , Aragoi , Balear irlak eta Errioxa] aurkitzen dira . [Horien] artean , Hego Euskal Herriak 47.870 milioi pezeta jasoko ditu . |
Translation: | Among those under restructuring are [Catalonia, Madrid, the Southern Basque Country, Aragon, the Balearic Islands and La Rioja]. Among [them], the Southern Basque Country will receive 47,870 million pesetas. |
Coreference: | True |
For more details, see the README file provided with each dataset.
We provide a Python evaluation script. Fine-tuning is left up to the user; the script evaluates the predictions provided for each task against the gold-standard test sets (test.jsonl files). The script expects predictions in the same format as the datasets.
```
python3 eval_basqueglue.py \
    --task [nerc_id | nerc_od | intent | slot | bhtc | bec | vaxx | qnli | wic | coref] \
    --pred prediction_file.jsonl \
    --ref reference_file.jsonl  # usually test.jsonl
```
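As a hypothetical sketch of how a prediction file might be produced, one can mirror the reference file and swap in the model's predictions. The `label` field name below is an assumption for illustration; each task's README documents the actual schema:

```python
# Hypothetical sketch: write a prediction file for eval_basqueglue.py by
# copying each reference record and replacing its label with the model's
# prediction. The "label" field name is an assumption; check each task's
# README for the real schema.
import json

def write_predictions(ref_path, pred_path, predictions):
    """predictions[i] is the predicted label for the i-th reference record."""
    with open(ref_path, encoding="utf-8") as ref, \
         open(pred_path, "w", encoding="utf-8") as out:
        for line, pred in zip(ref, predictions):
            record = json.loads(line)
            record["label"] = pred  # assumed field name
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```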
We evaluated two language models, BERTeus and ElhBERTeu, fine-tuning them on each task independently. We used a learning rate of 3e-5 and a batch size of 32. Each model was fine-tuned 5 times for up to 10 epochs, and the checkpoint that performed best on the validation split was selected to obtain the test results from a single run. The NERC column reports the average of the in-domain and out-of-domain NERC scores.
Model | AVG | NERC (F1) | F_intent (F1) | F_slot (F1) | BHTC (F1) | BEC (F1) | Vaxx (MF1) | QNLI (acc) | WiC (acc) | coref (acc) |
---|---|---|---|---|---|---|---|---|---|---|
BERTeus | 73.23 | 81.92 | 82.52 | 74.34 | 78.26 | 69.43 | 59.30 | 74.26 | 70.71 | 68.31 |
ElhBERTeu | 73.71 | 82.30 | 82.24 | 75.64 | 78.05 | 69.89 | 63.81 | 73.84 | 71.71 | 65.93 |
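A rough sketch of this fine-tuning protocol for one of the sentence-classification tasks, using the Hugging Face Trainer. The Hub dataset ID, the model checkpoint name (`ixa-ehu/berteus-base-cased`) and the column names are assumptions, not part of this repository:

```python
# Rough sketch of the fine-tuning protocol described above (lr 3e-5,
# batch size 32, up to 10 epochs, best checkpoint on validation).
# Dataset ID, model checkpoint and column names are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

task, model_id = "bhtc", "ixa-ehu/berteus-base-cased"  # assumed identifiers
data = load_dataset("orai-nlp/basqueGLUE", task)       # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)   # column name is task-dependent

data = data.map(tokenize, batched=True)
num_labels = data["train"].features["label"].num_classes  # assumes a ClassLabel column

model = AutoModelForSequenceClassification.from_pretrained(model_id,
                                                           num_labels=num_labels)
args = TrainingArguments(
    output_dir=f"out/{task}",
    learning_rate=3e-5,              # as in the paper
    per_device_train_batch_size=32,  # as in the paper
    num_train_epochs=10,             # up to 10 epochs
    eval_strategy="epoch",           # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,     # best checkpoint on the validation split
    seed=42,                         # the paper repeats this over 5 runs
)
Trainer(model=model, args=args, train_dataset=data["train"],
        eval_dataset=data["validation"], tokenizer=tokenizer).train()
```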
Gorka Urbizu [1], Iñaki San Vicente [1], Xabier Saralegi [1], Rodrigo Agerri [2] and Aitor Soroa [2]
Affiliation of the authors:
[1] orai NLP Technologies
[2] HiTZ Center - Ixa, University of the Basque Country UPV/EHU
Each dataset of the BasqueGLUE benchmark has its own license, as most of them are, or are derived from, previously existing datasets. See their respective README files for details. A brief summary:
Dataset | License |
---|---|
NERCid | CC BY-NC-SA 4.0 |
NERCood | CC BY-NC-SA 4.0 |
FMTODeu_intent | CC BY-NC-SA 4.0 |
FMTODeu_slot | CC BY-NC-SA 4.0 |
BHTCv2 | CC BY-NC-SA 4.0 |
BEC2016eu | Twitter's license + CC BY-NC-SA 4.0 |
VaxxStance | Twitter's license + CC BY 4.0 |
QNLIeu | CC BY-SA 4.0 |
WiCeu | CC BY-NC-SA 4.0 |
EpecKorrefBin | CC BY-NC-SA 4.0 |
For the rest of the files of the benchmark, including the evaluation script, the following license applies:
Copyright (C) by Orai NLP Technologies. This benchmark and its evaluation scripts are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/.
If you use this benchmark, please cite the following paper:
- G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), June 2022, Marseille, France.
@InProceedings{urbizu2022basqueglue,
author = {Urbizu, Gorka and San Vicente, Iñaki and Saralegi, Xabier and Agerri, Rodrigo and Soroa, Aitor},
title = {BasqueGLUE: A Natural Language Understanding Benchmark for Basque},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1603--1612},
abstract = {Natural Language Understanding (NLU) technology has improved significantly over the last few years and multitask benchmarks such as GLUE are key to evaluate this improvement in a robust and general way. These benchmarks take into account a wide and diverse set of NLU tasks that require some form of language understanding, beyond the detection of superficial, textual clues. However, they are costly to develop and language-dependent, and therefore they are only available for a small number of languages. In this paper, we present BasqueGLUE, the first NLU benchmark for Basque, a less-resourced language, which has been elaborated from previously existing datasets and following similar criteria to those used for the construction of GLUE and SuperGLUE. We also report the evaluation of two state-of-the-art language models for Basque on BasqueGLUE, thus providing a strong baseline to compare upon. BasqueGLUE is freely available under an open license.},
url = {https://aclanthology.org/2022.lrec-1.172}
}
Gorka Urbizu, Iñaki San Vicente: {g.urbizu,i.sanvicente}@orai.eus