============================================================
 Basque Named Entities Corpus for out of domain Basque NERC
============================================================

This dataset contains sentences with manually annotated named entities.
The training data merges EIEC, a collection of newswire articles from the
Euskaldunon Egunkaria newspaper (Alegria et al. 2004), with newly annotated
data from naiz.eus. For the validation and test sets, sentences from
Wikipedia were annotated following the same annotation guidelines, so both
evaluation sets are out of domain with respect to the training data.


Dataset format and distribution
-----------------------------------

The dataset is divided into three files: train, validation and test splits.

64,475 train.jsonl (News)
14,945 val.jsonl (Wiki)
14,462 test.jsonl (Wiki)

*sizes in tokens
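
As a quick sanity check on the sizes above, the following minimal Python sketch counts sentences and tokens in each split. It assumes that every line of a .jsonl file is one JSON object and that the token list is stored under a field named "tokens"; that field name is an assumption, not something stated in this README, so adjust token_field if the files use a different key.

```python
import json
from pathlib import Path


def count_tokens(path, token_field="tokens"):
    """Return (n_sentences, n_tokens) for a JSONL file, one sentence per line.

    token_field is an assumed field name; change it to match the actual files.
    """
    n_sentences = 0
    n_tokens = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            n_sentences += 1
            n_tokens += len(record[token_field])
    return n_sentences, n_tokens


if __name__ == "__main__":
    # File names as listed in this README; missing files are skipped.
    for name in ("train.jsonl", "val.jsonl", "test.jsonl"):
        if Path(name).exists():
            sentences, tokens = count_tokens(name)
            print(f"{name}: {sentences} sentences, {tokens} tokens")
```

If the tokenization matches the one used by the authors, the token counts printed should be close to the figures listed above.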

Tagged named entities are classified into 4 categories: person (PER),
location (LOC), organization (ORG) and other (MISC) named entities that
do not belong to the previous 3 groups.
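
To illustrate how the four categories can be read back from token-level annotations, here is a hedged Python sketch that groups tagged tokens into entity spans. It assumes the tags are strings in a BIO scheme (for example "B-PER", "I-PER", "O"); this README does not specify the tag encoding, so treat the scheme as an assumption. The example sentence is invented for illustration and is not taken from the corpus.

```python
from collections import Counter

# The four categories named in this README.
CATEGORIES = {"PER", "LOC", "ORG", "MISC"}


def entity_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, category) pairs.

    Assumes string tags such as "B-PER", "I-PER" and "O" (an assumption).
    """
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_type):
            # A new entity starts (an I- tag without a matching B- also starts one).
            flush()
            current_tokens, current_type = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" or any unrecognized tag ends the current entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Invented example sentence, not from the dataset.
tokens = ["Iñaki", "San", "Vicente", "Marseillan", "dago"]
tags = ["B-PER", "I-PER", "I-PER", "B-LOC", "O"]
spans = entity_spans(tokens, tags)
counts = Counter(category for _, category in spans)
print(spans)
print(counts)
```

Counting spans per category in this way gives the entity distribution of a split, which is usually more informative for NERC than raw token counts.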


Authors
-----------
Gorka Urbizu, Iñaki San Vicente and Xabier Saralegi


Affiliation of the authors: 
Elhuyar Foundation




Licensing
-------------
Copyright (C) by Elhuyar Foundation. 
This resource is distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0).
The full details of this license can be found at http://creativecommons.org/licenses/by-nc-sa/4.0/legalcode





Acknowledgements
-------------------
If you use this dataset, please cite the following paper:

- G. Urbizu, I. San Vicente, X. Saralegi, R. Agerri, A. Soroa. BasqueGLUE: A Natural Language Understanding Benchmark for Basque. In: Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022). Marseille, France, June 2022.

 



Contact information
-----------------------
Gorka Urbizu, Iñaki San Vicente: {g.urbizu,i.sanvicente}@elhuyar.eus




References
-------------

I. Alegria, O. Arregi, I. Balza, N. Ezeiza, I. Fernandez,
R. Urizar. Design and Development of a Named Entity Recognizer for an
Agglutinative Language. In: First International Joint Conference on
NLP (IJCNLP-04), Workshop on Named Entity Recognition. 2004.