XQA: A Cross-lingual Open-domain Question Answering Dataset

Jiahua Liu, Yankai Lin, Zhiyuan Liu, Maosong Sun

Abstract

Open-domain question answering (OpenQA) aims to answer questions through text retrieval and reading comprehension. Recently, many neural network-based models have been proposed and have achieved promising results in OpenQA. However, the success of these models relies on a massive volume of training data (usually in English), which is not available in many other languages, especially low-resource ones. Therefore, it is essential to investigate cross-lingual OpenQA. In this paper, we construct a novel dataset, XQA, for cross-lingual OpenQA research. It consists of a training set in English as well as development and test sets in eight other languages. In addition, we provide several baseline systems for cross-lingual OpenQA, including two machine translation-based methods and one zero-shot cross-lingual method (multilingual BERT). Experimental results show that the multilingual BERT model achieves the best results in almost all target languages, while the performance of cross-lingual OpenQA is still much lower than that of English. Our analysis indicates that the performance of cross-lingual OpenQA is related not only to how similar the target language is to English, but also to how difficult the question set of the target language is. The XQA dataset is publicly available at http://github.com/thunlp/XQA.

Anthology ID: P19-1227
Volume: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Month: July
Year: 2019
Address: Florence, Italy
Editors: Anna Korhonen, David Traum, Lluís Màrquez
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2358–2368
URL: https://aclanthology.org/P19-1227
DOI: 10.18653/v1/P19-1227
Cite (ACL): Jiahua Liu, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2019. XQA: A Cross-lingual Open-domain Question Answering Dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2358–2368, Florence, Italy. Association for Computational Linguistics.
Cite (Informal): XQA: A Cross-lingual Open-domain Question Answering Dataset (Liu et al., ACL 2019)
PDF: https://aclanthology.org/P19-1227.pdf
Code: thunlp/XQA
Data: XQA
