Computer Science > Computation and Language

arXiv:2305.14263 (cs)

[Submitted on 23 May 2023 (v1), last revised 6 Nov 2023 (this version, v2)]

Title: LIMIT: Language Identification, Misidentification, and Translation using Hierarchical Models in 350+ Languages

Authors: Milind Agarwal, Md Mahfuz Ibn Alam, Antonios Anastasopoulos

Abstract: Knowing the language of an input text or audio is a necessary first step for using almost every NLP tool, such as taggers, parsers, or translation systems. Language identification is a well-studied problem, sometimes even considered solved; in reality, due to lack of data and computational challenges, current systems cannot accurately identify most of the world's 7,000 languages. To tackle this bottleneck, we first compile a corpus, MCS-350, of 50K multilingual and parallel children's stories in 350+ languages. MCS-350 can serve as a benchmark for language identification of short texts and for 1400+ new translation directions in low-resource Indian and African languages. Second, we propose a novel misprediction-resolution hierarchical model, LIMIT, for language identification that reduces error by 55% (from 0.71 to 0.32) on our compiled children's stories dataset and by 40% (from 0.23 to 0.14) on the FLORES-200 benchmark. Our method can expand language identification coverage into low-resource languages by relying solely on systemic misprediction patterns, bypassing the need to retrain large models from scratch.
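
The reported percentages read as relative error reductions, i.e. the share of the original error that was removed. The short Python check below recomputes them from the rounded figures quoted in the abstract. The function name `relative_error_reduction` is a hypothetical helper introduced here for illustration, not something from the paper, and the assumption that "reduces error by X%" means a relative reduction is ours.

```python
def relative_error_reduction(before: float, after: float) -> float:
    """Return the fraction of the original error that was removed."""
    if before <= 0:
        raise ValueError("'before' must be a positive error value")
    return (before - after) / before


# Rounded error values quoted in the abstract.
mcs350 = relative_error_reduction(0.71, 0.32)   # children's stories dataset
flores = relative_error_reduction(0.23, 0.14)   # FLORES-200 benchmark

print(f"MCS-350:    {mcs350:.1%}")
print(f"FLORES-200: {flores:.1%}")
```

From the rounded inputs this gives roughly 55% for MCS-350 and roughly 39% for FLORES-200. The small gap from the reported 40% is presumably because the abstract's percentage was computed from unrounded error values.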

Comments:	To appear at EMNLP 2023. 24 pages, 2 figures, 12 tables
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2305.14263 [cs.CL]
	(or arXiv:2305.14263v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.14263

Submission history

From: Milind Agarwal
[v1] Tue, 23 May 2023 17:15:43 UTC (893 KB)
[v2] Mon, 6 Nov 2023 16:29:21 UTC (381 KB)
