Computer Science > Computation and Language

arXiv:2210.02956 (cs)

[Submitted on 6 Oct 2022]

Title: Are word boundaries useful for unsupervised language learning?

Authors: Tu Anh Nguyen, Maureen de Seyssel, Robin Algayres, Patricia Roze, Ewan Dunbar, Emmanuel Dupoux


Abstract: Word or word-fragment based Language Models (LMs) are typically preferred over character-based ones in many downstream applications. This may not be surprising, as words seem to be more linguistically relevant units than characters. Words provide at least two kinds of relevant information: boundary information and meaningful units. However, word boundary information may be absent or unreliable in the case of speech input (word boundaries are not marked explicitly in the speech stream). Here, we systematically compare LSTMs as a function of the input unit (character, phoneme, word, word part), with or without gold boundary information. We probe linguistic knowledge in the networks at the lexical, syntactic and semantic levels using three speech-adapted, black-box, psycholinguistically inspired NLP benchmarks (pWUGGY, pBLIMP, pSIMI). We find that the absence of boundaries costs between 2% and 28% in relative performance depending on the task. We show that gold boundaries can be replaced by automatically found ones obtained with an unsupervised segmentation algorithm, and that even modest segmentation performance gives a gain in performance on two of the three tasks compared to basic character- or phone-based models without boundary information.

Comments:	This is an archived version from September 2020
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2210.02956 [cs.CL]
	(or arXiv:2210.02956v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2210.02956

Submission history

From: Tu Anh Nguyen
[v1] Thu, 6 Oct 2022 14:49:42 UTC (633 KB)
