arXiv:2404.12224 (cs)

[Submitted on 18 Apr 2024 (v1), last revised 28 May 2024 (this version, v2)]

Title: Length Generalization of Causal Transformers without Position Encoding

Authors: Jie Wang, Tao Ji, Yuanbin Wu, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang, Xiaoling Wang

Abstract: Generalizing to longer sentences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of attention distributions. We propose a parameter-efficient tuning method that searches for the best temperature hyper-parameter of each attention head, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE can achieve performance competitive with state-of-the-art length generalization algorithms. The source code is publicly accessible.
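
The relationship the abstract draws between "distracted" attention and a per-head temperature can be illustrated with a toy sketch. The code below is not the authors' implementation: the function names, the toy queries and keys, and the choice to double the softmax scale are all assumptions made for illustration. It computes single-head causal attention weights with no position information and shows that a larger softmax scale (a lower temperature) gives a more concentrated, lower-entropy distribution.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys, scale):
    """Single-head causal attention weights without position encoding.

    `scale` multiplies the query-key dot products before the softmax;
    it is the hypothetical per-head knob (an inverse temperature).
    """
    n = len(queries)
    weights = []
    for i in range(n):
        # Causal mask: position i attends only to keys 0..i.
        scores = [
            scale * sum(a * b for a, b in zip(queries[i], keys[j]))
            for j in range(i + 1)
        ]
        weights.append(softmax(scores) + [0.0] * (n - i - 1))
    return weights


def entropy(row):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in row if p > 0.0)


# Toy inputs (assumed, not taken from the paper).
d = 4
queries = [[math.sin(i + c) for c in range(d)] for i in range(8)]
keys = [[math.cos(0.5 * i + c) for c in range(d)] for i in range(8)]

default_scale = 1.0 / math.sqrt(d)  # the usual 1/sqrt(d) scaling
base = causal_attention_weights(queries, keys, default_scale)
sharp = causal_attention_weights(queries, keys, 2.0 * default_scale)

# The last position sees the longest context; compare its entropy.
print(entropy(base[-1]), entropy(sharp[-1]))
```

In this sketch a single scalar per head controls how sharply that head attends, so tuning it touches very few parameters. How the paper actually searches for these values is not described in the abstract.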

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2404.12224 [cs.CL]
	(or arXiv:2404.12224v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.12224

Submission history

From: Jie Wang
[v1] Thu, 18 Apr 2024 14:38:32 UTC (4,595 KB)
[v2] Tue, 28 May 2024 01:38:59 UTC (5,455 KB)
