Computer Science > Computation and Language

arXiv:2407.00211 (cs)

[Submitted on 28 Jun 2024 (v1), last revised 6 Oct 2024 (this version, v2)]

Title: Detection and Measurement of Syntactic Templates in Generated Text

Authors: Chantal Shaib, Yanai Elazar, Junyi Jessy Li, Byron C. Wallace

Abstract: Recent work on evaluating the diversity of text generated by LLMs has focused on word-level features. Here we offer an analysis of syntactic features to characterize general repetition in models, beyond frequent n-grams. Specifically, we define syntactic templates and show that models tend to produce templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning processes such as RLHF. This connection to the pre-training data allows us to analyze syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use of templates as a useful tool for analyzing style memorization of training data in LLMs.

Comments:	EMNLP 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2407.00211 [cs.CL]
	(or arXiv:2407.00211v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2407.00211

Submission history

From: Chantal Shaib
[v1] Fri, 28 Jun 2024 19:34:23 UTC (8,202 KB)
[v2] Sun, 6 Oct 2024 10:05:06 UTC (8,176 KB)
