arXiv:2306.14377 (cs)

[Submitted on 26 Jun 2023]

Title:Synthetic Alone: Exploring the Dark Side of Synthetic Data for Grammatical Error Correction

Authors:Chanjun Park, Seonmin Koo, Seolhwa Lee, Jaehyung Seo, Sugyeong Eo, Hyeonseok Moon, Heuiseok Lim

Abstract: The data-centric AI approach aims to enhance model performance without modifying the model, and it has been shown to have a positive impact. Although data-centric AI based on synthetic data has recently attracted attention for its potential to improve performance, data-centric AI has long been validated exclusively with real-world data and publicly available benchmark datasets. In this respect, data-centric AI still depends heavily on real-world data, and models trained on synthetic data have not yet been thoroughly verified. Given these challenges, we ask: does data quality control (noise injection and balanced data), a data-centric AI methodology reported to have a positive impact, show the same positive impact in models trained solely on synthetic data? To address this question, we conducted comparative analyses of models trained on synthetic and real-world data for the grammatical error correction (GEC) task. Our experimental results reveal that data quality control has a positive impact on models trained with real-world data, as reported in existing studies, whereas it has a negative impact on models trained solely on synthetic data.

Comments:	Accepted for Data-centric Machine Learning Research (DMLR) Workshop at ICML 2023
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2306.14377 [cs.CL]
	(or arXiv:2306.14377v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2306.14377

Submission history

From: Chanjun Park
[v1] Mon, 26 Jun 2023 01:40:28 UTC (1,548 KB)
