Computer Science > Computation and Language

arXiv:2310.15799 (cs)

[Submitted on 24 Oct 2023]

Title:DALE: Generative Data Augmentation for Low-Resource Legal NLP

Authors:Sreyan Ghosh, Chandra Kiran Evuru, Sonal Kumar, S Ramaneswaran, S Sakshi, Utkarsh Tyagi, Dinesh Manocha

View PDF

Abstract:We present DALE, a novel and effective generative Data Augmentation framework for low-resource LEgal NLP. DALE addresses the challenges existing frameworks pose in generating effective data augmentations of legal documents - legal language, with its specialized vocabulary and complex semantics, morphology, and syntax, does not benefit from data augmentations that merely rephrase the source sentence. To address this, DALE, built on an Encoder-Decoder Language Model, is pre-trained on a novel unsupervised text denoising objective based on selective masking - our masking strategy exploits the domain-specific language characteristics of templatized legal documents to mask collocated spans of text. Denoising these spans helps DALE acquire knowledge about legal concepts, principles, and language usage. Consequently, it develops the ability to generate coherent and diverse augmentations with novel contexts. Finally, DALE performs conditional generation to generate synthetic augmentations for low-resource Legal NLP tasks. We demonstrate the effectiveness of DALE on 13 datasets spanning 6 tasks and 4 low-resource settings. DALE outperforms all our baselines, including LLMs, qualitatively and quantitatively, with improvements of 1%-50%.

Comments:	Accepted to EMNLP 2023 Main Conference. Code: this https URL
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.15799 [cs.CL]
	(or arXiv:2310.15799v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.15799

Submission history

From: Sreyan Ghosh [view email]
[v1] Tue, 24 Oct 2023 12:50:28 UTC (2,107 KB)

Computer Science > Computation and Language

Title:DALE: Generative Data Augmentation for Low-Resource Legal NLP

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computation and Language

Title:DALE: Generative Data Augmentation for Low-Resource Legal NLP

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators