arXiv:2203.15349 (cs)

[Submitted on 29 Mar 2022 (v1), last revised 1 Apr 2022 (this version, v2)]

Title: LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents

Authors: Debanjan Mahata, Navneet Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah

Abstract: Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of benchmark datasets for this task come from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (approximately 8 sentences). This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are found only beyond the limited context of the title and abstract. Therefore, we release two extensive corpora mapping KPs of ~1.3M and ~100K scientific articles with their fully extracted text and additional metadata, including publication venue, year, author, field of study, and citations, to facilitate research on this real-world problem.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2203.15349 [cs.CL]
	(or arXiv:2203.15349v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2203.15349

Submission history

From: Yaman Kumar Singla
[v1] Tue, 29 Mar 2022 08:44:57 UTC (5,283 KB)
[v2] Fri, 1 Apr 2022 08:24:39 UTC (5,283 KB)
