DOI:10.1145/3491140.3528315
short-paper

Where are the Large N Studies in Education?: Introducing a Dataset of Scientific Articles and NLP Techniques

Published: 01 June 2022

Abstract

Research, especially in Education, is hampered by limited scale, and our aim is to help shift this dynamic by tracking relevant studies from major scientific venues. The current version of our dataset considers top conferences and journals from the domain (N = 33,531). Several heuristics using advanced text searches and regular expressions, together with NLP techniques ranging from part-of-speech tagging and syntactic dependency parsing to semantic models, are employed to extract relevant studies. Our method achieves an F1 score of .633 when applied to a manually annotated subset of 1,000 articles. When applied to the entire dataset, around 10% of the articles were identified as large N studies, with a positive trend in recent years. L@S was by far the venue with the highest density of identified articles, which supports its emphasis on scale. Further filtering of the articles is required when focusing only on learning, as the articles span multiple domains and interests. Nonetheless, the order of magnitude highlights a current problem, namely that these studies are scarce, and future endeavors should emphasize the importance of scale. Our dataset, including the validation subset, and the corresponding code for data crawling, PDF processing, and large N extraction have been open-sourced to further support the initiative and stimulate new analyses.
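
For readers unfamiliar with the metric, the F1 score reported in the abstract is the harmonic mean of precision and recall. The sketch below shows how such a score could be computed from counts of true positives, false positives, and false negatives. The function name `f1_score` and the example counts are illustrative assumptions; they are not taken from the paper or its released code.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when precision and recall are both undefined or zero,
    which is a common convention (an assumption here, not the paper's).
    """
    predicted = true_pos + false_pos   # articles flagged as large N
    actual = true_pos + false_neg      # articles annotated as large N
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = f1_score(true_pos=8, false_pos=2, false_neg=2)
```

With these made-up counts, precision and recall are both 0.8, so the F1 score is also 0.8.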

Supplementary Material

MP4 File (lswp166.mp4)
This is a video presentation of the WiP paper "Where are the Large N Studies in Education? Introducing a Dataset of Scientific Articles and NLP Techniques". The first author, Dragos Corlatescu, presents the objective of the study, along with descriptions of the datasets used and the methods employed. The number of large-scale studies (i.e., studies in which the number of studied entities is equal to or greater than 1,000) is quite low. To support this statement, the authors start by presenting the details of building a corpus of articles from top venues in Education and Psychology. Next, they detail the method, which involves regular expressions and language models to detect large-scale studies. The models were validated on a manually annotated dataset of 1,000 articles, and the best model obtained an F1 score of 0.633. The findings suggest that around 10% of the total available documents are studies with a large N.
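
As a rough illustration of the regular-expression side of such a pipeline, the sketch below flags text that reports a sample size at or above the threshold of 1,000 given in the description. The pattern, the list of entity nouns, and the function names are hypothetical and are not the authors' released heuristics, which also rely on language models.

```python
import re

LARGE_N_THRESHOLD = 1000  # threshold stated in the description above

# Hypothetical pattern: matches "N = 33,531", "n=1200", or "250 students".
# The noun list is an assumption made for this example.
_NUMBER = r"(\d{1,3}(?:,\d{3})+|\d+)"
SAMPLE_SIZE = re.compile(
    r"\b[Nn]\s*=\s*" + _NUMBER
    + r"|\b" + _NUMBER + r"\s+(?:students|participants|learners|users)\b"
)


def extract_sample_sizes(text: str) -> list[int]:
    """Return every candidate sample size found in the text, in order."""
    sizes = []
    for match in SAMPLE_SIZE.finditer(text):
        raw = match.group(1) or match.group(2)
        sizes.append(int(raw.replace(",", "")))
    return sizes


def is_large_n(text: str, threshold: int = LARGE_N_THRESHOLD) -> bool:
    """True if any reported sample size meets the threshold."""
    return any(size >= threshold for size in extract_sample_sizes(text))


sizes = extract_sample_sizes("We analysed logs (N = 33,531) from 250 students.")
```

A purely lexical heuristic like this produces false positives (for example, a year followed by a matching noun), which is one reason to combine it with part-of-speech tagging or dependency parsing before counting a match.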


Cited By

  • (2023) Asking Questions about Scientific Articles—Identifying Large N Studies with LLMs. Electronics 12:19 (3996). DOI:10.3390/electronics12193996. Online publication date: 22-Sep-2023

Published In

L@S '22: Proceedings of the Ninth ACM Conference on Learning @ Scale
June 2022
491 pages
ISBN:9781450391580
DOI:10.1145/3491140
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. large n studies
  2. natural language processing techniques
  3. pdf article processing

Qualifiers

  • Short-paper

Funding Sources

  • Schmidt Futures

Conference

L@S '22: Ninth (2022) ACM Conference on Learning @ Scale
June 1 - 3, 2022
New York City, NY, USA

Acceptance Rates

Overall Acceptance Rate 117 of 440 submissions, 27%

Article Metrics

  • Downloads (Last 12 months): 12
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 19 Dec 2024
