
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Sep 14, 2023
Date Accepted: Jun 25, 2024

The final, peer-reviewed published version of this preprint can be found here:

Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews

Matsui K, Utsumi T, Aoki Y, Maruki T, Takeshima M, Takaesu Y

J Med Internet Res 2024;26:e52758

DOI: 10.2196/52758

PMID: 39151163

PMCID: 11364944

Large Language Models Demonstrate Human-Comparable Sensitivity in Identifying Eligible Studies Through Title and Abstract Screening: A Three-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews

  • Kentaro Matsui
  • Tomohiro Utsumi
  • Yumi Aoki
  • Taku Maruki
  • Masahiro Takeshima
  • Yoshikazu Takaesu

ABSTRACT

Background:

The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they carried a critical risk of excluding relevant papers.

Objective:

We evaluated the performance of a novel method that uses a large language model, GPT-3.5, to streamline the initial screening process for systematic reviews, and compared its sensitivity with that of traditional human screening.

Methods:

We used two of our previous systematic review articles, both related to the treatment of bipolar disorder. For the first article, we developed prompts through prompt engineering, using the answers from its secondary screening as a reference. We then used these prompts to screen the records of the second article. We applied both the earlier (gpt-3.5-turbo-0301) and the more recent (gpt-3.5-turbo-0613) snapshots of GPT-3.5 to screen both articles and compared their assessments with those of the human evaluators.
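
As a rough illustration of how a single screening call of this kind might be implemented (a sketch under assumptions, not the authors' actual code: the function name, prompt wording, and inclusion criteria here are hypothetical), one could query the GPT-3.5 snapshots through the pre-1.0 OpenAI Python client:

    import openai  # pre-1.0 OpenAI Python client (pip install "openai<1.0")

    def screen_record(title, abstract, criteria, model="gpt-3.5-turbo-0613"):
        """Ask the model to include or exclude one record; returns True to include."""
        prompt = (
            "You are screening titles and abstracts for a systematic review.\n"
            f"Inclusion criteria: {criteria}\n\n"
            f"Title: {title}\n"
            f"Abstract: {abstract}\n\n"
            "Answer with exactly one word: INCLUDE or EXCLUDE."
        )
        response = openai.ChatCompletion.create(
            model=model,  # use "gpt-3.5-turbo-0301" for the earlier snapshot
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # deterministic output for reproducible screening
        )
        answer = response["choices"][0]["message"]["content"].strip().upper()
        return answer.startswith("INCLUDE")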

Results:

The 0301 and 0613 models screened 20 and 50 records per minute, respectively. Neither model missed any record that had been included in the meta-analysis. In the first study, the sensitivities of the 0301 and 0613 models were 86.7% and 77.4%, respectively; in the second study, they were 97.9% and 81.3%, respectively. These results were comparable to those of the human evaluators, whose sensitivities ranged from 86.7% to 100% in the first study and from 77.6% to 97.9% in the second.
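
Sensitivity here is the standard recall measure: the proportion of truly eligible records that a screener retains. A minimal helper for reference (illustrative only; the counts in the example are hypothetical, chosen to reproduce one of the reported percentages):

    def sensitivity(true_positives: int, false_negatives: int) -> float:
        # Recall: eligible records correctly included / all eligible records.
        return true_positives / (true_positives + false_negatives)

    # Hypothetical example: retaining 47 of 48 eligible records gives
    # sensitivity(47, 1) = 47/48 ≈ 0.979, i.e., 97.9%.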

Conclusions:

Our screening method using GPT-3.5 greatly outperformed previous machine learning-based methods and showed sensitivity close to human screening performance. Future studies should endeavor to generalize this methodology and validate its utility in various medical and non-medical settings.


 Citation

Please cite as:

Matsui K, Utsumi T, Aoki Y, Maruki T, Takeshima M, Takaesu Y

Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews

J Med Internet Res 2024;26:e52758

DOI: 10.2196/52758

PMID: 39151163

PMCID: 11364944


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have granted JMIR Publications an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license upon publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.