Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Sep 14, 2023
Date Accepted: Jun 25, 2024
Large Language Models Demonstrate Human-Comparable Sensitivity in Identifying Eligible Studies Through Title and Abstract Screening: A Three-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews
ABSTRACT
Background:
The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they carried a critical risk of excluding relevant papers.
Objective:
We evaluated the performance of a novel screening method that uses a large language model, GPT-3.5, to streamline the initial screening process for systematic reviews, and compared its sensitivity with that of traditional human screening.
Methods:
We used two of our previously published systematic reviews on the treatment of bipolar disorder. For the first review, we developed prompts through prompt engineering, referencing the answers from its secondary screening. We then used these prompts to screen the records of the second review. We applied both the earlier (0301) and the more recent (0613) GPT-3.5 models to both record sets and compared their assessments with those of human evaluators.
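Below is a minimal sketch of how such title and abstract screening could be implemented against the OpenAI Chat Completions API. The prompt wording, the helper name screen_record, the inclusion criterion, and the decision rule are illustrative assumptions only and do not reproduce the authors' published prompts.

```python
# Minimal sketch of GPT-3.5-based title/abstract screening (assumptions labeled below).
# The prompt text and the INCLUDE/EXCLUDE decision rule are hypothetical, not the
# authors' actual prompts from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCREENING_PROMPT = (
    "You are screening records for a systematic review on the treatment of "
    "bipolar disorder. Based on the title and abstract below, answer with "
    "only 'INCLUDE' or 'EXCLUDE'."
)

def screen_record(title: str, abstract: str, model: str = "gpt-3.5-turbo-0613") -> bool:
    """Return True if the model judges the record eligible for full-text review."""
    response = client.chat.completions.create(
        model=model,          # use "gpt-3.5-turbo-0301" for the earlier snapshot
        temperature=0,        # keep judgments as deterministic as possible
        messages=[
            {"role": "system", "content": SCREENING_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return "INCLUDE" in response.choices[0].message.content.upper()
```

In practice, each exported record (title plus abstract) would be passed through screen_record, and the model's INCLUDE/EXCLUDE judgments would then be compared against the human evaluators' decisions to compute sensitivity.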
Results:
The 0301 and 0613 models screened 20 and 50 records per minute, respectively, and neither missed any record included in the meta-analyses. In the first study, the sensitivities of the 0301 and 0613 models were 86.7% and 77.4%, respectively; in the second study, they were 97.9% and 81.3%, respectively. These values were comparable to those of the human evaluators: 86.7% to 100% for the first study and 77.6% to 97.9% for the second study.
Conclusions:
Our GPT-3.5-based screening method substantially outperformed previous machine learning-based methods and achieved sensitivity close to that of human screening. Future studies should aim to generalize this methodology and validate its utility in a variety of medical and nonmedical settings.
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.