Abstract
Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of zero-shot large language models (LLMs) for automatic screening. We evaluate eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation on five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
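The idea of calibrating an inclusion decision against a predefined recall target can be illustrated with a minimal sketch. It assumes that each candidate publication receives a scalar relevance score from a model and that labelled calibration data are available; the function names `calibrate_threshold`, `screen`, and `recall` and the toy scores are hypothetical and are not taken from the paper, whose exact procedure may differ.

```python
from math import ceil
from typing import List, Sequence


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int],
                        target_recall: float) -> float:
    """Return the largest score threshold that still reaches target_recall
    on the labelled calibration data (hypothetical helper)."""
    # Scores of the relevant (label == 1) documents, highest first.
    relevant = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not relevant:
        raise ValueError("calibration data contains no relevant documents")
    # Number of relevant documents that must score at or above the threshold.
    needed = max(1, ceil(target_recall * len(relevant)))
    return relevant[needed - 1]


def screen(scores: Sequence[float], threshold: float) -> List[bool]:
    """Include a document when its score is at or above the threshold."""
    return [s >= threshold for s in scores]


def recall(decisions: Sequence[bool], labels: Sequence[int]) -> float:
    """Fraction of relevant documents that were included."""
    found = sum(1 for d, y in zip(decisions, labels) if d and y == 1)
    return found / sum(labels)


if __name__ == "__main__":
    # Toy calibration data (assumed values, for illustration only).
    scores = [0.9, 0.8, 0.4, 0.35, 0.2, 0.1]
    labels = [1, 0, 1, 1, 0, 0]

    fixed = screen(scores, 0.5)  # fixed cut-off, no calibration
    theta = calibrate_threshold(scores, labels, target_recall=0.95)
    calibrated = screen(scores, theta)

    print("fixed 0.5 recall:", recall(fixed, labels))
    print("calibrated threshold:", theta, "recall:", recall(calibrated, labels))
```

In this sketch the threshold is fitted on labelled data and then reused for unseen documents, so the targeted recall is only approximately met in practice; a fixed cut-off offers no such control over recall.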
Notes
- 1.
Other commonly used terms are ‘studies’, ‘research publications’, and ‘references’.
- 2.
- 3.
Although the dataset is described as public, it currently contains only the DOIs of the systematic review topics but not the labels, making reproduction difficult.
- 4.
- 5.
For consistency throughout the paper, we name all instruction-tuned models with the suffix -ins. The original names are: Alpaca-7b-ins: alpaca; Guanaco-7b-ins: guanaco-7b; Falcon-7b-ins: falcon-7b-instruct; LlaMa2-7b-ins: LlaMa2-7b-chat; LlaMa2-13b-ins: LlaMa2-13b-chat.
- 6.
We removed topic 18 as no relevant document existed in the candidate document list (the topic only contains one relevant document).
- 7.
In the uncalibrated setting for BioBERT, we use a decision threshold of 0.5 to determine the inclusion of a document in a review topic. Specifically, a document is included if the BioBERT output satisfies the condition \(\text{output} \ge 0.5\); otherwise, it is excluded.
- 8.
Comparison is, however, not straightforward, as Bio-SIEVE used most of the datasets we consider here for fine-tuning; we therefore evaluate effectiveness using only the 27 topics from CLEF-TAR that were not used to fine-tune Bio-SIEVE.
- 9.
References
Abualsaud, M., Ghelani, N., Zhang, H., Smucker, M.D., Cormack, G.V., Grossman, M.R.: A system for efficient high-recall retrieval. In: Proceedings of the 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1317–1320 (2018)
Alharbi, A., Briggs, W., Stevenson, M.: Retrieving and ranking studies for systematic reviews: University of Sheffield’s approach to CLEF eHealth 2018 Task 2. In: CEUR Workshop Proceedings: Working Notes of CLEF 2018: Conference and Labs of the Evaluation Forum. vol. 2125 (2018)
Alharbi, A., Stevenson, M.: Ranking abstracts to identify relevant evidence for systematic reviews: the University of Sheffield’s approach to CLEF eHealth 2017 Task 2. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Alshami, A., Elsayed, M., Ali, E., Eltoukhy, A.E., Zayed, T.: Harnessing the power of ChatGPT for automating systematic review process: methodology, case study, limitations, and future directions. Systems 11(7), 351 (2023)
Anagnostou, A., Lagopoulos, A., Tsoumakas, G., Vlahavas, I.P.: Combining inter-review learning-to-rank and intra-review incremental training for title and abstract screening in systematic reviews. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Aum, S., Choe, S.: srBERT: automatic article classification model for systematic review using BERT. Syst. Rev. 10(1), 1–8 (2021)
Bramer, W.M., Rethlefsen, M.L., Kleijnen, J., Franco, O.H.: Optimal database combinations for literature searches in systematic reviews: a prospective exploratory study. Syst. Rev. 6, 1–12 (2017)
Callaghan, M.W., Müller-Hansen, F.: Statistical stopping criteria for automated screening in systematic reviews. Syst. Rev. 9(1), 1–14 (2020)
Carvallo, A., Parra, D., Lobel, H., Soto, A.: Automatic document screening of medical literature using word and text embeddings in an active learning setting. Scientometrics 125, 3047–3084 (2020)
Carvallo, A., Parra, D., Rada, G., Perez, D., Vasquez, J.I., Vergara, C.: Neural language models for text classification in evidence-based medicine. arXiv preprint arXiv:2012.00584 (2020)
Chandler, J., Cumpston, M., Li, T., Page, M.J., Welch, V.A.: Cochrane Handbook for Systematic Reviews of Interventions. John Wiley & Sons (2019)
Chen, J., et al.: ECNU at 2017 eHealth task 2: technologically assisted reviews in empirical medicine. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Chiang, W.L., et al.: Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (Accessed 14 April 2023) (2023)
Clark, J.: Systematic reviewing: introduction, locating studies and data abstraction. In: Doi, S.A.R., Williams, G.M. (eds.) Methods of Clinical Epidemiology, pp. 187–211. Springer Berlin Heidelberg, Berlin, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37131-8_12
Cohen, A.M., Ambert, K., McDonagh, M.: A prospective evaluation of an automated classification system to support evidence-based medicine and systematic review. In: AMIA annual symposium proceedings. vol. 2010, p. 121. American Medical Informatics Association (2010)
Cohen, A., Hersh, W., Peterson, K., Yen, P.: Reducing workload in systematic review preparation using automated citation classification. J. Am. Med. Inform. Assoc. 13(2), 206–219 (2006)
Cochrane Collaboration: The Cochrane Library. Database available on disk and CDROM. Oxford, UK, Update Software (2002)
Crumley, E.T., Wiebe, N., Cramer, K., Klassen, T.P., Hartling, L.: Which resources should be used to identify RCT/CCTs for systematic reviews: a systematic review. BMC Med. Res. Methodol. 5, 1–13 (2005)
Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314 (2023)
Di Nunzio, G.M., Beghini, F., Vezzani, F., Henrot, G.: An interactive two-dimensional approach to query aspects rewriting in systematic reviews. IMS unipd at CLEF eHealth task 2. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Di Nunzio, G.M., Ciuffreda, G., Vezzani, F.: Interactive sampling for systematic reviews. IMS unipd at CLEF 2018 eHealth task 2. In: CEUR Workshop Proceedings: Working Notes of CLEF 2018: Conference and Labs of the Evaluation Forum (2018)
Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: CLEF 2017 technologically assisted reviews in empirical medicine overview. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Kanoulas, E., Li, D., Azzopardi, L., Spijker, R.: CLEF 2019 technology assisted reviews in empirical medicine overview. In: CEUR Workshop Proceedings: Working Notes of CLEF 2019: Conference and Labs of the Evaluation Forum. vol. 2380 (2019)
Kanoulas, E., Spijker, R., Li, D., Azzopardi, L.: CLEF 2018 technology assisted reviews in empirical medicine overview. In: CEUR Workshop Proceedings: Working Notes of CLEF 2018: Conference and Labs of the Evaluation Forum (2018)
Köpf, A., Kilcher, Y., et al.: OpenAssistant conversations: democratizing large language model alignment. arXiv preprint arXiv:2304.07327 (2023)
Kozorovitsky, A.K., Kurland, O.: From "identical" to "similar": fusing retrieved lists based on inter-document similarities. J. Artif. Intell. Res. 41, 267–296 (2011)
Lagopoulos, A., Anagnostou, A., Minas, A., Tsoumakas, G.: Learning-to-rank and relevance feedback for literature appraisal in empirical medicine. In: Bellot, P., et al. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings, pp. 52–63. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-98932-7_5
Lee, G.E., Sun, A.: Seed-driven document ranking for systematic reviews in evidence-based medicine. In: Proceedings of the 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 455–464 (2018)
Lee, J., et al.: BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (2020)
Lu, Y., Bartolo, M., Moore, A., Riedel, S., Stenetorp, P.: Fantastically ordered prompts and where to find them: overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786 (2021)
Minas, A., Lagopoulos, A., Tsoumakas, G.: Aristotle university’s approach to the technologically assisted reviews in empirical medicine task of the 2018 CLEF eHealth lab. In: CEUR Workshop Proceedings: Working Notes of CLEF 2018: Conference and Labs of the Evaluation Forum (2018)
Miwa, M., Thomas, J., O’Mara-Eves, A., Ananiadou, S.: Reducing systematic review workload through certainty-based screening. J. Biomed. Inform. 51, 242–253 (2014)
Norman, C.R., Leeflang, M.M., Porcher, R., Névéol, A.: Measuring the impact of screening automation on meta-analyses of diagnostic test accuracy. Syst. Rev. 8(1), 243 (2019)
Penedo, G., et al.: The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116 (2023)
Robinson, A., et al.: Bio-SIEVE: exploring instruction tuning large language models for systematic review automation. arXiv preprint arXiv:2308.06610 (2023)
Scells, H., Zuccon, G.: You can teach an old dog new tricks: rank fusion applied to coordination level matching for ranking in systematic reviews. In: Proceedings of the 42nd European Conference on Information Retrieval, pp. 399–414 (2020)
Scells, H., Zuccon, G., Deacon, A., Koopman, B.: QUT ielab at CLEF eHealth 2017 technology assisted reviews track: initial experiments with learning to rank. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Scells, H., Zuccon, G., Koopman, B.: Automatic boolean query refinement for systematic review literature search. In: Proceedings of the 28th World Wide Web Conference, pp. 1646–1656 (2019)
Scells, H., Zuccon, G., Koopman, B.: A comparison of automatic boolean query formulation for systematic reviews. Information Retrieval Journal, pp. 1–26 (2020)
Scells, H., Zuccon, G., Koopman, B.: A computational approach for objectively derived systematic review search strategies. In: Proceedings of the 42nd European Conference on Information Retrieval, pp. 385–398 (2020)
Scells, H., Zuccon, G., Koopman, B., Clark, J.: Automatic boolean query formulation for systematic review literature search. In: Proceedings of the 29th World Wide Web Conference, pp. 1071–1081 (2020)
Singh, J., Thomas, L.: IIIT-H at CLEF eHealth 2017 task 2: Technologically assisted reviews in empirical medicine. In: CEUR Workshop Proceedings: Working Notes of CLEF 2017: Conference and Labs of the Evaluation Forum (2017)
Syriani, E., David, I., Kumar, G.: Assessing the ability of ChatGPT to screen articles for systematic reviews. arXiv preprint arXiv:2307.06464 (2023)
Taori, R., et al.: Stanford Alpaca: an instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca (2023)
Thomas, J., Harden, A.: Methods for the thematic synthesis of qualitative research in systematic reviews. BMC Med. Res. Methodol. 8(1), 45 (2008)
Touvron, H., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Touvron, H., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
Wallace, B.C., Small, K., Brodley, C.E., Lau, J., Trikalinos, T.A.: Deploying an interactive machine learning system in an evidence-based practice center: Abstrackr. In: Proceedings of the 2nd ACM International Health Informatics Symposium, pp. 819–824 (2012)
Wallace, B.C., Trikalinos, T.A., Lau, J., Brodley, C., Schmid, C.H.: Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinform. 11(1), 55 (2010)
Wang, S., Li, H., Scells, H., Locke, D., Zuccon, G.: MeSH term suggestion for systematic review literature search. In: Proceedings of the 25th Australasian Document Computing Symposium, pp. 1–8 (2021)
Wang, S., Li, H., Zuccon, G.: MeSH suggester: a library and system for MeSH term suggestion for systematic review boolean query construction. In: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, pp. 1176–1179 (2023)
Wang, S., Scells, H., Clark, J., Koopman, B., Zuccon, G.: From little things big things grow: A collection with seed studies for medical systematic review literature search. In: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3176–3186 (2022)
Wang, S., Scells, H., Koopman, B., Zuccon, G.: Automated MeSH term suggestion for effective query formulation in systematic reviews literature search. Intell. Syst. Appl. 200141 (2022)
Wang, S., Scells, H., Koopman, B., Zuccon, G.: Neural rankers for effective screening prioritisation in medical systematic review literature search. In: Proceedings of the 26th Australasian Document Computing Symposium, pp. 1–10 (2022)
Wang, S., Scells, H., Koopman, B., Zuccon, G.: Can ChatGPT write a good boolean query for systematic review literature search? In: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1426–1436. SIGIR ’23, Association for Computing Machinery, New York, NY, USA (2023). https://doi.org/10.1145/3539618.3591703
Wang, S., Scells, H., Potthast, M., Koopman, B., Zuccon, G.: Generating natural language queries for more effective systematic review screening prioritisation. arXiv preprint arXiv:2309.05238 (2023)
Wang, Y., et al.: Self-Instruct: aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560 (2022)
White, J.: PubMed 2.0. Med. Ref. Serv. Q. 39(4), 382–387 (2020)
Wu, H., Wang, T., Chen, J., Chen, S., Hu, Q., He, L.: ECNU at 2018 eHealth task 2: technologically assisted reviews in empirical medicine. In: CEUR Workshop Proceedings: Working Notes of CLEF 2018: Conference and Labs of the Evaluation Forum (2018)
Xu, Y., et al.: QA-LoRA: quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717 (2023)
Yang, C., et al.: Large language models as optimizers. arXiv preprint arXiv:2309.03409 (2023)
Yang, E., MacAvaney, S., Lewis, D.D., Frieder, O.: Goldilocks: just-right tuning of BERT for technology-assisted review. In: Hagen, M., et al. (eds.) ECIR 2022. LNCS, vol. 13185, pp. 502–517. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-99736-6_34
Zhang, R., Wang, Y.S., Yang, Y.: Generation-driven contrastive self-training for zero-shot text classification with instruction-tuned GPT. arXiv preprint arXiv:2304.11872 (2023)
Zhao, Z., Wallace, E., Feng, S., Klein, D., Singh, S.: Calibrate before use: Improving few-shot performance of language models. In: International Conference on Machine Learning, pp. 12697–12706. PMLR (2021)
Zou, J., Li, D., Kanoulas, E.: Technology assisted reviews: finding the last few relevant documents by asking Yes/No questions to reviewers. In: Proceedings of the 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 949–952 (2018)
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, S., Scells, H., Zhuang, S., Potthast, M., Koopman, B., Zuccon, G. (2024). Zero-Shot Generative Large Language Models for Systematic Review Screening Automation. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham. https://doi.org/10.1007/978-3-031-56027-9_25
DOI: https://doi.org/10.1007/978-3-031-56027-9_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56026-2
Online ISBN: 978-3-031-56027-9
eBook Packages: Computer Science (R0)