Zero-Shot Generative Large Language Models for Systematic Review Screening Automation

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14608)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation using five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
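
The abstract does not spell out how the calibration works, so the following is only a minimal sketch of one way a predefined recall target could be turned into an inclusion decision. It assumes each publication already has a numeric score from a model (higher meaning more likely to be included) and that a small labelled set is available for calibration. The function names `calibrate_threshold` and `screen`, and the example scores, are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def calibrate_threshold(
    scores: Sequence[float], labels: Sequence[int], target_recall: float
) -> float:
    """Pick the highest score threshold that still reaches the target recall
    on a labelled calibration set (assumed setup, not the paper's exact method)."""
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    # Scores of the relevant (label == 1) publications, highest first.
    relevant = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not relevant:
        raise ValueError("need at least one relevant publication to calibrate")
    # Number of relevant publications that must be included to meet the target.
    needed = max(1, math.ceil(target_recall * len(relevant)))
    return relevant[needed - 1]


def screen(scores: Sequence[float], threshold: float) -> List[bool]:
    """Include a publication when its score is at or above the threshold."""
    return [s >= threshold for s in scores]


# Hypothetical calibration data: three relevant and two irrelevant publications.
calib_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
calib_labels = [1, 0, 1, 1, 0]

threshold = calibrate_threshold(calib_scores, calib_labels, target_recall=0.95)
decisions = screen(calib_scores, threshold)
```

In this sketch a higher recall target pushes the threshold down, so more publications are passed on for manual assessment; a lower target does the opposite and saves more screening effort at the cost of missed studies.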

Notes

1. Other commonly used terms are ‘studies’, ‘research publications’, and ‘references’.
2. https://chat.openai.com/.
3. Although the dataset is described as public, it currently contains only the DOIs of the systematic review topics and not the labels, making reproduction difficult.
4. https://platform.openai.com/docs/models/gpt-3-5.
5. For consistency within the paper, we name all instruction-tuned models with the suffix -ins. The original names are: Alpaca-7b-ins: alpaca; Guanaco-7b-ins: guanaco-7b; Falcon-7b-ins: falcon-7b-instruct; LlaMa2-7b-ins: LlaMa2-7b-chat; LlaMa2-13b-ins: LlaMa2-13b-chat.
6. We removed topic 18 because no relevant document existed in the candidate document list (the topic only contains one relevant document).
7. In the uncalibrated setting for BioBERT, we established a decision threshold of 0.5 to determine the inclusion of a document in a review topic. Specifically, a document is included if the BioBERT output satisfies the condition \(output \ge 0.5\); otherwise, it is excluded.
8. Comparison is, however, not straightforward because Bio-SIEVE used most of the datasets we consider here for fine-tuning; we therefore evaluate effectiveness using only the 27 topics from CLEF-TAR that were not used to fine-tune Bio-SIEVE.
9. https://github.com/ielab/ECIR-2024-llm-screening.

Author information

Authors and Affiliations

The University of Queensland, Brisbane, Australia
Shuai Wang & Guido Zuccon
Leipzig University, Leipzig, Germany
Harrisen Scells & Martin Potthast
CSIRO, Canberra, Australia
Shengyao Zhuang & Bevan Koopman
ScaDS.AI, Leipzig, Germany
Martin Potthast

Authors

Shuai Wang
Harrisen Scells
Shengyao Zhuang
Martin Potthast
Bevan Koopman
Guido Zuccon

Corresponding author

Correspondence to Shuai Wang.

Editor information

Editors and Affiliations

Georgetown University, Washington, DC, USA
Nazli Goharian
University of Pisa, Pisa, Italy
Nicola Tonellotto
King's College London, London, UK
Yulan He
University College London, London, UK
Aldo Lipani
University of Glasgow, Glasgow, UK
Graham McDonald
University of Glasgow, Glasgow, UK
Craig Macdonald
University of Glasgow, Glasgow, UK
Iadh Ounis

About this paper

Cite this paper

Wang, S., Scells, H., Zhuang, S., Potthast, M., Koopman, B., Zuccon, G. (2024). Zero-Shot Generative Large Language Models for Systematic Review Screening Automation. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham. https://doi.org/10.1007/978-3-031-56027-9_25

DOI: https://doi.org/10.1007/978-3-031-56027-9_25
Published: 20 March 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-56026-2
Online ISBN: 978-3-031-56027-9
eBook Packages: Computer Science, Computer Science (R0)
