
Zero-Shot Generative Large Language Models for Systematic Review Screening Automation

  • Conference paper
  • Advances in Information Retrieval (ECIR 2024)

Abstract

Systematic reviews are crucial for evidence-based medicine as they comprehensively analyse published research findings on specific questions. Conducting such reviews is often resource- and time-intensive, especially in the screening phase, where abstracts of publications are assessed for inclusion in a review. This study investigates the effectiveness of using zero-shot large language models (LLMs) for automatic screening. We evaluate the effectiveness of eight different LLMs and investigate a calibration technique that uses a predefined recall threshold to determine whether a publication should be included in a systematic review. Our comprehensive evaluation on five standard test collections shows that instruction fine-tuning plays an important role in screening, that calibration renders LLMs practical for achieving a targeted recall, and that combining both with an ensemble of zero-shot models saves significant screening time compared to state-of-the-art approaches.
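The calibration idea mentioned in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the per-document relevance scores, and the labelled calibration set are all assumptions.

```python
def calibrate_threshold(scores, labels, target_recall):
    """Pick the largest score cutoff whose recall on a labelled
    calibration set still meets the target recall."""
    # Walk candidates from highest to lowest score, lowering the
    # cutoff until enough relevant documents are captured.
    ranked = sorted(zip(scores, labels), reverse=True)
    total_relevant = sum(labels)
    found = 0
    threshold = float("-inf")
    for score, label in ranked:
        found += label
        threshold = score
        if found / total_relevant >= target_recall:
            break
    return threshold

# Hypothetical model scores and gold labels for a calibration set.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 1, 0]  # 1 = relevant to the review
cutoff = calibrate_threshold(scores, labels, target_recall=0.95)
# A new publication is then screened in if its score >= cutoff.
```

With the toy numbers above, capturing at least 95% of the relevant documents forces the cutoff down to the third relevant document's score, 0.5; in practice the threshold would be chosen per topic or per collection on held-out labelled data.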


Notes

  1. Other commonly used terms are ‘studies’, ‘research publications’, and ‘references’.

  2. https://chat.openai.com/.

  3. Although the dataset is described as public, it currently only contains the DOIs of the systematic review topics but not the labels, making reproduction difficult.

  4. https://platform.openai.com/docs/models/gpt-3-5.

  5. Note that, for consistency within the paper, we name all instruction-tuned models with the suffix -ins. The original names are: Alpaca-7b-ins: alpaca; Guanaco-7b-ins: guanaco-7b; Falcon-7b-ins: falcon-7b-instruct; LlaMa2-7b-ins: LlaMa2-7b-chat; LlaMa2-13b-ins: LlaMa2-13b-chat.

  6. We removed topic 18 as no relevant document existed in the candidate document list (the topic contains only one relevant document).

  7. In the uncalibrated setting for BioBERT, we established a decision threshold of 0.5 to determine the inclusion of a document in a review topic. Specifically, a document is included if the BioBERT output satisfies \(output \ge 0.5\); otherwise, it is excluded.

  8. Comparison is, however, not straightforward, as Bio-SIEVE used most of the datasets we consider here for fine-tuning; we therefore evaluate effectiveness using only the 27 topics from CLEF-TAR that were not used to fine-tune Bio-SIEVE.

  9. https://github.com/ielab/ECIR-2024-llm-screening.
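The uncalibrated baseline described in note 7 amounts to a fixed probability cutoff. A minimal sketch, assuming the classifier yields a single inclusion probability per document (the function name is ours):

```python
def include_uncalibrated(prob: float, threshold: float = 0.5) -> bool:
    # Include the document whenever the classifier's output
    # probability meets the fixed 0.5 cutoff; exclude otherwise.
    return prob >= threshold

decisions = [include_uncalibrated(p) for p in (0.73, 0.5, 0.49)]
# The cutoff is inclusive, so a score of exactly 0.5 is screened in.
```

Unlike the calibrated setting, this rule ignores the target recall entirely, which is why the paper contrasts it with threshold calibration.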



Author information

Corresponding author: Shuai Wang.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, S., Scells, H., Zhuang, S., Potthast, M., Koopman, B., Zuccon, G. (2024). Zero-Shot Generative Large Language Models for Systematic Review Screening Automation. In: Goharian, N., et al. Advances in Information Retrieval. ECIR 2024. Lecture Notes in Computer Science, vol 14608. Springer, Cham. https://doi.org/10.1007/978-3-031-56027-9_25


  • DOI: https://doi.org/10.1007/978-3-031-56027-9_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-56026-2

  • Online ISBN: 978-3-031-56027-9

  • eBook Packages: Computer Science; Computer Science (R0)
