
Explanatory argument extraction of correct answers in resident medical exams

Published: 01 November 2024

Abstract

Developing technology to assist medical experts in their everyday decision-making is currently a hot topic in the field of Artificial Intelligence (AI). This is especially true within the framework of Evidence-Based Medicine (EBM), where the aim is to facilitate the extraction of relevant information using natural language as a tool for mediating human–AI interaction. In this context, AI techniques can be beneficial in finding arguments for past decisions in evolution notes or patient journeys, especially when different doctors are involved in a patient’s care. These documents report the decision-making process that led to the patient’s treatment. Thus, applying Natural Language Processing (NLP) techniques has the potential to assist doctors in extracting arguments for a more comprehensive understanding of the decisions made. This work focuses on the explanatory argument identification step by setting up the task in a Question Answering (QA) scenario in which clinicians pose questions to the AI model to help them identify those arguments. In order to explore the capabilities of current AI-based language models, we present a new dataset which, unlike previous work: (i) includes not only explanatory arguments for the correct hypothesis, but also arguments to reason about the incorrectness of the other hypotheses; and (ii) contains explanations originally written in Spanish by doctors reasoning over cases from the Spanish Residency Medical Exams. Furthermore, this new benchmark allows us to set up a novel extractive task: identifying the explanation, written by medical doctors, that supports the correct answer within an argumentative text. An additional benefit of our approach lies in its ability to evaluate the extractive performance of language models using automatic metrics, which on the Antidote CasiMedicos dataset corresponds to a 74.47 F1 score. Comprehensive experimentation shows that our novel dataset and approach are effective in helping practitioners identify relevant evidence-based explanations for medical questions.
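
To make the extractive setup concrete, the following minimal Python sketch (not the authors' code) shows how an off-the-shelf extractive QA model could be pointed at a doctor-written commentary to recover the span that argues for the correct answer, and how a token-overlap F1 of the kind routinely used for extractive QA could score the prediction against the gold span. The model checkpoint, clinical commentary, question wording, and gold span below are illustrative assumptions, not material from the paper.

    from collections import Counter
    from transformers import pipeline

    # Illustrative multilingual extractive QA checkpoint (assumed available);
    # any comparable span-extraction model could be substituted here.
    qa = pipeline("question-answering", model="deepset/xlm-roberta-base-squad2")

    # Hypothetical doctor-written commentary (translated excerpt), standing in
    # for an Antidote CasiMedicos-style explanation of a multiple-choice case.
    context = (
        "The presentation is typical of acute pericarditis; diffuse ST elevation "
        "with PR depression supports this diagnosis, whereas a localized "
        "infarction would instead show reciprocal changes."
    )
    question = "Which part of the explanation supports the correct answer?"
    gold_span = "diffuse ST elevation with PR depression supports this diagnosis"

    predicted_span = qa(question=question, context=context)["answer"]

    def token_f1(pred: str, gold: str) -> float:
        """Token-overlap F1 between a predicted and a gold explanation span."""
        pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
        overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred_tokens)
        recall = overlap / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    print(predicted_span, round(token_f1(predicted_span, gold_span), 2))

Because both the prediction and the gold annotation are spans of the same argumentative text, overlap metrics of this kind can be computed without human judges, which is what enables the fully automatic evaluation described above.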

Highlights

A novel extractive task to identify the explanations of the correct answer in commented medical exams.
The first dataset for Medical QA in a language other than English.
Promising results in identifying relevant evidence-based explanations for medical questions.




Published In

Artificial Intelligence in Medicine, Volume 157, Issue C
Nov 2024
404 pages

Publisher

Elsevier Science Publishers Ltd.

United Kingdom


Author Tags

  1. Explainable artificial intelligence
  2. Argumentation
  3. Question answering
  4. Resident medical exams
  5. Natural language processing

Qualifiers

  • Research-article


