
DR.BENCH: Diagnostic Reasoning Benchmark for Clinical Natural Language Processing

Published: 01 February 2023

Abstract

The meaningful use of electronic health records (EHRs) continues to progress in the digital era with clinical decision support systems augmented by artificial intelligence. A priority in improving the provider experience is to overcome information overload and reduce cognitive burden so that fewer medical errors and cognitive biases are introduced during patient care. One major type of medical error is diagnostic error, which arises from systematic or predictable errors in judgment that rely on heuristics. The potential for clinical natural language processing (cNLP) to model human diagnostic reasoning, reasoning forward from data to diagnosis, and thereby reduce cognitive burden and medical error has not been investigated. Existing tasks to advance the science in cNLP have largely focused on information extraction and named entity recognition framed as classification tasks. We introduce a novel suite of tasks, the Diagnostic Reasoning Benchmark (DR.BENCH), for developing and evaluating cNLP models with clinical diagnostic reasoning ability. The suite includes six tasks drawn from ten publicly available datasets addressing clinical text understanding, medical knowledge reasoning, and diagnosis generation. DR.BENCH is the first clinical suite of tasks designed as a natural language generation framework for evaluating pre-trained language models on diagnostic reasoning. The goal of DR.BENCH is to advance the science in cNLP to support downstream applications in computerized diagnostic decision support and to improve the efficiency and accuracy of healthcare providers during patient care. We fine-tune and evaluate state-of-the-art generative models on DR.BENCH. Experiments show that with domain-adaptation pre-training on medical knowledge, the model demonstrates opportunities for improvement when evaluated on DR.BENCH. We share DR.BENCH as a publicly available GitLab repository with a systematic approach to loading and evaluating models for the cNLP community. We also discuss the carbon footprint produced during the experiments and encourage future work on DR.BENCH to report its carbon footprint.
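Because DR.BENCH frames its tasks as natural language generation, a pre-trained sequence-to-sequence model can be loaded, prompted with clinical text, and asked to generate a diagnosis-oriented output in a single text-to-text loop, with the run's carbon footprint logged alongside. The sketch below is a minimal illustration of that framing, not the authors' released GitLab code: it assumes the Hugging Face transformers API and the codecarbon library, and the model name, task prefix, and input note are hypothetical placeholders.

```python
# Minimal sketch of a DR.BENCH-style text-to-text evaluation loop.
# Assumptions: Hugging Face transformers for the seq2seq model and
# codecarbon for emissions reporting; the model name, task prefix,
# and input note below are illustrative, not from the DR.BENCH repo.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from codecarbon import EmissionsTracker

MODEL_NAME = "t5-base"  # a biomedical/clinical T5 variant could be swapped in

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Hypothetical diagnosis-generation input: a progress-note assessment
# prefixed with a task instruction, as in a text-to-text benchmark framing.
source = (
    "summarize problems: 65 year old with fever, productive cough, "
    "and a right lower lobe infiltrate on chest radiograph."
)

tracker = EmissionsTracker()  # estimates kg CO2-eq consumed by the run
tracker.start()

inputs = tokenizer(source, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)

emissions = tracker.stop()
print(f"Generated output: {prediction}")
print(f"Estimated emissions for this run: {emissions:.6f} kg CO2-eq")
```

In practice, the same loop would be wrapped around each DR.BENCH task's dataset, with generated outputs scored against references using task-appropriate generation metrics.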



Published In

Journal of Biomedical Informatics, Volume 138, Issue C
February 2023
150 pages

Publisher

Elsevier Science

San Diego, CA, United States


Author Tags

  1. Natural language processing
  2. Clinical diagnostic reasoning
  3. Clinical diagnostic decision support
  4. Clinical natural language processing benchmark

Qualifiers

  • Research-article
