Abstract
Clinical notes contain valuable patient information. They are written by health care providers with varying levels of scientific training and different writing styles. Clinicians and researchers who work with extensive electronic medical records benefit from knowing which information is essential. Recognizing entities and mapping them to standard terminologies is crucial for reducing ambiguity when processing clinical notes. Although named entity recognition and entity linking are critical steps in clinical natural language processing, they can produce repetitive and low-value concepts. Moreover, not all parts of a clinical text are equally important or informative for predicting the patient's condition. As a result, extracting meaning from clinical texts requires identifying both the section in which each item is recorded and the critical concepts. In this study, we address these challenges using clinical natural language processing techniques. In addition, we verify and evaluate a set of unsupervised keyphrase extraction methods for identifying key concepts. Because most clinical concepts are multi-word expressions, and identifying them accurately normally requires the user to specify an n-gram range, we propose a shortcut method based on TF-IDF (Term Frequency-Inverse Document Frequency) that preserves the structure of each term. For evaluation, we design two types of downstream tasks (multiple and binary classification) using transformer-based models. The results show the superiority of the proposed method in combination with the SciBERT model. They also offer insight into the efficacy of general-purpose methods for extracting keyphrases from clinical notes.
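To make the n-gram point concrete, the following is a minimal sketch, not the authors' implementation. It assumes each note has already been reduced to a list of linked concept strings and scores each concept as a single unit, so a multi-word term keeps its structure without any n-gram range. The function name `tfidf_rank`, the example concept lists, and the smoothed IDF formula are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_rank(docs):
    """Rank the concepts of each document by TF-IDF.

    docs: list of documents, each a list of concept strings. A multi-word
    concept such as "heart failure" is one item, so it is never split into
    separate words and no n-gram range has to be chosen.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each concept.
    df = Counter(concept for doc in docs for concept in set(doc))
    ranked = []
    for doc in docs:
        tf = Counter(doc)
        # Relative term frequency times a smoothed IDF (an assumed variant).
        scores = {
            concept: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + df[concept])) + 1)
            for concept, count in tf.items()
        }
        ranked.append(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
    return ranked


# Hypothetical concept lists, as an entity linker might emit them.
notes = [
    ["heart failure", "shortness of breath", "heart failure", "furosemide"],
    ["pneumonia", "shortness of breath", "fever"],
    ["heart failure", "atrial fibrillation", "warfarin"],
]
ranked = tfidf_rank(notes)
top_concept = ranked[0][0][0]
```

Here `ranked[0]` lists the concepts of the first note from most to least characteristic. Treating each concept as an atomic token is one plausible reading of "preserving the structure of the term"; the published method may differ in detail.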
Data availability
The datasets generated during and/or analyzed during the current study are available at https://physionet.org/content/mimiciii/1.4/.
Code availability
The code is available at https://github.com/HodaMemar/A3.
Notes
ICD: International Statistical Classification of Diseases and Related Health Problems.
RxNorm: the normalized naming system for generic and branded drugs.
LOINC: Logical Observation Identifiers Names and Codes.
Ethics declarations
Conflict of interest
The authors have no conflict of interest to declare.
About this article
Cite this article
Memarzadeh, H., Ghadiri, N., Samwald, M. et al. Applying unsupervised keyphrase methods on concepts extracted from discharge sheets. Pattern Anal Applic 26, 1715–1727 (2023). https://doi.org/10.1007/s10044-023-01198-0
DOI: https://doi.org/10.1007/s10044-023-01198-0