DOI: 10.1145/2808719.2808752
research-article

A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository

Published: 09 September 2015

Abstract

Clinical natural language processing (NLP) has become indispensable in the secondary use of electronic medical records (EMRs). However, current clinical NLP tools port poorly across institutions. An ideal solution to this problem is cross-institutional data sharing, but legal prohibitions on disclosing protected health information (PHI) obstruct this practice, even when state-of-the-art de-identification tools are available. In this paper, we investigate a frequency-filtering approach to extracting PHI-free sentences from the Enterprise Data Trust (EDT), a large collection of EMRs at Mayo Clinic. Our approach rests on the assumption that sentences that appear frequently tend to contain no PHI. This assumption originates from the observation that the EDT contains a large number of redundant descriptions of similar patient conditions. Both manual and automatic evaluations of the set of sentences with frequency higher than one found no PHI. These promising results demonstrate the potential of sharing highly frequent sentences among institutions.
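
The frequency-filtering idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the regex sentence splitter, the lowercase/whitespace normalization, the per-occurrence counting, and the names `split_sentences`, `normalize`, and `frequent_sentences` are all assumptions made for illustration. Only the threshold (frequency higher than one) comes from the abstract.

```python
import re
from collections import Counter


def split_sentences(note: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    # It will mis-split abbreviations such as "Dr."; a real pipeline would
    # use a proper sentence detector.
    parts = re.split(r"(?<=[.!?])\s+", note.strip())
    return [p for p in parts if p]


def normalize(sentence: str) -> str:
    # Hypothetical normalization: lowercase and collapse runs of whitespace,
    # so trivially different copies of a sentence count as the same sentence.
    return " ".join(sentence.lower().split())


def frequent_sentences(notes: list[str], min_count: int = 2) -> dict[str, int]:
    # Count every occurrence of each normalized sentence across all notes,
    # then keep only sentences seen at least `min_count` times.
    # min_count=2 corresponds to "frequency higher than one".
    counts: Counter[str] = Counter()
    for note in notes:
        for sentence in split_sentences(note):
            counts[normalize(sentence)] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    # Toy notes with made-up names; no real patient data.
    notes = [
        "No acute distress. John Doe was seen today.",
        "No acute distress. Lungs are clear.",
        "Lungs are clear. Jane Roe reports a cough.",
    ]
    for sentence, count in frequent_sentences(notes).items():
        print(count, sentence)
```

In this toy input, the sentences that mention a name occur only once and are filtered out, while the generic clinical statements are repeated and retained. A frequency threshold alone guarantees nothing; the abstract reports that the retained set was still checked by both manual and automatic evaluation.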



Published In

BCB '15: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics
September 2015
683 pages
ISBN: 9781450338530
DOI: 10.1145/2808719
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. EMR
  2. PHI-free
  3. cross-institutional data-sharing
  4. frequency-filtering strategy
  5. protected health information
  6. sentence frequency
  7. word bigram

Qualifiers

  • Research-article

Funding Sources

  • NIH

Conference

BCB '15

Acceptance Rates

BCB '15 paper acceptance rate: 48 of 141 submissions (34%)
Overall acceptance rate: 254 of 885 submissions (29%)

Article Metrics

  • Downloads (last 12 months): 6
  • Downloads (last 6 weeks): 0
Reflects downloads up to 11 Jan 2025

Cited By

  • (2034) Title Pending 320. Journal of the Society for Clinical Data Management. DOI: 10.47912/jscdm.320. Online publication date: 2-Jul-2034
  • (2021) Review: Privacy-Preservation in the Context of Natural Language Processing. IEEE Access, vol. 9, pp. 147600-147612. DOI: 10.1109/ACCESS.2021.3124163. Online publication date: 2021
  • (2020) Overview of the 2019 n2c2/OHNLP Track on Clinical Semantic Textual Similarity (Preprint). JMIR Medical Informatics. DOI: 10.2196/23375. Online publication date: 10-Aug-2020
  • (2020) Identification of Semantically Similar Sentences in Clinical Notes: Iterative Intermediate Training using Multi-Task Learning (Preprint). JMIR Medical Informatics. DOI: 10.2196/22508. Online publication date: 31-Jul-2020
  • (2019) A privacy-preserving distributed filtering framework for NLP artifacts. BMC Medical Informatics and Decision Making, 19:1. DOI: 10.1186/s12911-019-0867-z. Online publication date: 7-Sep-2019
