An integrated framework for de-identifying unstructured medical data

Authors:

Li Xiong

Data & Knowledge Engineering, Volume 68, Issue 12

Pages 1441 - 1451

https://doi.org/10.1016/j.datak.2009.07.006

Published: 01 December 2009

Abstract

While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy and must not disclose any information that can be used to identify a patient. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing anonymization techniques, but it has focused exclusively on structured data. Efforts in the medical informatics community to de-identify medical text documents, on the other hand, rely on simple identifier removal or grouping techniques and do not take advantage of the research developments in the data privacy community. This paper attempts to fill these gaps and presents a framework and prototype system for de-identifying health information that includes both structured and unstructured data. We empirically study a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field based classifier for extracting identifying attributes from unstructured data. We deploy a k-anonymization based technique for de-identifying the extracted data while preserving maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.
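
The k-anonymity property that the abstract refers to can be illustrated with a minimal sketch. This is not the paper's implementation: the record fields, the toy data, and the helper names `generalize_age` and `is_k_anonymous` are all assumptions made for illustration. The sketch only checks that every combination of quasi-identifier values is shared by at least k records, and shows how coarsening one attribute can make a small table satisfy that check.

```python
from collections import Counter


def generalize_age(age, width=10):
    """Replace an exact age with a range such as '30-39' (hypothetical rule)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Invented toy records, as if the attributes had already been extracted from text.
records = [
    {"age": 34, "zip": "30322", "diagnosis": "asthma"},
    {"age": 36, "zip": "30322", "diagnosis": "diabetes"},
    {"age": 52, "zip": "30329", "diagnosis": "asthma"},
    {"age": 57, "zip": "30329", "diagnosis": "flu"},
]

# Coarsen the age attribute while leaving the other fields untouched.
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

print(is_k_anonymous(records, ["age", "zip"], k=2))      # exact ages are all unique
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # age ranges are shared by two records each
```

In this toy table the exact ages make every record unique, so the first check fails, while ten-year age ranges put each record in a group of two. Choosing how far to generalize is the trade-off between privacy and data utility that the abstract mentions.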




Published In

Data & Knowledge Engineering Volume 68, Issue 12

December, 2009

187 pages

ISSN: 0169-023X

Copyright © 2009 Elsevier B.V.

Publisher

Elsevier Science Publishers B. V.

Netherlands


Cited By

Liu L, Perez-Concha O, Nguyen A, Bennett V, Jorm L (2022). De-identifying Australian hospital discharge summaries. Journal of Biomedical Informatics 135:C. Online publication date: 1-Nov-2022.
https://dl.acm.org/doi/10.1016/j.jbi.2022.104215
Xu Q, Lu S, Jia W, Jiang C (2020). Imbalanced fault diagnosis of rotating machinery via multi-domain feature extraction and cost-sensitive learning. Journal of Intelligent Manufacturing 31:6 (1467-1481). Online publication date: 1-Aug-2020.
https://dl.acm.org/doi/10.1007/s10845-019-01522-8
Li X, Qin J (2017). Anonymizing and Sharing Medical Text Records. Information Systems Research 28:2 (332-352). Online publication date: 1-Jun-2017.
https://dl.acm.org/doi/10.1287/isre.2016.0676
Mehta B, Rao U (2016). Privacy Preserving Unstructured Big Data Analytics. Procedia Computer Science 78:C (120-124). Online publication date: 1-Mar-2016.
https://dl.acm.org/doi/10.1016/j.procs.2016.02.020
Li M, Carrell D, Aberdeen J, Hirschman L, Kirby J, Li B, Vorobeychik Y, Malin B (2016). Optimizing annotation resources for natural language de-identification via a game theoretic framework. Journal of Biomedical Informatics 61:C (97-109). Online publication date: 1-Jun-2016.
https://dl.acm.org/doi/10.1016/j.jbi.2016.03.019
Vico H, Calegari D (2015). Software Architecture for Document Anonymization. Electronic Notes in Theoretical Computer Science (ENTCS) 314:C (83-100). Online publication date: 15-Jun-2015.
https://dl.acm.org/doi/10.5555/2793733.2794041
Li D, Rastegar-Mojarad M, Elayavilli R, Wang Y, Mehrabi S, Yu Y, Sohn S, Li Y, Afzal N, Liu H (2015). A frequency-filtering strategy of obtaining PHI-free sentences from clinical data repository. Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics (315-324). Online publication date: 9-Sep-2015.
https://dl.acm.org/doi/10.1145/2808719.2808752
Ferrández O, South B, Shen S, Meystre S (2012). A hybrid stepwise approach for de-identifying person names in clinical documents. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing (65-72). Online publication date: 8-Jun-2012.
https://dl.acm.org/doi/10.5555/2391123.2391132
Bishop M, Cummins J, Peisert S, Singh A, Bhumiratana B, Agarwal D, Frincke D, Hogarth M, Keromytis A, Peisert S, Ford R, Gates C (2010). Relationships and data sanitization. Proceedings of the 2010 New Security Paradigms Workshop (151-164). Online publication date: 21-Sep-2010.
https://dl.acm.org/doi/10.1145/1900546.1900567
Gardner J, Xiong L, Wang F, Post A, Saltz J, Grandison T, Çatalyürek Ü, Luo G, Andrade H, Smalheiser N (2010). An evaluation of feature sets and sampling techniques for de-identification of medical records. Proceedings of the 1st ACM International Health Informatics Symposium (183-190). Online publication date: 11-Nov-2010.
https://dl.acm.org/doi/10.1145/1882992.1883019
