
An integrated framework for de-identifying unstructured medical data

Published: 01 December 2009

Abstract

While there is an increasing need to share medical information for public health research, such data sharing must preserve patient privacy and must not disclose any information that can be used to identify a patient. A considerable amount of research in the data privacy community has been devoted to formalizing the notion of identifiability and developing anonymization techniques, but this work focuses exclusively on structured data. On the other hand, efforts in the medical informatics community to de-identify medical text documents rely on simple identifier removal or grouping techniques and do not take advantage of research developments in the data privacy community. This paper attempts to fill these gaps and presents a framework and prototype system for de-identifying health information that includes both structured and unstructured data. We empirically study a simple Bayesian classifier, a Bayesian classifier with a sampling-based technique, and a conditional random field based classifier for extracting identifying attributes from unstructured data. We deploy a k-anonymization based technique to de-identify the extracted data while preserving maximum data utility. We present a set of preliminary evaluations showing the effectiveness of our approach.
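To make the k-anonymity property mentioned in the abstract concrete, the following is a minimal sketch in Python. It is not the paper's algorithm: it only checks whether a table satisfies k-anonymity over a chosen set of quasi-identifiers and applies one simple generalization step. The field names, the sample records, and the helper names `is_k_anonymous` and `generalize_age` are assumptions made for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize_age(record, width=10):
    """Replace an exact age with a range such as '30-39' (illustrative generalization)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


# Hypothetical attributes already extracted from medical text.
records = [
    {"age": 34, "zip": "30322", "diagnosis": "flu"},
    {"age": 37, "zip": "30322", "diagnosis": "asthma"},
    {"age": 52, "zip": "30329", "diagnosis": "flu"},
    {"age": 58, "zip": "30329", "diagnosis": "diabetes"},
]

quasi_ids = ["age", "zip"]
raw_ok = is_k_anonymous(records, quasi_ids, k=2)

# Coarsen ages into ten-year ranges, then check again.
generalized = [generalize_age(r) for r in records]
generalized_ok = is_k_anonymous(generalized, quasi_ids, k=2)
```

In this toy table every exact age is unique, so the raw records are not 2-anonymous. After ages are coarsened into ten-year ranges, each (age range, zip) combination is shared by two records and the check passes. Wider ranges make the property easier to satisfy but retain less detail, which is the privacy and utility trade-off the abstract refers to.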



Published In

Data & Knowledge Engineering, Volume 68, Issue 12
December, 2009
187 pages

Publisher

Elsevier Science Publishers B. V.

Netherlands

Author Tags

  1. Anonymization
  2. Conditional random fields
  3. Cost-proportionate sampling
  4. Data linkage
  5. Medical text
  6. Named entity recognition

Qualifiers

  • Article
