DOI: 10.1145/3469096.3469871
research-article

A novel approach on the joint de-identification of textual and relational data with a modified mondrian algorithm

Published: 16 August 2021

Abstract

Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joint, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics.
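
The partitioning idea mentioned in the abstract can be illustrated with a minimal sketch of basic Mondrian median-split partitioning over numeric attributes, followed by a simple Normalized Certainty Penalty computation. This is not the rx-anon implementation: the function names, the example records, and the `weights` argument are hypothetical stand-ins, textual attributes are not modeled, and the paper's actual tuning parameter may work differently.

```python
from typing import List, Optional, Sequence, Tuple

Record = Tuple[float, ...]


def mondrian(records: Sequence[Record], k: int,
             weights: Optional[Sequence[float]] = None) -> List[List[Record]]:
    """Split records recursively at the median of the widest attribute.

    Every returned partition holds at least k records, so generalizing each
    partition to its value ranges yields a k-anonymous table.
    `weights` is a hypothetical per-attribute bias on the split choice.
    """
    if not records:
        return []
    dims = len(records[0])
    w = list(weights) if weights is not None else [1.0] * dims
    # Rank attributes by (weighted) value range within this partition.
    # Simplification: ranges are not normalized by the global range.
    spans = []
    for d in range(dims):
        values = [r[d] for r in records]
        spans.append((w[d] * (max(values) - min(values)), d))
    for span, d in sorted(spans, reverse=True):
        if span <= 0:
            continue  # nothing to split on this attribute
        ordered = sorted(records, key=lambda r: r[d])
        median = ordered[len(ordered) // 2][d]
        left = [r for r in ordered if r[d] < median]
        right = [r for r in ordered if r[d] >= median]
        # Only accept a cut that keeps both sides k-anonymous.
        if len(left) >= k and len(right) >= k:
            return mondrian(left, k, w) + mondrian(right, k, w)
    return [list(records)]  # no allowable cut remains


def ncp(partitions: Sequence[Sequence[Record]],
        records: Sequence[Record]) -> float:
    """Average per-record, per-attribute range loss, between 0 and 1."""
    dims = len(records[0])
    total = 0.0
    for d in range(dims):
        values = [r[d] for r in records]
        full = max(values) - min(values)
        if full == 0:
            continue  # attribute is constant, no information to lose
        for part in partitions:
            col = [r[d] for r in part]
            total += len(part) * (max(col) - min(col)) / full
    return total / (dims * len(records))


# Hypothetical (age, salary) records, for illustration only.
records = [(25, 30000), (31, 31000), (27, 32000),
           (47, 58000), (45, 60000), (52, 65000)]
partitions = mondrian(records, k=3)
loss = ncp(partitions, records)
```

A lower `loss` means the generalized ranges stay closer to the original values; changing `weights` shifts which attribute is split first and therefore where the information loss lands.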

Supplementary Material

PDF File (a14-singhofer-supp.pdf)
Supplemental material.


Cited By

  • (2023) Memorization of Named Entities in Fine-Tuned BERT Models. Machine Learning and Knowledge Extraction, 258-279. DOI: 10.1007/978-3-031-40837-3_16. Online publication date: 22-Aug-2023

      Published In

      DocEng '21: Proceedings of the 21st ACM Symposium on Document Engineering
      August 2021
      178 pages
      ISBN:9781450385961
      DOI:10.1145/3469096
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Badges

      • Best Paper

      Author Tags

      1. data anonymization
      2. heterogeneous data
      3. k-anonymity

      Qualifiers

      • Research-article

      Conference

      DocEng '21
      DocEng '21: ACM Symposium on Document Engineering 2021
      August 24 - 27, 2021
      Limerick, Ireland

      Acceptance Rates

      Overall Acceptance Rate 194 of 564 submissions, 34%
