research-article

A novel approach on the joint de-identification of textual and relational data with a modified mondrian algorithm

Authors:

A. Garifullina,

A. Scherp

DocEng '21: Proceedings of the 21st ACM Symposium on Document Engineering

Article No.: 14, Pages 1 - 10

https://doi.org/10.1145/3469096.3469871

Published: 16 August 2021

Abstract

Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joint, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. We evaluate our approach with two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning while guaranteeing k-anonymity. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics.
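The idea of steering Mondrian partitioning with a tuning parameter and scoring the result with a Normalized Certainty Penalty can be illustrated with a minimal sketch. It is not the rx-anon implementation: the abstract does not define the parameter, so the `weight` argument, the function names, and the use of purely numeric attributes are assumptions made here for illustration. In particular, the "textual" attribute `n_terms` is a numeric stand-in, whereas the paper works with sensitive terms extracted from text. The split is the relaxed (count-based) Mondrian variant, and the input is assumed to contain at least `k` records.

```python
from typing import Dict, List, Sequence

Record = Dict[str, float]


def spans(part: Sequence[Record], attrs: Sequence[str]) -> Dict[str, float]:
    """Width of the value range of each attribute within one partition."""
    return {a: max(r[a] for r in part) - min(r[a] for r in part) for a in attrs}


def mondrian(records: Sequence[Record], relational: Sequence[str],
             textual: Sequence[str], k: int, weight: float = 0.5) -> List[List[Record]]:
    """Relaxed Mondrian partitioning; every partition keeps at least k records.

    `weight` is a hypothetical knob in [0, 1] (an assumption of this sketch):
    1.0 splits only on relational attributes, 0.0 only on textual ones.
    """
    attrs = list(relational) + list(textual)
    full = spans(records, attrs)

    def score(part: Sequence[Record], a: str) -> float:
        if full[a] == 0:
            return 0.0
        share = weight if a in relational else 1.0 - weight
        # Normalized span of the attribute in this partition, scaled by the knob.
        return share * spans(part, [a])[a] / full[a]

    def split(part: List[Record]) -> List[List[Record]]:
        # Try attributes from the highest weighted span to the lowest.
        for a in sorted(attrs, key=lambda x: score(part, x), reverse=True):
            if score(part, a) <= 0:
                break  # nothing left that is worth splitting on
            ordered = sorted(part, key=lambda r: r[a])
            mid = len(ordered) // 2
            left, right = ordered[:mid], ordered[mid:]
            if len(left) >= k and len(right) >= k:
                return split(left) + split(right)
        return [part]  # no allowable cut: this partition is final

    return split(list(records))


def ncp(partitions: Sequence[Sequence[Record]], attrs: Sequence[str]) -> float:
    """Normalized Certainty Penalty for numeric attributes, in [0, 1]."""
    records = [r for p in partitions for r in p]
    full = spans(records, attrs)
    loss = 0.0
    for p in partitions:
        local = spans(p, attrs)
        loss += len(p) * sum(local[a] / full[a] for a in attrs if full[a] > 0)
    return loss / (len(records) * len(attrs))


# Toy data: `age` plays the relational attribute, `n_terms` the textual stand-in.
ages = [20, 22, 24, 26, 60, 62, 64, 66]
terms = [1, 8, 2, 7, 3, 6, 4, 5]
records = [{"age": a, "n_terms": t} for a, t in zip(ages, terms)]

by_relational = mondrian(records, ["age"], ["n_terms"], k=2, weight=1.0)
by_textual = mondrian(records, ["age"], ["n_terms"], k=2, weight=0.0)

print("NCP(age), relational-driven split:", ncp(by_relational, ["age"]))
print("NCP(age), textual-driven split:   ", ncp(by_textual, ["age"]))
```

In this toy setting, moving `weight` toward the relational side keeps the `age` ranges inside each partition narrow, at the cost of wider `n_terms` ranges, while every partition still holds at least `k` records. This mirrors, in simplified form, the trade-off the abstract describes.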

Supplementary Material

PDF File (a14-singhofer-supp.pdf), 655.29 KB

Cited By

Diera A, Lell N, Garifullina A, Scherp A (2023). Memorization of Named Entities in Fine-Tuned BERT Models. Machine Learning and Knowledge Extraction, pp. 258-279. Online publication date: 22-Aug-2023.
https://doi.org/10.1007/978-3-031-40837-3_16

Index Terms

A novel approach on the joint de-identification of textual and relational data with a modified mondrian algorithm
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
2. Security and privacy
  1. Database and storage security
    1. Data anonymization and sanitization

Published In

DocEng '21: Proceedings of the 21st ACM Symposium on Document Engineering

August 2021

178 pages

ISBN:9781450385961

DOI:10.1145/3469096

General Chairs: Patrick Healy (University of Limerick, Ireland), Mihai Bilauca (University of Limerick, Ireland)

Program Chair: Alexandra Bonnici (University of Malta, Malta)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

Best Paper

Conference

DocEng '21: ACM Symposium on Document Engineering 2021

August 24 - 27, 2021

Limerick, Ireland

Acceptance Rates

Overall Acceptance Rate 194 of 564 submissions, 34%

Bibliometrics

Total Citations: 1
Total Downloads: 124
Downloads (Last 12 months): 15
Downloads (Last 6 weeks): 0

Reflects downloads up to 24 Dec 2024
