Abstract
The standard approach to evaluating text anonymization methods consists of comparing their outcomes with the anonymization performed by human experts. The degree of privacy protection attained is then measured with the IR-based recall metric, which expresses the proportion of re-identifying terms that the anonymization method correctly detected. However, using recall to estimate the degree of privacy protection suffers from several limitations. First, it assigns a uniform weight to each re-identifying term, thereby ignoring the fact that some missed re-identifying terms may influence the disclosure risk more than others. Furthermore, IR-based metrics assume the existence of a single gold-standard annotation. This assumption does not hold for text anonymization, where several maskings (each encompassing a different combination of terms) could be equally valid to prevent disclosure. Finally, these metrics rely on manually anonymized datasets, which are inherently subjective and may be prone to errors, omissions and inconsistencies. To tackle these issues, we propose an automatic re-identification attack for (anonymized) texts that provides a realistic assessment of disclosure risks. Our method follows a premise similar to that of the well-known record linkage methods employed to evaluate anonymized structured data, and leverages state-of-the-art deep learning language models to exploit the background knowledge available to potential attackers. We also report empirical evaluations of several well-known text anonymization methods and tools. Results show significant re-identification risks for all of them, including manual anonymization efforts.
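To make the attack premise concrete, the sketch below illustrates one way such a record-linkage-style re-identification attack can be set up: an attacker who holds public background documents about a pool of individuals fine-tunes a pretrained language model (here BERT, via the Hugging Face transformers library) to classify texts by the individual they refer to, and then applies the classifier to the anonymized documents. The proportion of anonymized documents linked back to the correct individual serves as an empirical disclosure-risk estimate. This is a minimal illustrative sketch, not the authors' implementation; the model choice, training setup and all data variables (background_docs, anonymized_docs, labels) are hypothetical placeholders.

```python
# Illustrative re-identification attack sketch (hypothetical data and settings).
# Premise: an attacker with background documents about n individuals trains a
# classifier that maps a text to the individual it describes, then applies it
# to the anonymized documents. Attack accuracy ~ empirical disclosure risk.
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical toy data: the attacker's background knowledge and the
# anonymized documents whose protection is being audited.
background_docs = ["Alice Berg is a 34-year-old surgeon living in Oslo.",
                   "Bob Stone chairs the local chess club in Bergen."]
background_labels = [0, 1]            # individual ids: 0 = Alice, 1 = Bob
anonymized_docs = ["The ***-year-old surgeon living in *** was interviewed.",
                   "The chairman of the *** club in *** declined to comment."]
true_labels = [0, 1]                  # ground truth, used only to measure risk
num_individuals = 2

class DocDataset(Dataset):
    """Wraps tokenized texts and individual labels for the Trainer API."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_individuals)

train_ds = DocDataset(background_docs, background_labels, tokenizer)
eval_ds = DocDataset(anonymized_docs, true_labels, tokenizer)

# Fine-tune the language model on the attacker's background documents.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="reid_attack", num_train_epochs=3,
                           per_device_train_batch_size=8, logging_steps=10),
    train_dataset=train_ds,
)
trainer.train()

# Attempt to re-identify the individuals behind the anonymized documents.
preds = trainer.predict(eval_ds)
reid_accuracy = (preds.predictions.argmax(axis=-1) == preds.label_ids).mean()
print(f"Empirical re-identification risk: {reid_accuracy:.2%}")
```

In the same spirit as record linkage rates for structured microdata, the reported figure is the share of anonymized documents the attacker links to the right individual (top-k accuracy could be reported as well), rather than the recall of masked terms against a single gold-standard annotation.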
Acknowledgements
Partial support to this work has been received from the Norwegian Research Council (CLEANUP project, grant nr. 308904), the European Commission (projects H2020-871042 “SoBigData++” and H2020-101006879 “MobiDataLab”) and the Government of Catalonia (ICREA Acadèmia Prize to D. Sánchez). The opinions in this paper are the authors’ own and do not commit UNESCO or any of the funders.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Manzanares-Salor, B., Sánchez, D., Lison, P. (2022). Automatic Evaluation of Disclosure Risks of Text Anonymization Methods. In: Domingo-Ferrer, J., Laurent, M. (eds) Privacy in Statistical Databases. PSD 2022. Lecture Notes in Computer Science, vol 13463. Springer, Cham. https://doi.org/10.1007/978-3-031-13945-1_12
DOI: https://doi.org/10.1007/978-3-031-13945-1_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13944-4
Online ISBN: 978-3-031-13945-1