1 Introduction

Text is a ubiquitous means of sharing information among humans, and it constitutes the standard encoding for most unstructured or semi-structured knowledge sources. The availability of text data is therefore crucial for many data mining and machine learning tasks. Examples of textual sources employed for research include medical reports used in pharmacological studies (Meystre et al. 2010), publications in social networks employed for classification (Rivas and Hristidis 2021) or contextual analysis (Gutiérrez-Batista et al. 2018), and reviews or comments leveraged for customer satisfaction analysis (Zhao et al. 2019).

Textual data, however, often include personal information. To comply with the European General Data Protection Regulation (GDPR) (Regulation (EU) 2016), privacy protection measures should be undertaken prior to releasing the data or sharing them with third parties. These measures consist of either obtaining the informed consent of the individuals the data refer to (which may be infeasible), or anonymizing the data, a process by which it should no longer be possible to attribute the data to concrete subjects. The latter renders the data no longer personal and, therefore, outside the scope of the GDPR.

Methods for data anonymization have been extensively employed in data mining to conceal personal information in structured datasets consisting of records that describe individuals by means of predefined attributes (Domingo-Ferrer and Torra 2005; Bertino et al. 2005; Agrawal et al. 2009; Hajian et al. 2015). Well-established anonymization methods and privacy models for this framework include k-anonymity and its extensions (Li et al. 2007; Machanavajjhala et al. 2007; Samarati 2001), and ε-differential privacy (Dwork 2006). Nonetheless, anonymization of unstructured data such as plain text is significantly more challenging (Lison et al. 2021; Csányi et al. 2021). Difficulties arise because the re-identifying attributes appearing in textual resources are unbounded and, usually, not obviously associated with the corresponding subject. Due to these difficulties, text anonymization in real-world applications is still mainly performed by human experts, who use their broad language understanding and contextual knowledge to identify and mask the information that may lead to re-identification (Bier et al. 2009).

The vast majority of automatic text anonymization methods use natural language processing (NLP) techniques to find and mask terms belonging to potentially re-identifying categories (Meystre et al. 2010; Aberdeen et al. 2010; Chen et al. 2019; Dernoncourt et al. 2017; Elazar and Goldberg 2018; Hassan et al. 2018; Huang et al. 2020; Johnson et al. 2020; Liu et al. 2017; Mamede et al. 2016; Neamatullah et al. 2008; Reddy and Knight 2016; Szarvas et al. 2007; Xu et al. 2019; Yang and Garibaldi 2015; Yogarajan et al. 2018), such as names or addresses. Named entity recognition (NER) techniques are employed to this end. However, because these techniques limit masking to a reduced set of predefined semantic categories (whereas re-identification can be caused by a large variety of entity types), they offer poor privacy protection. In fact, as introduced above, it is well acknowledged in the data privacy literature that the types of information that may lead to re-identification are unbounded (Lison et al. 2021; Hassan et al. 2021). On the other hand, methods proposed in the area of privacy-preserving data publishing (PPDP) (Hassan et al. 2021; Sánchez and Batet 2016; Mosallanezhad et al. 2019; Chakaravarthy et al. 2008; Fernandes et al. 2019; Cumby and Ghani 2011; Anandan et al. 2012) consider any information that jeopardizes the subjects' anonymity. However, the distortion added by some of these methods and several scalability issues make them unfeasible in many real-world scenarios (Lison et al. 2021).

In any case, since the majority of text anonymization techniques (and all NLP-based ones, in particular) do not provide formal privacy guarantees, the level of protection attained should be evaluated ex post. In this respect, the de facto way to evaluate anonymization methods is to compare them with manual anonymizations (Meystre et al. 2010; Aberdeen et al. 2010; Dernoncourt et al. 2017; Hassan et al. 2018; Johnson et al. 2020; Liu et al. 2017; Mamede et al. 2016; Neamatullah et al. 2008; Szarvas et al. 2007; Yang and Garibaldi 2015; Sánchez and Batet 2016), which are assumed to define the ground truth. In particular, the IR-based metrics precision and recall are employed to measure the performance of the anonymization approaches. Whereas a drop in precision indicates that terms were unnecessarily masked (which would negatively affect the utility and readability of the anonymized outcomes), recall, which quantifies the proportion of re-identifying terms that were correctly detected, is inversely related to the re-identification risk. However, recall is not a risk metric, but a completeness measure with respect to a manual anonymization. This renders it severely limited for privacy assessment because (i) several masking choices (each one involving a different combination of terms) could be equally valid to prevent re-identification, (ii) recall calculation relies on manual annotations, which are prone to errors, omissions and inconsistencies, and (iii) not all (missed) terms contribute equally to re-identification (Lison et al. 2021; Pilán et al. 2022).

In contrast, the standard evaluation procedure for structured databases in the statistical disclosure control (SDC) field (Hundepool et al. 2012) consists of empirically measuring the residual disclosure risk by subjecting the anonymized data to re-identification attacks, more specifically, record linkage attacks (Domingo-Ferrer and Torra 2003; Nin Guerrero et al. 2007; Torra et al. 2006; Torra and Stokes 2012). Record linkage methods, which strictly focus on structured databases, match records in the anonymized database with background data containing publicly available identified information of the protected individuals. Since each successful match between the two data sources results in unequivocal re-identification, the proportion of correctly linked records offers an accurate empirical measurement of the re-identification risk.

1.1 Contributions and plan

To tackle the limitations of IR-based metrics and the human-dependent evaluation of anonymized texts, in this paper we propose TRIA, a Text Re-Identification Attack that adapts the principles of the record linkage technique (which was designed for, and exclusively applied to, structured databases) to textual documents. Under this framework, we also propose TRIR, a Text Re-Identification Risk empirical metric based on the re-identification accuracy of TRIA, which overcomes the drawbacks of the non-risk-specific recall-based privacy evaluation.

We define TRIA as a (highly dimensional) multiclass classification task, where anonymized texts must be associated with the individuals (classes) known by the attacker. To maximize the re-identification capabilities, TRIA leverages state-of-the-art neural language models (Devlin et al. 2019) to aggregate and characterize the data sources that attackers may use as background knowledge to conduct re-identifications. These models have been shown to achieve human or above-human proficiency in many language-related tasks, thus making our approach a realistic depiction of the work of an ideal human attacker. However, because the unconstrained use of large language models can be computationally expensive, we have carefully considered the means and computational resources expected to be available at the attacker's end in order to simulate a feasible re-identification attack.

We illustrate the effectiveness of our proposal by evaluating the level of privacy protection attained by a variety of text anonymization methods, along with a sample of human-based anonymization (Hassan et al. 2021), under several (ideal or more realistic) attack configurations. The empirical results reported here illustrate the deficiencies of NER-based anonymization (already discussed at a theoretical level in Lison et al. (2021) and Hassan et al. (2021)), and show substantial discrepancies with respect to recall-based evaluations.

The rest of this paper is organized as follows. Section 2 provides background on (text) anonymization and language models. Section 3 discusses related work on privacy evaluation for anonymized text. Section 4 presents and formalizes TRIA and TRIR. Section 5 details the implementation of TRIA. Section 6 reports and discusses empirical results on a variety of text anonymization methods and attack configurations. The last section gathers the conclusions and outlines several lines of future work.

This paper is an extension of the preliminary research reported in a conference paper (Manzanares-Salor et al. 2022). The current paper more accurately formalizes the attack and the associated risk metric, and extensively discusses the design and implementation of the leveraged neural model. Moreover, we report a much more comprehensive set of experiments. As a result, Sects. 2 and 5 are entirely new. Section 4 has been extended with additional details and a figure. Furthermore, all the experiments and results reported in Sect. 6 are new.

2 Background

2.1 The anonymization task

Data anonymization is defined as the complete and irreversible removal from a dataset of all the information that could directly or indirectly re-identify the individuals to whom the data refer (Mackey et al. 2016). Re-identifying attributes can be classified into identifiers and quasi-identifiers. Identifiers are unique and publicly known values of an individual (such as their complete name or passport number) and, therefore, enable re-identification in isolation. Quasi-identifiers are also publicly known attributes (such as the individual's gender, zip code or birth date) that, even though they do not enable re-identification when considered in isolation, may do so when combined with other quasi-identifiers appearing in the same dataset. The quasi-identifiers that can be known about a particular individual constitute the background knowledge that can be exploited by attackers to re-identify the subject. Whereas identifying attributes are limited, any publicly known information on the individual may act as a quasi-identifier and, therefore, quasi-identifying attributes are considered to be unbounded (Hundepool et al. 2012).

Anonymization, as it is defined in the GDPR, must therefore remove all identifiers and either remove or mask quasi-identifiers to truly prevent re-identification. Removing just identifying attributes should not qualify as anonymization (because it does not effectively prevent re-identification), and goes under the term de-identification (Lison et al. 2021; Chevrier et al. 2019).

The anonymization task varies significantly depending on the type of data. In structured databases, individuals are described as records, each one containing a fixed (and usually reduced) set of attributes. This structure makes it relatively straightforward for a human expert to classify such attributes into identifiers and quasi-identifiers. Anonymization methods are then applied on the basis of this attribute classification, and their goal is to produce the masking that best preserves the utility of the data while enforcing a certain level of privacy protection. Masking consists of suppressing attribute values or replacing them with generalizations.

Database anonymization is usually evaluated in the SDC literature via record linkage attacks (Domingo-Ferrer and Torra 2003; Nin Guerrero et al. 2007; Torra et al. 2006; Torra and Stokes 2012; Abril et al. 2012, 2015). Record linkage seeks to re-identify protected records by linking the masked quasi-identifying attribute values in the anonymized database with those in the background knowledge, which contains the original quasi-identifying values of known individuals. If the linkage is successful, the protected record is unequivocally associated with the identity of the corresponding individual. Therefore, the re-identification risk of an anonymized database is defined as the proportion of correctly linked records:

$$\text{Re-identification risk} = \frac{\#\,\text{Correctly Linked Records}}{\#\,\text{Records}}$$
(1)

Because quasi-identifying attributes in the anonymized and background databases unequivocally overlap, record linkage can be as straightforward as linking each anonymized record to the record in the background knowledge with the most similar quasi-identifying attribute values (Hundepool et al. 2012).
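For illustration, a minimal distance-based record linkage sketch could look as follows. The numeric encoding of quasi-identifiers and the squared-distance criterion are simplifying assumptions for this example, not the exact procedures used in the cited works:

```python
# Minimal distance-based record linkage sketch (illustrative only).
# Records are dicts of numeric quasi-identifier values; real attacks also
# handle categorical attributes and masked/generalized values.

def link_records(anonymized, background, quasi_ids):
    """Link each anonymized record to the most similar background record."""
    links = {}
    for a_id, a_rec in anonymized.items():
        best, best_dist = None, float("inf")
        for b_id, b_rec in background.items():
            dist = sum((a_rec[q] - b_rec[q]) ** 2 for q in quasi_ids)
            if dist < best_dist:
                best, best_dist = b_id, dist
        links[a_id] = best
    return links

def reidentification_risk(links, true_mapping):
    """Eq. (1): proportion of correctly linked records."""
    correct = sum(1 for a_id, b_id in links.items() if true_mapping[a_id] == b_id)
    return correct / len(true_mapping)
```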

The goal of text anonymization is also to preclude re-identification of the individuals described in the text (e.g., the biographee in a biography or the patient in a medical report).Footnote 1 However, in unstructured text data, (quasi-)identifiers are not explicitly defined. In fact, almost every word appearing in a text describing the individual to be protected may act as a quasi-identifier (Batet and Sánchez 2018). On this basis, the main challenge of unstructured text anonymization is the detection of text spans that may act as (quasi-)identifiers (e.g., demographic data on the individuals, spatial-temporal information, etc.), rather than the choice of the masking strategy (Lison et al. 2021; Csányi et al. 2021). Whereas masking consists of suppressing the detected (quasi-)identifying text spans or replacing them with less detailed information (e.g., “New York” → “city”, “April 6th” → “April”), the detection of (quasi-)identifiers is still performed manually in many practical applications (Bier et al. 2009). This makes the latter a costly process that usually involves several human experts, is prone to errors and omissions, and is often approached as a de-identification (rather than an actual anonymization) task (Lison et al. 2021; Pilán et al. 2022).

To alleviate this burden, several automatic methods to detect quasi-identifiers have been proposed. As introduced in the previous section, NLP-oriented methods (Meystre et al. 2010; Aberdeen et al. 2010; Chen et al. 2019; Dernoncourt et al. 2017; Elazar and Goldberg 2018; Hassan et al. 2018; Huang et al. 2020; Johnson et al. 2020; Liu et al. 2017; Mamede et al. 2016; Neamatullah et al. 2008; Reddy and Knight 2016; Szarvas et al. 2007; Xu et al. 2019; Yang and Garibaldi 2015; Yogarajan et al. 2018) rely on sequence labeling to detect text spans belonging to predefined categories that may enable re-identification (typically named entities, such as names or addresses). Detection is based on handcrafted rules or machine learning models trained to identify occurrences of those specific categories. The detected entities are then either removed or masked by replacing them with their categories. Because the types of entities that can be supported by NER models are limited (whereas quasi-identifiers are unbounded), the protection offered by these methods is closer to de-identification than to anonymization (Lison et al. 2021).

On the other hand, PPDP methods operate with an explicit account of disclosure risk and anonymize documents by enforcing a privacy model. As a result, they do not limit the types of information that may act as (quasi-)identifier. However, many of these methods are only applicable in restricted domains (Chakaravarthy et al. 2008; Cumby and Ghani 2011; Anandan et al. 2012), suffer from scalability issues due to their dependency on external resources (such as web search engines) (Sánchez and Batet 2016, 2017; Staddon et al. 2007), or produce distorted word distributions, rather than actual documents, thereby severely hampering the readability and utility of the anonymized outcomes (Mosallanezhad et al. 2019; Fernandes et al. 2019). An exception to the aforementioned methods is (Hassan et al. 2021), which proposes a practical PPDP-oriented approach that does not rely on external resources and retains the structure of the document. For this, it leverages the Word2Vec word embeddings model (Mikolov et al. 2013) to efficiently detect text spans that are semantically close to the individual to be protected. More specifically, the model is trained from scratch on the documents to be protected and (optionally) on some public in-domain text. For each training sample derived from a protected document, the corresponding individual name is prepended. This leads to words closely associated with the individual (i.e., identifiers and quasi-identifiers) having embeddings similar to that of the individual’s name. On this basis, words in a protected document with a similarity to the individual’s name above a predefined threshold are detected as (quasi-)identifiers. These disclosive words are then subjected to masking via ontology-based generalizations.
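As an illustrative sketch of the detection step of this approach (not the authors' exact implementation), one could train a Word2Vec model with gensim on name-prepended documents and flag the words whose similarity to the individual's name exceeds the threshold; the toy corpus and threshold value below are placeholders:

```python
from gensim.models import Word2Vec

# Each training sample is the token sequence of a protected document with the
# corresponding individual's name prepended (simplified; the original method may
# also train on additional in-domain text).
samples = [
    ["mary_smith", "mary_smith", "is", "an", "actress", "born", "in", "new_york"],
    ["john_doe", "john_doe", "directed", "a", "film", "in", "london"],
]
model = Word2Vec(sentences=samples, vector_size=100, window=5, min_count=1)

def detect_quasi_identifiers(doc_tokens, individual, threshold=0.25):
    """Flag words whose embedding is close to the individual's name embedding."""
    return [w for w in doc_tokens
            if w in model.wv and model.wv.similarity(individual, w) >= threshold]
```

The flagged words would then be masked, e.g., via the ontology-based generalizations mentioned above.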

2.2 Language models

Nowadays, NLP models are rarely built from scratch. Instead, it is common to employ pre-trained language models and fine-tune them to the specific task to be solved. Language models are statistical models grounded on neural networks that assign probabilities to sequences of words and constitute the state-of-the-art in many NLP tasks, such as translation, text generation, sentiment analysis, summarization, question answering, or text classification (Devlin et al. 2019; Raffel et al. 2020; Brown et al. 2020; Bommasani et al. 2021).

Language models are typically based on the Transformer architecture (Vaswani et al. 2017), which was introduced as an encoder-decoder architecture for language translation, but has since become the main building block for generic language models. Transformers make it possible to efficiently derive contextualized vector representations (contextual embeddings) for each token (i.e., word or word fragment) of a given sequence. Those embeddings are computed through multiple neural layers, each being represented by its own set of parameters. Compared to previous neural architectures based on recurrent layers (Hochreiter and Schmidhuber 1997; Cho et al. 2014), Transformer models can handle long-range dependencies between tokens while simultaneously allowing tokens to be processed in parallel. Consequently, Transformers are better at handling longer and more complex texts.

A language model is initially pre-trained with a large corpus of documents (e.g., Wikipedia and the BookCorpus (Zhu et al. 2015)) (Devlin et al. 2019; Mikolov et al. 2013), thereby capturing a broad range of text types and linguistic constructions. During pre-training, the model parameters are optimized in an unsupervised fashion from tasks such as masking a randomly selected word and subsequently predicting its value (masked language modelling), predicting the next word in a sequence (causal language modelling) or determining whether two sentences follow each other in a document (next sentence prediction).

Once a general language model is pre-trained, it can be subsequently adapted to a particular task (such as document classification) through a fine-tuning process consisting of a supervised learning step on a task-specific corpus. This learning process is a standard neural network training, where the entire training dataset is iterated over multiple times (epochs), and multiple training samples (batches) are processed in each iteration. Thanks to the general linguistic and factual knowledge acquired during pre-training, satisfactory results can be achieved even if the amount of labeled data available for fine-tuning is small. In addition, results can be improved if, prior to fine-tuning, the language model is additionally pre-trained on the task data. This process, which is conducted with the same training method employed for general pre-training, adapts the model to the specific task domain.

In recent years, the HuggingFace Transformers library (Wolf et al. 2020) has emerged as the standard free platform for the implementation, pre-training, fine-tuning, usage and sharing of (pre-trained) language models. Among the most popular and well-established Transformer-based language models are BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al. 2019) and its variations, which either improve its performance (Liu et al. 2019) or reduce the model's size and, therefore, its computational requirements (Sanh et al. 2019). After fine-tuning, BERT and its variations have demonstrated human-level performance and competitiveness with larger language models (Sun et al. 2023) in multiple language-related tasks, including text classification. All these language models are freely available in the Hugging Face Transformers library (Wolf et al. 2020) with pre-trained weights and an easy-to-use implementation.

3 Related work

The vast majority of text anonymization methods proposed in the literature do not offer formal privacy guarantees (Pilán et al. 2022). Consequently, the level of privacy protection attained should be evaluated ex post, that is, through empirical evaluations of the re-identification risk of the protected outcomes. The objective is to determine whether the anonymization meets the privacy requirements set by the data holder.

Because IR-based metrics such as precision and recall constitute the standard approach to evaluate many sequence labelling tasks (and NER in particular), and NER models are the most common methods employed to tackle text anonymization, works in the literature rely on the recall metric to evaluate the attained privacy (Meystre et al. 2010; Aberdeen et al. 2010; Dernoncourt et al. 2017; Hassan et al. 2018; Johnson et al. 2020; Liu et al. 2017; Mamede et al. 2016; Neamatullah et al. 2008; Szarvas et al. 2007; Yang and Garibaldi 2015; Sánchez and Batet 2016). Recall is an IR-based completeness metric defined as the proportion of relevant elements that were correctly detected by the system:

$$\text{Recall} = \frac{\#\,\text{TruePositives}}{\#\,\text{TruePositives} + \#\,\text{FalseNegatives}}$$
(2)

where #TruePositives is the number of relevant elements detected by the system and #FalseNegatives is the number of missed ones with respect to a ground truth. In text anonymization, these relevant elements are the (quasi-)identifying text spans that should be detected (and subsequently masked), and the ground truth consists of manual annotations of (quasi-)identifiers performed by human experts.

However, employing recall for privacy evaluation has several drawbacks (Pilán et al. 2022; Mozes and Kleinberg 2021). Specifically, recall does not assess the residual re-identification risk of anonymized texts, but only compares the anonymized outcomes against manual annotations. These annotations are subjective, and may be prone to errors and omissions (Lison et al. 2021; Pilán et al. 2022). Moreover, manual annotation is time consuming, and necessarily involves several human experts, whose annotations should be integrated through a non-trivial and non-univocal process (Pilán et al. 2022). Another drawback of IR-based metrics is that they assume the existence of a single gold standard annotation. Even though this assumption may be reasonable for NER or other NLP-oriented tasks, it does not hold for text anonymization, where several masking choices (each one encompassing a different combination of quasi-identifying terms) could be equally valid to prevent re-identification (Lison et al. 2021). Finally, recall-based evaluation assumes that all identified/missed elements contribute equally to mitigate/increase re-identification risk, which is rarely the case (Pilán et al. 2022). Indeed, failing to mask an identifier (such as the full name of an individual) is much more disclosive than just missing the birthdate or city name.

4 Re-identification attack and disclosure risk metric for anonymized text

In the following we present TRIA, a Text Re-Identification Attack for (anonymized) texts based on large neural language models, and TRIR, a Text Re-Identification Risk metric based on TRIA’s accuracy. Both aim to provide a practical solution to evaluate the actual privacy protection offered by anonymization methods for textual data. Compared to recall, our approach brings the following advantageous features:

1. Self-supervision: TRIR does not require costly manual annotations as an anonymization ground truth, which can be imperfect, potentially biased (if coming from a single annotator) or inconsistent (if coming from multiple annotators). Once the protected and background documents are set, our metric is fully automatic. Supervised learning only requires, for each document, the label indicating the individual to be protected.

2. Focus on risk: Contrary to recall, our metric directly measures the empirical re-identification risk of (protected) documents as a whole, rather than assessing the coverage of individual annotations.

3. Flexibility: Our metric can be tailored to the different resources available to attackers (see Sect. 4.2), thereby allowing either ideal or more realistic attackers to be configured.

TRIA tries to re-identify the subjects referred to in a collection of anonymized texts by leveraging a multiclass classifier, with one class for each individual known by the attacker. A set of identified and publicly available texts is employed to train the classifier. These texts encompass a population of known subjects to which the individuals referred to in the anonymized set are expected to belong. For instance, to re-identify anonymized medical reports from a city hospital, the attacker may use publicly available social network posts from the residents of that city. The set of known individuals should therefore be a superset of the anonymized subjects. The anonymized documents would contain masked quasi-identifiers (e.g., age intervals) and confidential attributes (e.g., diagnoses) of unidentified individuals, whereas the public texts would contain identifiers (e.g., names and surnames) and clear quasi-identifiers (e.g., specific ages) of known individuals. Deficiencies in the anonymization would enable unequivocal matchings of the (quasi-)identifiers in both types of documents, allowing the attacker to re-identify the subjects in the protected documents and, subsequently, disclose their confidential information.

According to the above, our approach can be seen as an adaptation of the record linkage attack (designed for structured databases) to textual data, where words or text spans correspond to attribute values, documents correspond to records, and the classifier performs the linkage between the anonymized and public data.

4.1 Formalization

Let AD be the set of anonymized (i.e., non-identified) documents and BD the set of identified public documents (i.e., background knowledge). Each document refers to a specific subject, thereby defining the sets of individuals AI and BI, and the mapping functions FA: AD → AI and FB: BD → BI.

As in the original record linkage attack, we assume that AI ⊆ BI, which means that a re-identification function FC: AD → BI exists, which matches protected documents with the corresponding known individuals. From the attacker's point of view, AD, BD, BI and FB are known, and AI, FA and FC are unknown. Consequently, the goal of the attack is to exploit the similarities between AD and BD to obtain F̂C, an approximation of FC. Figure 1 depicts a graphical representation of this setting.

Fig. 1 TRIA scenario, including the sets of individuals and documents from the point of view of the attacker

Using the medical example above, the anonymized documents (AD) might be patient medical reports, and the individuals to be protected (AI) would be those patients. The identified documents (BD) are publicly available documents associated with the individuals' identities, such as posts those patients published about themselves on their social network accounts. The individuals linked to the public documents (BI) are known to the attacker (social networks are non-anonymous and publicly accessible), whereas the patients (AI) and their connection to specific medical reports (which corresponds to FC, the re-identification function) are initially unknown.

Our proposal is formalized in Algorithm 1, which outputs the number of correct re-identifications achieved by TRIA for a collection of anonymized documents.

Algorithm 1 Re-identification risk assessment algorithm for anonymized text documents

First, a multiclass classifier is built and trained to predict FC (line 1, see Sect. 5.4 for details). Following the formal notation above, the classifier would implement F̂C by learning which individuals from BI correspond to the documents in BD based on the knowledge (i.e., data sources) available to the attacker. Afterwards, the classifier is tested on the anonymized set of documents AD (line 4, refer to Sect. 5.5 for details). A correct re-identification happens when the prediction (i.e., F̂C) matches FC (lines 5–6). The total number of re-identifications is finally returned in line 9.
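A minimal Python sketch of Algorithm 1 (and of the TRIR computation introduced below in Eq. 3) could look as follows; the function signatures are our own illustration, and build_classifier and predict are the methods detailed in Sects. 5.4 and 5.5:

```python
def tria_reidentifications(A_D, F_A, B_D, B_I, F_B, build_classifier, predict):
    """Count correct re-identifications of anonymized documents (Algorithm 1).

    A_D: anonymized documents; F_A: ground-truth mapping A_D -> individuals
    (used only to verify predictions); B_D/B_I/F_B: background documents,
    known individuals and their mapping (the attacker's knowledge).
    """
    classifier = build_classifier(B_D, B_I, F_B)          # line 1
    num_re_ids = 0
    for doc in A_D:
        predicted_individual = predict(classifier, doc)   # line 4 (the approximation of F_C)
        if predicted_individual == F_A[doc]:              # lines 5-6
            num_re_ids += 1
    return num_re_ids                                     # line 9

# TRIR (Eq. 3): accuracy of the attack over the protected individuals.
# trir = tria_reidentifications(...) / len(A_I)
```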

As in the record linkage method (Eq. 1), we assess the TRIR of AD as the accuracy of TRIA:

$$TRIR = \frac{numReIds}{\left| A_{I} \right|}$$
(3)

where numReIds is the number of correct re-identifications achieved by Algorithm 1, and |AI| is the number of protected individuals in the AI set.

4.2 Attacker’s resources

One of our goals for TRIA is to be flexible enough to encompass both ideal and more realistic attackers. This means considering the resources (i.e., background knowledge and computational resources) that attackers may (ideally or realistically) devote to the attack. This is in line with the GDPR,Footnote 2 which states that, to assess the residual risk of anonymized data, one should account for the reasonable means that could be employed to achieve re-identification.

We next characterize and discuss the resources that attackers may use to conduct re-identification attacks, and that should be considered when building TRIA’s classifier:

1. Background knowledge breadth: Corresponds to the number and type of individuals (and corresponding documents) considered as candidates for re-identification or, in other words, to the size of the population to which the individuals to be protected belong. Formally, this is the BI set (and corresponding BD) that the attacker determines as a superset of AI. It is important to note that the AI ⊆ BI assumption may not hold in practice, because AI would usually be unknown to the attacker. In that case, the re-identification risk is limited by the proportion of overlap between AI and BI. Furthermore, the difficulty of the re-identification task depends on the number of individuals and on the intra-class similarity of both AI and BI. On the one hand, increasing the size of BI complicates re-identification, since there are more individuals (classes) to distinguish; this behavior is inherent to any multi-class classification task. In addition, the more individuals in BI not present in AI, the more prone to false positives the classifier becomes; this increases the cost of the attack (because more classes must be learned) with no expected benefit. On this basis, the goal of an attacker is to define a BI set with a size as close as possible to that of AI while trying to fulfill AI ⊆ BI. On the other hand, a high intra-similarity among the individuals in BI not present in AI does not affect the accuracy of the attack, since there are no instances of these individuals in AD. Finally, regarding the similarity between the individuals in BI also present in AI and those not present in AI, the lower the better, since we do not want the classifier to be confused by individuals outside AI.

2. Background knowledge depth: Encompasses the quantity and quality of the background information available to attackers for each individual or, in other words, the (quasi-)identifiers available in each document of BD. Generally, the more information the attacker has about an individual, the easier re-identification will be. However, if an individual in BI is not represented in AI (i.e., a background knowledge breadth issue), that information will likely be irrelevant and misleading for the classifier.

3. Computational capabilities: The processing power available to attackers influences their ability to perform inferences. Greater computational capacity enables the use of more complex models to build the classifier and/or more costly training strategies (see next section), potentially increasing the success rate of the attack.

In the next section we discuss the choices that attackers may make to balance inference accuracy and computational cost. Moreover, the experiments we report in Sect. 6 are designed so that the influence of these resources on TRIA's accuracy can be observed in practice.

5 Choosing and implementing the classifier model

The classifier model is a fundamental element of our approach because it determines the accuracy of the attack according to the resources available to execute it. Even though the design of TRIA is general and may be enforced by means of any machine learning model capable of classifying documents, in this section we present our specific design choices and implementation details adopted for the build_classifier and predict methods (lines 1 and 4 of Algorithm 1).

5.1 Selecting the language model

Our goal is to reproduce as realistically as possible the approach that an attacker may employ for re-identification. Under this premise, we expect attackers to take advantage of pre-trained language models, introduced in Sect. 2.2 as the state of the art for NLP. The idea is to exploit the general understanding of text brought by language models to devise a re-identification attack by additionally pre-training and fine-tuning those models with the available background knowledge.

On this basis, we consider BERT (Devlin et al. 2019) and its variations (Liu et al. 2019; Sanh et al. 2019; Jiao et al. 2020; Wang et al. 2020; Rasmy et al. 2021) as reasonable language model candidates for the attack. Table 1 compares widely used BERT-based language models in terms of size and pre-training data, which we will later use in our empirical experiments. In particular, DistilBERT (Sanh et al. 2019) is a distilled version of the original BERT that reduces the model's size by 40% while retaining 97% of its performance on multiple tasks, whereas RoBERTa (Liu et al. 2019) is a replication study of BERT pre-training, with a careful search of hyperparameters and trained with ten times more data (including BERT's training set). Due to its larger size, RoBERTa tends to provide better results than BERT on multiple tasks. BERT-based models can obtain human-level results without requiring either a huge cost or unfeasible knowledge assumptions on the attacker's side. They can also be easily run locally, thereby eliminating the need to outsource potentially sensitive documents to a remote server for processing. Furthermore, these models are well established and publicly accessible, with numerous domain-specific fine-tuned versions available.Footnote 3

Table 1 Comparison of widely-used BERT-based language models

The following subsections explain our approach for using a BERT-based language model for re-identification. This approach can be applied to any BERT-based language model with negligible code modifications thanks to the HuggingFace’s Transformers library (Wolf et al. 2020) we employ, which is the de facto standard library for the implementation, pre-training, fine-tuning and usage of language models.

5.2 From language model to classifier model

To adapt a BERT-based language model to a classification task, we apply the standard procedure described in Devlin et al. (2019), already implemented in the Transformers library.Footnote 4 It consists of appending a fully connected classification layer to the language model. This layer takes as input the output vector corresponding to a special classification token named [CLS] that is prepended to the input sequence (Devlin et al. 2019). The classification layer then produces a logits vector with one value per class (which, in our case, corresponds to an individual from BI). By applying the SoftMax normalization to the logits, the final probability for each class is obtained. Since this new layer is designed to identify the individual from BI referred to in the document, we will refer to it as the individual classification layer.
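With the Transformers library, attaching the individual classification layer on top of a BERT-based encoder reduces to the standard sequence-classification wrapper; the checkpoint name and number of individuals below are placeholders:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

num_individuals = 500  # |B_I|, i.e., one class per known individual
model_name = "distilbert-base-uncased"  # any BERT-based checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Adds a randomly initialized classification layer on top of the [CLS] output.
classifier = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=num_individuals)
```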

Transformer-based models such as BERT have a limited input length. In the case of BERT (and its variations), this limit is 512 tokens, which is shorter than the length of many real-world documents. To address this limitation, we divide each document into splits of acceptable length and process each split with the BERT-based classifier. Details of the splitting criterion are provided in the next section. During training, the BERT-based classifier is fine-tuned to classify document splits individually (see Sect. 5.4). For prediction, we aggregate the classification outputs of all splits to obtain the final prediction. More details are provided in Sect. 5.5.

5.3 From documents to model input

For each input document, we perform the following three steps:

1. Pre-processing: This step unifies semantically equivalent text spans and removes non-meaningful ones. Concretely, we perform lemmatization (e.g., “changing” replaced by “change”) and remove special characters (e.g., the “#” of a hashtag) and stop words (e.g., “the” or “that”). This can be done with any NLP library, such as spaCy.Footnote 5 On the one hand, lemmatization gets rid of morphological variations of words, making the classification less dependent on the text syntax and more focused on the ((quasi-)identifying) facts conveyed by the document. On the other hand, the removal of special characters and stop words brings three benefits. First, it discards non-meaningful (and, therefore, non-identifying) information that may confuse the model. Second, since the classifier's input length is limited, removing those elements increases the density of quasi-identifiers in each split/input, which should be beneficial for re-identification. Finally, it reduces the number of splits (explained in the sliding window step) generated per document which, in turn, decreases the training and prediction runtimes.

2. Tokenization: Transforms the sequence of words of the input text into a sequence of tokens, as required by a language model. For BERT-based models, the standard WordPiece subword segmentation algorithm (on which BERT relies (Devlin et al. 2019)) is employed, by leveraging the implementation available in the Transformers library.Footnote 6

3. Sliding window: To accommodate the input length limitation of BERT-based models (512 tokens), tokenized documents sometimes need to be divided into smaller splits. Inspired by Pappagari et al. (2019), we split documents by using a sliding window with a length of N tokens (N being smaller than or equal to the model's limit) and an overlap of M tokens with the previous window (equivalent to a stride of N-M). The overlap ensures that the model never loses the context of consecutive tokens, that is, that the model can observe the previous and following M tokens for every token (except the document's first and last) in at least one split. Setting appropriate N and M values is crucial, since they determine how long the detected long-term dependencies can be (which affects the model's accuracy) and the number of text splits generated (which influences training and prediction runtimes, which depend on the number of inputs rather than on their length). These N and M hyperparameters can be empirically adjusted through hyperparameter search, as we detail in Sect. 5.6. A sketch of the full input pipeline is shown below.
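A minimal sketch of this input pipeline, assuming the spaCy en_core_web_sm model and a DistilBERT tokenizer (both placeholders), could be:

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def preprocess(text):
    """Lemmatize and drop stop words, punctuation and whitespace."""
    doc = nlp(text)
    return " ".join(t.lemma_ for t in doc
                    if not (t.is_stop or t.is_punct or t.is_space))

def sliding_window_splits(text, window=100, overlap=25):
    """Tokenize and split into overlapping windows of at most `window` tokens."""
    token_ids = tokenizer(preprocess(text), add_special_tokens=False)["input_ids"]
    stride = window - overlap  # assumes overlap < window
    splits = []
    for start in range(0, max(len(token_ids) - overlap, 1), stride):
        splits.append(token_ids[start:start + window])
    return splits
```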

5.4 Building the classifier

By leveraging the steps outlined in the previous subsections, we now detail the implementation of the build_classifier method from Algorithm 1. As shown in the algorithm, it receives as input the background knowledge documents (BD), the corresponding individuals (BI) and the mapping function (FB) connecting both. The output of the method is a classifier trained for document re-identification. Figure 2 depicts the workflow of this method, which we explain below.

Fig. 2 Workflow for the build_classifier method from Algorithm 1

As introduced in Sect. 2.2, adapting language models to specific tasks often involves additional pre-training and/or fine-tuning steps on task data. Following Sun et al. (2019), we apply both steps to the language model to obtain a corpus-specific BERT classifier adapted to the domain of the documents to re-identify. In both steps, the procedures defined in Sect. 5.3 are applied identically to all documents, with the exception of the sliding window splitting configuration.

In the additional pre-training step, the model is further optimized on a masked language modelling (MLM) objective. This is the standard pre-training task on which BERT was initially trained, and implementationsFootnote 7 and guidesFootnote 8 are provided in the Transformers library. In MLM, a vocabulary classification layer is appended to the language model, and all the parameters of this extended model are optimized. At the end, the vocabulary classification layer is removed, and we preserve the additionally pre-trained language model. For this step, increasing the sliding window length and/or overlap (NP and MP in Fig. 2) should be beneficial, so that the context available for each token is maximized.
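A sketch of this additional pre-training step with the Transformers Trainer could look as follows; the checkpoint name, toy corpus and output paths are placeholders:

```python
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"   # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
mlm_model = AutoModelForMaskedLM.from_pretrained(model_name)

# Toy stand-in for the token-id windows obtained from B_D and B_D' with the
# N_P/M_P sliding window configuration (see Sect. 5.3).
background_texts = ["Mary Smith is an actress born in New York.",
                    "John Doe directed a film in London."]
pretrain_dataset = [
    {"input_ids": tokenizer(t, truncation=True, max_length=512)["input_ids"]}
    for t in background_texts
]

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(output_dir="mlm_pretraining", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=5e-5),
    train_dataset=pretrain_dataset,
    # Randomly masks a fraction of the tokens and sets them as prediction targets.
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()
# When later reloaded as a sequence classifier, the MLM head is discarded and
# only the additionally pre-trained encoder weights are reused.
mlm_model.save_pretrained("pretrained_encoder")
tokenizer.save_pretrained("pretrained_encoder")
```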

For the fine-tuning step, the language model resulting from the additional pre-training is further optimized on labelled data,Footnote 9 which in our case corresponds to the identity of the individuals referred to in the background documents. More specifically, an individual classification layer is appended to the language model (see Sect. 5.2), and the resulting classifier model is trained with a cross-entropy lossFootnote 10 to predict the individual referred to in each document split. For this step, increasing the length of the sliding window (NF in Fig. 2) may not be beneficial: whereas a longer window provides more information to assist the classification, it can also be too specific to the document. Moreover, for a fixed overlap ratio (MF in Fig. 2), the longer the window, the fewer splits are created per document, thus reducing the variability of the samples per individual. As a result, too large a window may lead to overfitting and make the model focus on document-specific (sub)sentences rather than on more general (quasi-)identifying words or text spans. The same sliding window configuration selected for fine-tuning is used at prediction time (see Sect. 5.5).
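Continuing the previous sketch, the fine-tuning step could be written as follows; the labels and dataset construction are toy placeholders, whereas in the actual attack the splits would come from the background documents labeled on BI through FB:

```python
from transformers import (AutoModelForSequenceClassification, DataCollatorWithPadding,
                          Trainer, TrainingArguments)

# Toy labels: the index in B_I of the individual each split refers to (via F_B).
individuals = ["Mary Smith", "John Doe"]                        # B_I
finetune_dataset = [dict(sample, labels=i)                      # in practice, re-split
                    for i, sample in enumerate(pretrain_dataset)]   # with N_F/M_F

# Load the additionally pre-trained encoder and attach the individual
# classification layer (one output per individual in B_I).
classifier = AutoModelForSequenceClassification.from_pretrained(
    "pretrained_encoder", num_labels=len(individuals))

trainer = Trainer(
    model=classifier,
    args=TrainingArguments(output_dir="finetuning", num_train_epochs=20,
                           per_device_train_batch_size=16, learning_rate=5e-5),
    train_dataset=finetune_dataset,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()   # cross-entropy loss over the individual classes
```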

Regarding the data employed for this two-step training, it should be stressed that the attackers’ knowledge is limited to BD, BI and FB. Documents in BD provide knowledge of the individuals’ specific vocabulary, which improves understanding of domain-specific words. Additionally, BD can be labeled on BI by using FB, thereby providing useful information about the relationship between the publicly available information and the individuals’ identity. This can lead to the detection of (quasi-)identifying attributes (e.g., the person’s name or demographic attributes), which form the basis of the re-identification attack.

An intuitive approach would be to perform additional pre-training using BD and fine-tuning with BD labeled on BI. This results in a corpus-specific model capable of mapping documents to BI, as it is needed for the attack. However, it is important to note that the goal of the attack is to correctly classify documents from AD, which stem from a different data distribution than BD documents. In particular, BD are clear texts (e.g., identified publications in social networks) whereas AD are anonymized documents (e.g., non-identified medical reports with quasi-identifying terms masked via suppression or generalization). This could impair the re-identification accuracy, since machine learning algorithms are sensitive to differences between training and test data distributions.

Our solution is to create BD’, an anonymized version of BD, by using any practical text anonymizer that attackers may have within their reach. Using the same method employed for AD would be ideal but, since that method would rarely be known, a standard NER-based approach (given that NER is the most used and directly applicable technique for text anonymization) may be employed instead. In our experiments, we utilized spaCy NER for this purpose, which is also one of the evaluated anonymization techniques (refer to Sect. 6.1). Because the resulting document set BD’ is more similar to AD, it provides a better approximation of how data are (typically) anonymized. Moreover, BD’ can be labeled on BI (given that the mapping BD’ → BD is known), which facilitates discovering the identities associated with the masked documents; for example, by discovering identifying words neglected by the anonymization method (e.g., a specific location) that also appear in AD documents. Under these considerations, we propose using BD and BD’ for additional pre-training, and the union of BD and BD’ labeled on BI for fine-tuning, thereby obtaining a classifier model better adapted to the expected anonymized documents.
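A simple way to produce BD’ with spaCy NER, replacing each detected entity mention with its entity label, might be sketched as follows (the replacement format is our own simplification):

```python
import spacy

nlp = spacy.load("en_core_web_lg")

def ner_anonymize(text):
    """Replace every detected named entity with its entity label."""
    doc = nlp(text)
    anonymized, last_end = [], 0
    for ent in doc.ents:
        anonymized.append(text[last_end:ent.start_char])
        anonymized.append(ent.label_)          # e.g., "Mary Smith" -> "PERSON"
        last_end = ent.end_char
    anonymized.append(text[last_end:])
    return "".join(anonymized)

# B_D_prime = [ner_anonymize(doc) for doc in B_D]
```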

5.5 Prediction

The classifier resulting from the build_classifier method is employed in the predict method of Algorithm 1 to infer the individual from BI corresponding to a (presumably anonymized) document. This final classification step is depicted in Fig. 3. First, the document is pre-processed, tokenized and split by using the same sliding window configuration employed for fine-tuning (see Sect. 5.4). Afterwards, each split is processed by the trained BERT-based classifier (one at a time) and the results are aggregated into a single global prediction by means of the element-wise sum of all the logits vectors. The distribution of class probabilities (in our case, equivalent to individuals in BI) is obtained by applying a SoftMax function to this sum. Finally, the class/individual with the maximum probability is chosen as the predicted individual for the document.

Fig. 3 Our proposed workflow for the predict method

We favored the sum of logits over the most frequent prediction or the sum of class probabilities (i.e., logits after SoftMax) in order to account for the magnitude of the output. If the model is able to clearly identify an individual based on one text split, this prominent output will be reflected in the sum of logits. The motivation for this design is to mimic human reasoning for re-identification, which gives greater importance to parts of the text with clear re-identification clues (such as (part of) the person's name left unmasked) than to those with less clear evidence. This is also in line with Li et al. (2007) and Mozes and Kleinberg (2021), which point out that missing a single identifying attribute during anonymization may be enough to reveal an individual's identity.
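As a continuation of the previous sketches (it reuses the sliding_window_splits and tokenizer defined there), the predict method could be sketched as follows, aggregating the per-split logits by element-wise summation before taking the argmax; batching and device handling are omitted:

```python
import torch

def predict(classifier, document, window=100, overlap=25):
    """Infer the individual referred to in a (presumably anonymized) document."""
    summed_logits = None
    classifier.eval()
    with torch.no_grad():
        for ids in sliding_window_splits(document, window=window, overlap=overlap):
            # Add [CLS]/[SEP] around the split before feeding it to the classifier.
            input_ids = torch.tensor([tokenizer.build_inputs_with_special_tokens(ids)])
            logits = classifier(input_ids=input_ids).logits[0]
            summed_logits = logits if summed_logits is None else summed_logits + logits
    probabilities = torch.softmax(summed_logits, dim=-1)   # per-individual probabilities
    return int(torch.argmax(probabilities))                # index of the predicted individual
```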

5.6 Selection of hyperparameters

The classifier construction relies on the definition of several hyperparameters, both common to neural network training (i.e., number of epochs, learning rate and batch size, see Sect. 2.2) and TRIA-specific (i.e., sliding window length and sliding window overlap, see Sect. 5.2). We propose to set these hyperparameters empirically on a separate development set, as is standard in machine learning; that is, by defining a development set that is disjoint from but similar to the evaluation set, and performing multiple trials with different hyperparameter values in order to select the combination providing the best results.

In our case, the evaluation set to be mimicked by the development set consists of the documents in AD labeled on BI (i.e., the re-identification ground truth), and a trial is the complete re-identification attack presented in Algorithm 1. We propose to run the additional pre-training for a fixed number of epochs, and to fine-tune until a pre-defined maximum number of epochs is reached or the development accuracy stops improving (early stopping). As development set, we propose to create a random subset of the documents in BD, which we call CD, and transform it to be as similar as possible to the data distribution of AD. A straightforward approach would be to anonymize CD; nevertheless, this would result in texts identical to those in BD’, which are already present in the training data. Therefore, a preliminary step is required to differentiate the CD texts from those in BD and, if possible, to make them resemble those in AD prior to anonymization. We propose to apply a summarization-like procedure to the documents in CD, which results in a set that we call ÇD. To this end, abstractive or hybrid summarization methods are preferred over extractive ones (El-Kassas et al. 2021), so that the resulting summaries do not include sentences identical to those present in BD. After summarization, the documents in ÇD are anonymized (producing ÇD’) in the same way as done for BD’. The resulting ÇD’ is used as the development set. The specific hyperparameters employed in our experiments are detailed in Sect. 6.3.
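One possible implementation of this development set construction, using an off-the-shelf abstractive summarizer from the Transformers library, is sketched below; the summarization checkpoint, sampling fraction and the ner_anonymize helper from the earlier sketch are assumptions:

```python
import random
from transformers import pipeline

summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def build_development_set(background_docs, dev_fraction=0.1):
    """B_D -> C_D (random subset) -> Ç_D (summaries) -> Ç_D' (anonymized summaries)."""
    c_d = random.sample(background_docs, max(1, int(dev_fraction * len(background_docs))))
    summaries = [summarizer(doc, max_length=130, min_length=30,
                            do_sample=False)[0]["summary_text"]
                 for doc in c_d]
    return [ner_anonymize(summary) for summary in summaries]   # Ç_D'
```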

6 Evaluation results

In the following we report empirical results on using TRIA and TRIR to evaluate several text anonymization methods on a common corpus of text documents. To offer a broad comparison, we assessed the residual re-identification risk of NLP-based, PPDP-oriented and manual anonymizations. Our experiments are configured to provide a comprehensive analysis of the influence of the attacker's available resources on the risk assessment. We also report a comparison against the standard recall metric based on ground truth human annotations. To ensure reproducibility, all the code and data are available at: https://github.com/BenetManzanaresSalor/TextRe-Identification

6.1 Evaluated methods

As stated in Sect. 2.1, NLP-based methods (Meystre et al. 2010; Aberdeen et al. 2010; Chen et al. 2019; Dernoncourt et al. 2017; Elazar and Goldberg 2018; Hassan et al. 2018; Huang et al. 2020; Johnson et al. 2020; Liu et al. 2017; Mamede et al. 2016; Neamatullah et al. 2008; Reddy and Knight 2016; Szarvas et al. 2007; Xu et al. 2019; Yang and Garibaldi 2015; Yogarajan et al. 2018) usually approach anonymization as a NER task, in which words belonging to allegedly re-identifying categories (e.g., locations, names, dates, etc.) are masked. In this case, masking consists of replacing the detected entities by their semantic categories. We evaluated the following tools for NER-based text anonymization (Lison et al. 2021):

1. Stanford NER (Manning et al. 2014): offers three pre-trained NER models: NER3, which detects the LOCATION, ORGANIZATION and PERSON types; NER4, which detects LOCATION, ORGANIZATION, PERSON and MISC (miscellaneous); and NER7, which detects LOCATION, ORGANIZATION, MONEY, DATE, PERCENT, PERSON and TIME.

2. Microsoft PresidioFootnote 11: an anonymization-specific tool based on NER. Among the variety of entity types supported by Presidio, we enabled those corresponding to quasi-identifying information: PERSON, LOCATION, NRP (person's nationality, religious or political group) and DATE_TIME.

3. spaCy NERFootnote 12: we employed the en_core_web_lgFootnote 13 neural model trained on the OntoNotes v5 corpus (Weischedel et al. 2011), which is capable of detecting named entities of the types DATE, CARDINAL, EVENT (e.g., wars), GPE (e.g., countries, cities), FAC (e.g., buildings), LANGUAGE, LAW (named documents made into laws), MONEY, LOC (non-GPE locations such as mountain ranges), NORP (nationalities or religious or political groups), ORDINAL, PERCENT, ORG, PERSON, PRODUCT (e.g., vehicles), TIME (times smaller than a day), QUANTITY and WORK_OF_ART (e.g., book titles, songs).

For PPDP-grounded text anonymization methods, as discussed in Sect. 2.1, the only practical method we found is (Hassan et al. 2021), which is based on word embedding models. Hereafter, we will refer to this method as Word2Vec, as it is built upon this neural model. In comparison with the fixed and non-configurable anonymization of NER-based methods, Word2Vec stands out as the only evaluated method that can be tuned to adjust the trade-off between privacy and utility. This is controlled by a threshold t that determines the minimum embedding similarity required for a word to be masked. In our experiments, we employed two threshold values: t = 0.25, which was observed by the method's creators to maximize privacy protection, and a more relaxed t = 0.5, which results in less masking and, therefore, weaker protection. For this method, we leveraged its implementation available on GitHub.Footnote 14

In addition to the methods listed above, we also considered the manual anonymization effort conducted in Hassan et al. (2021), which was based on sound privacy-oriented annotation guidelines. This enables us to evaluate the robustness of manual anonymization under TRIA, but also makes it possible to compute the recall of the above-described methods (by considering the manual anonymization as the ground truth) and to compare this metric against the risk assessment of TRIR.

Finally, we also report the TRIR for the non-anonymized versions of the documents in AD. This defines the baseline risk that anonymization methods should (substantially) reduce.

6.2 Evaluation corpus and background documents

The corpus described in Hassan et al. (2021) was employed as evaluation data. It consists of 18,672 Wikipedia articles belonging to the “twentieth century actors” Wikipedia category, each providing biographical details of a movie actor. To simulate the scenario depicted in Fig. 1, we considered the article abstracts as the texts to be anonymized, and the article bodies, which provide extended details on the abstracts' information, as the identified publicly available documents on the individuals to be protected. A subset of 50 article abstracts corresponding to 50 contemporary, popular and English-speaking actors was selected from this corpus (as done in Hassan et al. (2021)), and we considered them the set of documents to be anonymized. In the attack's formal notation, the 50 extracted actors constitute AI, the 50 abstracts anonymized with a specific method m define ADm, and the article bodies in the corpus define BD, with a population of BI actors.

As introduced in Sect. 4.2, TRIA (and subsequently TRIR) depends, among other attacker's resources, on the background knowledge breadth; that is, on the overlap between BI (the complete set of known individuals in the background knowledge) and AI (the set of individuals referred to in the anonymized documents, which are unknown to the attacker), and on the relative size and similarity of those two sets. To evaluate the background knowledge breadth, we defined several scenarios by setting increasingly larger BDs and corresponding BIs with varying distributions:

  1. 1.

    50_eval: this is a worst-case scenario for privacy, in which BI exactly matches AI. Subsequently, BD comprises the 50 article bodies from Wikipedia corresponding to the 50 anonymized abstracts from those same articles. This scenario constitutes the easiest re-identification setting because there are no individuals in BI absent from AI and, therefore, defines an ideal attacker with respect to background breadth.

  2. 2.

    500_random: a scenario consisting of 500 article bodies randomly extracted from the full corpus plus those corresponding to the 50 actors in AI that were not included in the random selection (thereby ensuring that AI ⊆ BI). Compared to the 50_eval scenario, BI is larger, with most of its individuals absent from AI. Nevertheless, the similarity between the randomly selected individuals not present in AI and the popular English-speaking actors of AI is expected to be low, thereby facilitating re-identification. This scenario defines a more feasible but still powerful attack.

  3. 3.

    500_filtered: a more realistic scenario with 581 article bodies obtained by filtering the full dataset of 18,672 articles according to several attack-oriented criteria. The criteria aim at maximizing the number of individuals in AI present in BI (even without knowing AI, as it would happen in practice). In particular, the filtering discarded article bodies corresponding to non-native English speakers, dead individuals, non-actors (e.g., directors or producers), actors born before 1950 or after 1995 (latter included) and those whose article included less than 100 links and was available in less than 40 languages. The latter two criteria aim to capture the ‘popularity’ of the actor. As a result, only 40 out of the 50 actors in AI appeared in BI, thereby limiting re-identification accuracy to 80%. Higher intra-similarity than for the 500_random scenario is expected due to the tailored (i.e., non-random) filtering. This defines a more realistic scenario, because the attacker knows that AI individuals are popular, English-speaking and alive actors. The attack is considered a powerful yet credible threat.

  4. 2000_filtered: a more realistic scenario with 1,952 article bodies selected by applying the same filtering as in 500_filtered but omitting the criterion on the number of languages per article. As a result, 41 of the 50 actors in AI appeared in BI, thereby limiting the achievable re-identification accuracy to 82%. This scenario is a superset of 500_filtered, with 1,371 new individuals added due to the less strict filtering, most of them not present in AI. The intra-similarity is expected to be slightly lower than among the 581 individuals of 500_filtered, but still significantly higher than in 500_random. This defines a weaker but probably more feasible attack, with a more realistic population size for the background knowledge.
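The attack-oriented filtering sketched below illustrates how the 500_filtered and 2000_filtered populations could be derived from article metadata; the metadata field names are hypothetical, and treating the link and language criteria as independent requirements is our assumption.

```python
# Illustrative filtering for the 500_filtered and 2000_filtered scenarios (hypothetical
# metadata fields). 2000_filtered is obtained by dropping the language-count criterion.
def filter_background(articles, require_languages=True):
    selected = []
    for a in articles:
        if not (a["native_english_speaker"] and a["alive"] and a["is_actor"]):
            continue                                  # discard non-native, dead, non-actors
        if not (1950 <= a["birth_year"] <= 1994):     # discard born before 1950 or after 1994
            continue
        if a["n_links"] < 100:                        # popularity proxy 1: article links
            continue
        if require_languages and a["n_languages"] < 40:
            continue                                  # popularity proxy 2 (500_filtered only)
        selected.append(a)
    return selected

# bd_500_filtered  = filter_background(all_articles, require_languages=True)   # 581 bodies
# bd_2000_filtered = filter_background(all_articles, require_languages=False)  # 1,952 bodies
```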

As we later report, for larger cardinalities of BD the training of the model is likely to take longer than 12 h with the (reasonably powerful) hardware configuration we employed. Specifically, ten fine-tuning epochs on the whole set of 18,672 articles take approximately 21 h. In such cases, in which the number of background documents available to the attacker outweighs their computational capability to train the model and conduct an exhaustive attack (and provided that the attacker does not know AI), the strategy employed to define the 500_filtered and 2000_filtered scenarios would be the most reasonable choice. This reaffirms that these two latter scenarios are more realistic than 50_eval and 500_random, which define worst-case (from a privacy perspective) but more idealized (from the attacker’s perspective) settings.

After defining BD for each scenario, the corresponding B’D, CD and C’D sets needed to build the training and development sets should be created as detailed in Sect. 5.2. For B’D, the documents in BD are anonymized using spaCy NER. After that, CD is defined as a subset of the abstracts corresponding to the bodies in BD. Since the abstracts are summaries of the article bodies, this procedure can be considered equivalent to the summarization-based approach proposed in Sect. 5.6, so CD does not need to be explicitly built through automatic summarization. Finally, the same masking method employed for creating B’D is applied to the documents in CD, which results in the C’D set that constitutes the development set. The relative size of CD was set to 30% for 50_eval, and to 10% for the 2000_filtered, 500_filtered and 500_random scenarios.
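As an illustration of the NER-based construction of B’D (and, analogously, C’D), the following sketch masks every entity detected by a spaCy pipeline with its entity type label; the exact replacement format used in our implementation may differ.

```python
# Sketch of NER-based masking with spaCy: each detected entity span is replaced by its
# entity type label. The replacement format (bare label) is an assumption.
import spacy

nlp = spacy.load("en_core_web_lg")  # any English spaCy pipeline with a NER component

def mask_entities(text: str) -> str:
    doc = nlp(text)
    parts, last = [], 0
    for ent in doc.ents:
        parts.append(text[last:ent.start_char])
        parts.append(ent.label_)        # e.g. PERSON, GPE, ORG, DATE, ...
        last = ent.end_char
    parts.append(text[last:])
    return "".join(parts)

# B'D and C'D are obtained by applying the same masking to the documents in BD and CD:
# bd_anon = {name: mask_entities(body) for name, body in BD.items()}
```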

6.3 Testing environment and hyperparameters

To simulate the implementation of TRIA by potential attackers, we selected Google Colaboratory (a web-based IDE for interactive Python programming on the cloud) as the execution environment. Our experiments were executed in a stable environment consisting of an Intel Xeon CPU, 12 GB of RAM and an Nvidia Tesla K80 GPU with 16 GB of VRAM.

As discussed in Sect. 5.1, a BERT-based classifier was employed to conduct the attack. We considered the popular pre-trained language models based on BERT outlined in Table 1. For the experiments varying the background knowledge, we opted for DistilBERT (Sanh et al. 2019), because it offers an outstanding trade-off between accuracy and computational cost.

The remainder of our implementation followed Sects. 5.3, 5.4 and 5.5. Hyperparameter search was performed by using the re-identification accuracy on the development set as the selection criterion, as outlined in Sect. 5.6. Ideally, this search should be performed independently for each BD defined in the previous subsection. Nevertheless, given the number of tests to be conducted, their runtime, and the similarities between the scenarios, we only applied it to the 50_eval scenario and then employed the obtained parameters for all the evaluated scenarios. Concretely, the hyperparameters that obtained the best accuracy on the development set were: learning rate = 5e−5 (for both additional pre-training and fine-tuning), batch size for additional pre-training = 8, batch size for fine-tuning = 16, sliding-window length/overlap for additional pre-training = 512/128, and sliding-window length/overlap for fine-tuning = 100/25. Training relied on the Adam optimizer with betas set to 0.9 and 0.999. The additional pre-training was performed for 3 epochs and the fine-tuning for a maximum of 20 epochs. Using early stopping with a patience of 5 epochs and the accuracy on the development set as the criterion, fine-tuning ran for ~ 20 epochs in the 50_eval, 500_random and 500_filtered scenarios, and for ~ 10 epochs in the 2000_filtered scenario.
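For reference, these hyperparameters map naturally onto a Hugging Face Trainer configuration such as the sketch below; the dataset objects (sliding-window splits labelled with the corresponding individual), the number of labels and the metric helper are placeholders, and our actual training code may be organized differently.

```python
# Sketch of the fine-tuning setup with the selected hyperparameters (Hugging Face
# Trainer). train_splits/dev_splits are placeholders for the sliding-window splits
# (length 100, overlap 25) labelled with the individual each split belongs to.
import numpy as np
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, EarlyStoppingCallback)

def accuracy_metric(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=50)          # placeholder: one class per individual in BI

train_splits = dev_splits = None        # placeholders: tokenized 100/25 window datasets

args = TrainingArguments(
    output_dir="tria_classifier",
    learning_rate=5e-5,                 # same rate used for additional pre-training
    per_device_train_batch_size=16,     # fine-tuning batch size
    num_train_epochs=20,                # upper bound; early stopping may halt earlier
    adam_beta1=0.9, adam_beta2=0.999,   # Adam betas
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",   # re-identification accuracy on the dev set
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_splits,
                  eval_dataset=dev_splits,
                  compute_metrics=accuracy_metric,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
# trainer.train()
```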

6.4 Influence of attacker’s resources

Hereafter we perform a comprehensive analysis of TRIA and TRIR according to the resources assumed to be available to attackers. Following the characterization of such resources introduced in Sect. 4.2, we considered:

  1. Background knowledge breadth: By using the different background sets presented in Sect. 6.2, we assessed the role of the size and type of the population in the background knowledge available to attackers.

  2. Background knowledge depth: We considered two different sizes for the background documents (i.e., article bodies): full bodies, which were used in most experiments, and the first 25% of the body contents, which were employed to analyze the influence of the background knowledge depth.

  3. Computational capabilities: We evaluated how the complexity of the attacker’s inference affected TRIA’s accuracy and runtime. This involved using different document splitting strategies (e.g., sliding windows of different sizes), the use of the anonymized background documents (i.e., B’D), and varying the size of the base language model employed to build the classifier (i.e., DistilBERT, BERT or RoBERTa).

By configuring these aspects, we can set ideal (i.e., most powerful) or more feasible (but less powerful) attack scenarios.

6.4.1 Background knowledge breadth

Results depicted in Fig. 4 show the influence of the different background knowledge breadths (corresponding to the BDs described in Sect. 6.2) on the re-identification of the anonymized documents (AD) resulting from the methods introduced in Sect. 6.1. Notice that the “average” column in this and the forthcoming figures has been computed without considering the results for ADClear text.

Fig. 4 TRIR using different background knowledge breadths for several anonymization approaches. Horizontal lines depict the upper bounds for 500_filtered and 2000_filtered

Results show that increasing the background knowledge breadth significantly reduces the attack’s accuracy, which is in line with the discussion in Sect. 4.2. The only exception is the 500_random scenario which, despite having a BI set much larger than AI, resulted in only a slightly lower re-identification risk than 50_eval. This was caused by the low similarity between the randomly selected individuals and the 50 protected ones, which made the latter easily distinguishable within the random set. In comparison, the risk for the filtered BDs (500_filtered and 2000_filtered) was significantly lower because i) not all the protected subjects were present in BD, and ii) those present were closer to the other individuals in BD and, therefore, more difficult to differentiate.

It is worth noting that the TRIR of ADClear text (i.e., non-anonymized documents) is very close to the maximum risk of each scenario, being 100% for 50_eval, 98% for 500_random and 74% for 500_filtered and 2000_filtered. This demonstrates the capability of the additionally pre-trained and fine-tuned DistilBERT model as a classifier for the re-identification task. For anonymized documents, TRIA was able to re-identify individuals even from ADManual, with accuracies greater than the random guess, which was 2% for 50_eval, 0.2% for 500_random, 0.17% for 500_filtered and 0.05% for 2000_filtered. The fact that ADManual did not achieve perfect protection (i.e., its re-identification accuracy was above the random guess) highlights the susceptibility of human annotations to omissions and errors, as discussed in Sects. 1 and 3. This is caused by the difficulty of accounting for all the background knowledge and, therefore, for all the (quasi-)identifying information (and their combinations) during manual anonymization.
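The random-guess baselines quoted above follow directly from the scenario cardinalities (one correct guess out of |BI| candidates), as the following quick check shows for the scenarios whose |BI| is reported exactly.

```python
# Random-guess re-identification baselines: 100 / |BI| percent for an attacker picking
# an individual uniformly at random (scenarios with exactly reported |BI| only).
for scenario, bi_size in [("50_eval", 50), ("500_filtered", 581), ("2000_filtered", 1952)]:
    print(f"{scenario}: {100 / bi_size:.2f}%")   # 2.00%, 0.17%, 0.05%
```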

Results for the NER-based methods show weak protection, with re-identification risks above 45% in the 50_eval worst-case scenario and still greater than or equal to 20% in 2000_filtered. Results for ADNER7 are the worst in all cases, probably due to the lack of LOCATION and MISC types, which encompass a large variety of quasi-identifying information. In contrast, the Word2Vec instantiations achieved the lowest risks among all automated methods and across all BDs. As expected, ADWord2Vec t=0.25 produced better protection than ADWord2Vec t=0.5 thanks to its stricter threshold. The risk figures of ADWord2Vec t=0.25 are also quite similar to those obtained by the manual anonymization. The outstanding protection of this method results from not limiting masking to a pre-defined set of categories (as NER-based methods do).

The runtime of TRIA mainly depends on the classifier fine-tuning step, which in turn is influenced by the number of samples. On this basis, 20 epochs of fine-tuning on 50_eval took 30’, whereas 500_random required 1 h 40’. For 500_filtered, the same 20 epochs took 4 h 10’, 2.5 times more than for 500_random. This is in line with the length of the documents, which is 3 times larger for 500_filtered than for 500_random, because the popularity-related filters resulted in longer articles (i.e., popular actors have longer articles). Lastly, 10 epochs with 2000_filtered required 5 h.

6.4.2 Background knowledge depth

To analyze the influence of the background knowledge depth available to attackers, we created reduced versions of the BD sets presented in Sect. 6.2 encompassing only the first 25% of the content of the corresponding Wikipedia article bodies. Figure 5 depicts the results for all the anonymization methods with the depth-reduced background knowledge sets.

Fig. 5 TRIR using the first 25% of the content from the background knowledge texts

Comparing these results with those in Fig. 4, we observe significantly lower risk figures, with values near zero for the manual and Word2Vec approaches in some configurations. Differences were more pronounced for the largest background knowledge set (2000_filtered), which indicates that, for larger background populations (breadth), the amount of background knowledge available per individual (depth) is crucial to differentiate individuals and enable accurate re-identifications.

Since the overall risk decreased, the gap between the different automatic anonymization methods also narrowed. Nonetheless, their relative rankings remain largely unchanged. An exception to this trend is observed for ADNER3 and ADNER4, which obtained risks comparable to those of ADWord2Vec t=0.5. This suggests that the quasi-identifiers missed by those NER-based methods (and exploited by TRIA) were mostly located beyond the first 25% of the content of the background documents.

6.4.3 Computational capabilities

The most computationally intensive component of TRIA is building the classifier (see Sect. 5.4), particularly during the fine-tuning step, when the model is trained for re-identification. We next report experiments that evaluate how the different settings that attackers can tune to improve training (and, therefore, improve inference quality at the cost of computation time) affect TRIA’s accuracy and runtime. To maximize the observable effect of changes to those settings, all tests were conducted on the 50_eval worst-case scenario, which is the least affected by the size of BI and its divergence from AI.

Figure 6 illustrates the effect of the splitting technique (detailed in Sect. 5.3) used during fine-tuning which, as stated in Sect. 5.5, is the same used for prediction. Sliding-window N/M defines a window of length N with an overlap of M tokens with the previous window. In turn, sentence-based stands for one split per sentence, an alternative splitting technique that produced splits with an average length of 23.5 tokens.
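A sliding-window split of this kind can be expressed in a few lines; the sketch below works over an already-tokenized sequence and is only an approximation of our implementation.

```python
# Sketch of sliding-window splitting: windows of `length` tokens, each overlapping the
# previous one by `overlap` tokens (i.e., a stride of length - overlap).
def sliding_window_splits(tokens, length=100, overlap=25):
    stride = length - overlap
    return [tokens[start:start + length]
            for start in range(0, max(len(tokens) - overlap, 1), stride)]

# Example: a 512-token document yields 7 splits with 100/25 windows and 4 splits with
# 200/50 windows, i.e., roughly half as many fine-tuning samples for the larger window.
```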

Fig. 6 Comparison between different split sizes at fine-tuning for the 50_eval scenario

As shown in Fig. 6, there is a general trend whereby smaller split sizes produce more accurate results. This is in line with the hypothesis made in Sect. 5.4, which associated longer window lengths with overfitting. During the fine-tuning step, using long splits may result in samples too specific to the publicly available sources of the individual (i.e., the article’s body in BD and its anonymized version in B’D), while also reducing the number of samples from which to learn the class (since, in general, the longer the splits, the fewer of them are available). Consequently, the model is likely to overfit to the training documents rather than learning generalizable (quasi-)identifiers that facilitate re-identification. This makes shorter splits preferable, although they increase the fine-tuning runtime. In fact, fine-tuning time scales almost linearly with the number of splits and, therefore, runtimes are significantly larger for smaller split sizes (i.e., 15,873 splits and 2 h 30’ for sentence-based, 4,217 splits and 30’ for the 100/25 sliding window, and 2,088 splits and 15’ for the 200/50 sliding window). For this reason, we used the 100/25 sliding-window configuration in all the other tests as a well-balanced trade-off between fine-tuning time and accuracy.

Figure 7 presents an ablation study in which the classifier is trained only with the raw bodies BD or only with the anonymized bodies B’D, instead of with the combination of both sets (as proposed in Sect. 5.4). These ablations almost halve the training data (and the consequent runtime) of the additional pre-training and fine-tuning steps. Figure 7 shows that, in most cases, important benefits are obtained by using B’D instead of BD, and that their combination further improves the results. The improvements are considered worth the increase in training time (30’ instead of 20’). This is in line with the arguments given in Sect. 5.4, with B’D probably helping the model to deal with anonymized documents, and BD providing the information concealed by the masking replacements. Specifically, documents in B’D allow the model to adapt to the data distribution of anonymized documents, which differs from that of BD due to the masking replacements. Moreover, they allow the model to discover text spans neglected by the anonymization method that are also present in documents from AD. When BD and B’D are combined, the model can additionally learn how documents have been anonymized, thereby facilitating re-identification.

Fig. 7 Ablation study on the inclusion of anonymized bodies in training for the 50_eval scenario

Finally, Fig. 8 illustrates the effect of the neural language model employed for re-identification. We compare the results obtained with DistilBERT (distilbert-base-uncased, with ~ 67 M parameters) with those obtained with the standard BERT model (bert-base-uncased, with ~ 110 M parameters) and RoBERTa (roberta-base, with ~ 124 M parameters). Notice that the combination of the narrowest knowledge breadth (50_eval), the deepest background knowledge (complete article bodies) and the largest model (RoBERTa) defines the most unconstrained and powerful attacker configuration considered so far. Fine-tuning runtimes in this experiment were 30’ for DistilBERT, 1 h for BERT and 1 h 15’ for RoBERTa. The average accuracies show that, in comparison with DistilBERT, the improvement provided by BERT is small, whereas RoBERTa performed slightly worse for the automated methods. The minor improvement of BERT is in line with the negligible loss of accuracy (~ 3%) observed by the authors of DistilBERT in other tasks (Sanh et al. 2019). The relatively poor performance of RoBERTa is, on the other hand, surprising. One possible explanation lies in the data employed for its pre-training. BERT, DistilBERT and RoBERTa all use Wikipedia as training data, which therefore matches our evaluation dataset; but, whereas for BERT and DistilBERT Wikipedia represents 75% of the training data, for RoBERTa it is just 7.5%. Exceptionally, the re-identification accuracy for ADManual seems to benefit significantly from the (usually) better-performing models (i.e., RoBERTa > BERT > DistilBERT). This may be motivated by the stricter manual anonymization, which forces re-identification to focus on the few highly specific (quasi-)identifiers that may remain in the anonymized documents. In that case, re-identification depends more on the amount of knowledge available to the model than on Wikipedia-specific knowledge.

Fig. 8 Effect of different language models on TRIR

6.5 Comparison with recall and human annotations

As mentioned in Sect. 3, the goal of empirical risk assessment is to evaluate whether the protection level achieved by the anonymized documents fulfils the privacy requirements of the data holder. The recall metric (Eq. 2) employed in the literature assesses risk by comparing the automatic anonymization with manual annotations, which are considered the anonymization ground truth. Consequently, it evaluates a single (and presumably strict) privacy requirement, the one embodied by that manual ground truth. In contrast, TRIA (and, therefore, TRIR) can be configured to evaluate distinct privacy requirements according to the assumptions made on the attacker’s resources. As shown in the previous section, by calibrating those resources one may simulate ideal/powerful attackers that define a worst-case scenario for privacy, or more feasible (and less powerful) attackers with more realistic access to data sources and computational capabilities.

In the following, we compare recall and TRIR under different attacker configurations with varying background knowledge breadths and depths which, as shown in the previous section, are the factors with the greatest influence on the risk assessment. To compute the recall of each method, we used its standard implementation and the manual annotations from Hassan et al. (2021) as ground truth. Given that recall quantifies the inverse of the disclosure risk, we used its complement (100% − Recall) to enable a direct comparison with TRIR. The comparison was done in terms of the Pearson correlation between both metrics across the different anonymization methods considered. Results are reported in Table 2.
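For completeness, the correlation reported in Table 2 can be computed as in the sketch below, where the per-method recall and TRIR values are assumed to be given as percentages.

```python
# Sketch of the recall-vs-TRIR comparison: Pearson correlation between the complement
# of recall (100 - recall) and TRIR across the evaluated anonymization methods.
from scipy.stats import pearsonr

def risk_correlation(recall_by_method, trir_by_method):
    """Both arguments: dicts mapping method name -> value in percent."""
    methods = sorted(recall_by_method)
    recall_complement = [100.0 - recall_by_method[m] for m in methods]
    trir_values = [trir_by_method[m] for m in methods]
    r, p_value = pearsonr(recall_complement, trir_values)
    return r, p_value
```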

Table 2 Correlation between recall and TRIR with different attacker’s resources for all the anonymization methods

Correlation figures vary substantially, with the highest value corresponding to the most powerful attacker configuration (50_eval background breadth and 100% background knowledge depth). This is consistent with the manual annotations of Hassan et al. (2021), which were based on sound privacy-oriented annotation guidelines and which, as shown in the previous section, resulted in the strongest anonymization. Conversely, the lowest correlation corresponded to one of the weakest (but also most feasible) attacker configurations.

Figure 9 depicts a detailed comparison between recall and TRIR for the most and least correlated attacker configurations. We can see that, even though recall was strongly correlated with TRIR’s most powerful attack scenario, it also systematically underestimated the observed re-identification risk. This results from recall relying on (imperfect) manual annotations, which exhibited a non-negligible re-identification risk (10%) under this attacker configuration. Differences with respect to the least correlated configuration were larger, which shows the inability of a single set of annotations to encompass different privacy requirements or attack scenarios. These results showcase TRIR as an unsupervised alternative to recall that is free from its dependency on human annotations and more flexible in encompassing varying privacy requirements.

Fig. 9 Comparison between the standard recall metric and the most and least correlated TRIR attack configurations

As a complement to the former results, and to better understand the limitations of manual annotations, we also measured the precision of the automatic anonymization methods. Precision computes the percentage of automatically masked text spans that were also annotated by humans, and it is used in the literature as an estimate of the utility preserved by anonymized documents. A low precision indicates that terms were unnecessarily masked, thereby hampering the utility of the anonymized documents. Figure 10 reports precision values for the different methods using the same implementation as for recall.

Fig. 10 Precision of automatic anonymization methods

Consistently with the previous privacy assessments, there are notable differences between the NER-based approaches and the Word2Vec-based method. On the one hand, the higher precision of NER-based methods derives from the fact that most of the (limited) NE types they are able to detect actually correspond to (quasi-)identifying information. The decrease in precision among these methods is related to the diversity of named entity types each method covers, with broader coverage resulting in lower precision. Because this broader coverage not only failed to provide better privacy, but actually resulted in the highest risks (see NER7, Presidio and spaCy in Fig. 9), we can conclude that more extensive NER coverage does not necessarily lead to better anonymization.

On the other hand, the Word2Vec-based method resulted in the lowest precision. However, the fact that both Word2Vec instantiations obtained nearly identical precision but significantly different recall/TRIR makes for an interesting analysis. First, this indicates that most of the identifying text spans detected in ADWord2Vec t=0.25 but not in ADWord2Vec t=0.5 were also detected by humans in ADManual, thereby increasing recall. Those are the text spans whose embeddings obtained a similarity in the range (0.25, 0.5] with the individual’s name, probably corresponding to slightly disclosive quasi-identifiers. Consequently, the 67% of text spans detected in ADWord2Vec t=0.5 that were not detected by humans in ADManual obtained similarities in the range (0.5, 1] and, because of their closeness to the individual’s identity, are expected to correspond to highly disclosive quasi-identifiers. If the Word2Vec criterion is minimally correct (which it certainly seems to be, given its strong protection), this would imply that the human annotators missed some disclosive text spans. This observation aligns with the non-negligible risk measured for the manual anonymization in most attack configurations, and underscores the difficulty humans have in detecting the non-obvious (quasi-)identifiers that attackers can leverage to conduct re-identifications.

7 Conclusions and future work

We have proposed two privacy evaluation tools for text anonymization: a Text Re-Identification Attack (TRIA) and an associated Text Re-identification Risk (TRIR) metric. Compared to the recall-based risk assessment ubiquitously employed in the text anonymization literature, our approach provides an unsupervised alternative that does not rely on costly manual annotations and that is flexible enough to encompass a wide variety of privacy requirements and attack configurations. Moreover, whereas recall is a completeness measure that has been used as a proxy for true risk assessment, our metric measures privacy in the same manner in which it would be threatened: by performing re-identification attacks that exhaustively exploit the available background knowledge. Empirical results highlighted the limitations of recall due to its dependency on (imperfect) human annotations. This is consistent with the limitations of using absolute recall values as a measure of protection/residual risk highlighted in recent studies (Lison et al. 2021; Pilán et al. 2022; Mozes and Kleinberg 2021). Moreover, we provided additional empirical evidence for the criticisms raised in Lison et al. (2021) and Hassan et al. (2021) on the drawbacks of using NER methods for text anonymization.

As future work we plan to explore additional attack-oriented techniques, such as leveraging the confidence of the classifier predictions to improve re-identification precision. We also plan to apply explainability techniques (Lundberg and Lee 2017) to TRIA in order to identify the terms that contributed most to correctly classifying/re-identifying documents. These can then be used as feedback to improve the anonymization, thereby creating a virtuous anonymization cycle in which the risk assessment contributes to a gradual and systematic reduction of the privacy risk. We also plan to design a utility measure in line with TRIR as an alternative to the precision metric commonly employed in the literature. As with recall, IR-based precision suffers from several limitations when employed to measure the degree of utility preserved in the anonymized outcomes, as it roughly equates the information loss resulting from anonymization to the number of false positives with respect to manual annotations (Pilán et al. 2022). Our idea is to propose a metric that, by also leveraging state-of-the-art language models, accurately measures the information loss incurred by each masking operation.