1 Introduction
Protecting individuals’ identities and sensitive information is crucial when sharing data and is required to comply with the principles of data protection regulations. The General Data Protection Regulation (GDPR), for example, was established to guarantee that all companies that use or collect data of European citizens keep those citizens’ identities protected [8]. To fulfill this requirement, the legislation recommends using anonymization or pseudonymization techniques.
Anonymization is a process in which personal data is irreversibly altered to preserve privacy [2]. Various anonymization techniques, such as generalization, suppression, and slicing, are applied to achieve this goal. Pseudonymization, on the other hand, does not modify the personal data itself. Instead, it replaces the personal identifier with a pseudonym [3]. This pseudonym is associated with the personal data and can be created using pseudonymization techniques such as counters or encryption, among others.
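To make the distinction concrete, the following is a minimal, purely illustrative sketch (not taken from the report) of two common pseudonym generation techniques mentioned above, a counter and a keyed hash; the class, function names, and pseudonym format are hypothetical.

```python
import hashlib
from itertools import count

class CounterPseudonymizer:
    """Counter technique: each new identifier receives the next integer."""
    def __init__(self):
        self._counter = count(1)
        self._mapping = {}  # identifier -> pseudonym; must be stored securely

    def pseudonymize(self, identifier):
        if identifier not in self._mapping:
            self._mapping[identifier] = "P{:06d}".format(next(self._counter))
        return self._mapping[identifier]

def hash_pseudonym(identifier, secret_key):
    """Keyed-hash technique: deterministic pseudonym derived from a secret key."""
    digest = hashlib.sha256((secret_key + identifier).encode("utf-8"))
    return digest.hexdigest()[:16]

p = CounterPseudonymizer()
print(p.pseudonymize("alice@example.com"))  # P000001
print(p.pseudonymize("bob@example.com"))    # P000002
print(p.pseudonymize("alice@example.com"))  # P000001 (same input, same pseudonym)
```

In both cases the mapping (or the secret key) links the pseudonym back to the personal data, which is exactly why pseudonymization, unlike anonymization, is reversible by whoever controls that link.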
Preserving individual privacy is paramount in many scenarios, such as healthcare, where data contains sensitive personal information. When this kind of data needs to be shared or analyzed, it must comply with regulations that safeguard the identities of individuals. Pseudonymization and anonymization techniques can help achieve this, but it remains challenging to identify and apply the most suitable methods. The choice of appropriate techniques often depends on the particularities of each project, such as its aim, scope, and whether data privacy or data utility is prioritized. Therefore, it is important to understand which techniques can be used in each context, especially when balancing data privacy and utility.
Identifying and documenting patterns can bridge the gap in determining the most appropriate strategy for the context of each project. This report presents the results, in terms of validation and pattern mining, of a focus group [7] conducted during the EuroPLoP 2024 conference. We sought three main outcomes: to obtain feedback on the patterns we had previously described, to obtain feedback on the pattern map we had outlined [5, 6], and to identify patterns related to the anonymization or pseudonymization of data that have not yet been documented.
The remainder of this report is organized as follows: Section 2 presents the methods followed in this research; Section 3 presents the results of the focus group; Section 4 discusses the limitations of the study; Section 5 discusses our findings and future work; and Section 6 summarizes the main findings of this research and outlines research directions.
2 Methods
Our study follows the method presented by Kontio et al. [4], which consists of the following steps:
• Defining the research problem,
• Selecting the participants, and
• Planning and conducting the focus group session.
These steps are detailed below.
2.1 Defining the research problem
Our specific aim is to provide insights regarding the following three research questions:
• To what extent do our documented patterns address recurring problems related to the anonymization of datasets?
• To what extent does the previously designed pattern map show the main patterns related to the anonymization and pseudonymization of datasets?
• What techniques that can address recurring problems related to data anonymization and pseudonymization have not yet been documented as patterns?
2.2 Selecting the participants
This focus group was held as part of the EuroPLoP 2024 conference [1]. The conference provided both participants for the activity and a slot in the program to carry it out. During the event, we announced the session, outlined its theme and objectives, and invited conference attendees to join. In total, five participants took part in the session, in addition to the focus group moderators.
2.3 Planning and conducting the focus group session
The session is designed to last one hour and thirty minutes and to include at least three participants. It begins with each participant introducing themselves and with the moderators introducing the context in which they needed to guarantee data privacy. This step is important for understanding the background of each expert.
Next, the research team briefly presents the concepts of anonymization and pseudonymization. After the initial presentation, we ask whether the participants have used any anonymization or pseudonymization techniques in their professional practice. This question can help put into context the rest of the information gathered from the participants.
We then ask the participants to report which anonymization or pseudonymization techniques they have already used or heard about during their professional practice. This question aims to identify solutions to possible patterns used by the participants.
After gathering the cited techniques, we present and explain our pattern map and, together with the participants, attempt to match the cited techniques with the categories on the pattern map. Finally, we try to match the techniques with our own patterns.
3 Results
We found that two participants used anonymization to provide (and/or receive) test data. One participant stated that she understood the concepts but thought she had never used them. However, we noticed that this participant had already used manual pseudonymization techniques to hide IDs in documents before making them available. The other two participants had heard about anonymization and pseudonymization but never used any related technique.
We also found that participants had already used or heard about the following techniques in their professional practice: hide personal data; leave out the data from the dataset; deterministic pseudonym generation; shuffle attribute values; synthetic data that keeps the attributes of real data; don’t store the data; and add noise by adding attributes.
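As one purely illustrative reading of the "shuffle attribute values" technique cited by participants, the following sketch permutes a single attribute's values across records, breaking the link between that attribute and the rest of each record; the dataset, field names, and seed are hypothetical.

```python
import random

def shuffle_attribute(records, attribute, seed=None):
    """Permute one attribute's values across all records."""
    rng = random.Random(seed)
    values = [record[attribute] for record in records]
    rng.shuffle(values)
    return [dict(record, **{attribute: value})
            for record, value in zip(records, values)]

patients = [
    {"id": 1, "zip": "4000", "diagnosis": "flu"},
    {"id": 2, "zip": "4100", "diagnosis": "asthma"},
    {"id": 3, "zip": "4200", "diagnosis": "flu"},
]
shuffled = shuffle_attribute(patients, "zip", seed=7)
# The multiset of zip codes is preserved (keeping aggregate utility),
# but the association between a patient and a zip code is broken.
```

Note that shuffling preserves column-level statistics while destroying record-level correlations, which is the privacy/utility trade-off the participants' technique relies on.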
After collecting the aforementioned techniques, we presented our pattern map, which features a collection of patterns and their primary connections (cf. Figure 1). The pattern map is divided into three major categories: Anonymization, Pseudonymization, and Perturbation.
Within the anonymization category, we included ten patterns. Furthermore, two subcategories represent the suppression technique (Suppression) and the handling of outliers in the datasets (Handle Outliers). The anonymization category includes the Suppress Identifiers pattern, which can be used as an alternative to the pseudonymization category. The pseudonymization category contains five patterns. The other four pseudonymization patterns can support the Pseudonymize IDs pattern, because they are used to generate pseudonyms.
Finally, the Perturbation category is introduced as an alternative to anonymization. Instead of anonymizing the datasets, perturbation can be used to achieve privacy by adding noise to the data; the two categories thus complement each other. The Perturbation category includes two patterns that describe the two practices used in perturbation. We believe that more patterns could be documented in this category, relating to the noise techniques used to achieve the two models: global and local differential privacy.
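To illustrate the kind of noise technique this category refers to, here is a minimal sketch of the Laplace mechanism, the standard noise-adding building block behind both global and local differential privacy; the epsilon and sensitivity values are hypothetical, chosen only for the example.

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) distribution.
    u = rng.random() - 0.5           # u in [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def perturb(value, sensitivity, epsilon, rng):
    """Add Laplace noise calibrated to sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
noisy = perturb(42.0, sensitivity=1.0, epsilon=1.0, rng=rng)
# Each call returns the true value plus random noise; a smaller epsilon
# means more noise and therefore stronger privacy.
```

The noise is unbiased, so averages over many perturbed values remain close to the true value, which is what preserves aggregate utility while protecting individual records.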
After presenting the pattern map, we attempted to place the cited techniques within our pattern map categories. The result is shown in Figure 2. The only technique that we could not categorize was Don’t store the data, because we consider it a preventive data-privacy technique rather than an anonymization or pseudonymization technique.
Finally, to understand whether our documented patterns address recurring problems, we attempted to match the techniques with our patterns. We found that the techniques Hide personal data and Leave out the data from the dataset are equivalent to our patterns Suppress ID and Suppress QID.
4 Limitations
Some factors may have influenced the results of this study, limiting the validity of our conclusions. In particular, the small sample size of five participants recruited within the same community is very likely not fully representative of the range of professional experiences with anonymization and pseudonymization techniques. Validity can be limited by participants’ varied understanding of the key terms. We mitigated this possibility by clarifying concepts at the beginning and during the session.
5 Discussion
The focus group results highlight varying levels of familiarity and experience with anonymization and pseudonymization techniques among the participants, revealing both practical applications and gaps in understanding. The pattern map provided valuable insights, particularly in showing how suppression and pseudonymization patterns overlap or complement each other. For instance, suppression techniques, such as Suppress Identifiers, can serve as alternatives to pseudonymization, suggesting a fluid boundary between the two categories.
The pattern map categories accommodated most of the techniques elicited by the participants. This shows that although we have space for documenting and accommodating new patterns, the pattern map encompasses the most important categories of anonymization and pseudonymization patterns.
We also found that one-third of the techniques cited by participants were presented in our pattern map with different names. These results suggest that even though the literature documents some practices, further exploration is required to fully capture the range of privacy solutions used by practitioners.
6 Conclusion
We conducted a focus group with five participants during EuroPLoP 2024 to answer three research questions. The focus group results highlight varying levels of familiarity and experience with anonymization and pseudonymization techniques among the participants, revealing both practical applications and gaps in understanding.
Participants mentioned various techniques used in their professional practice and helped us in evaluating our documented patterns and pattern map. The answers to the research questions are presented below.
• To what extent do our documented patterns address recurring problems related to the anonymization of datasets? We found that one-third of the techniques cited by the participants were documented in our pattern map. This result indicates that our pattern mining technique successfully found recurring problems related to the anonymization of datasets.
• To what extent does the previously designed pattern map show the main patterns related to the anonymization and pseudonymization of datasets? The focus group results showed that we successfully designed the most important pattern categories related to the anonymization and pseudonymization of datasets. However, further exploration is required to fully capture the range of privacy solutions used in practice.
• What techniques that can address recurring problems related to data anonymization and pseudonymization have not yet been documented as patterns? We found four candidate techniques to be documented as patterns: deterministic pseudonym generation, shuffle attribute values, synthetic data that keeps the attributes of real data, and add noise by adding attributes.
In future work, we plan to explore the techniques identified in this focus group session to document them as patterns. Additionally, we plan to hold new focus group sessions with a broader audience to identify other important techniques used in practice and the main recurring problems they resolve.
Acknowledgments
We would like to thank the participants in our focus group at EuroPLoP 2024 for attending the session and helping us reflect on and discuss this topic: Tim Wellhausen, Thomas Majestrick, Luciane Adolfo, Diogo Maia, and Francisca Almeida.
This work is co-financed by Component 5 - Capitalization and Business Innovation, integrated in the Resilience Dimension of the Recovery and Resilience Plan within the scope of the Recovery and Resilience Mechanism (MRR) of the European Union (EU), framed in the Next Generation EU, for the period 2021 - 2026, within project HfPT, with reference 41.