1. Introduction
The inherent subjectivity of privacy poses significant challenges to defining and evaluating it within the contemporary digital landscape. Despite widespread privacy breaches impacting users’ lives, many individuals remain unaware of the risks associated with using data-centric, personalized services and applications. Traditional privacy actions, such as closing a door or talking to a limited group, are inapplicable in the uncontrolled digital environment. In addition, varying levels of computational literacy among individuals further complicate private interaction in the digital world. The digital world is plagued with uninformed consent, opaque data flows, inadequate user interfaces for comprehending privacy risks, complex privacy policies, and even outright neglect of user privacy. Legal frameworks and regulation are essential to create a minimum level of protection, but they alone cannot guarantee individual privacy or adapt to the inherent subjectivity of privacy.
User-centric design approaches encompass both legal and operational aspects to provide individuals with more control over their privacy. However, while software applications are presented to users as respecting their privacy, integrating robust privacy considerations into the Software Development Life Cycle (SDLC) remains a significant challenge. To address this gap, a privacy-centric software-development methodology, denoted DevPrivOps [
1], focuses on using Privacy Quantification (PQ) frameworks for robust privacy compliance throughout the lifecycle. DevPrivOps considers local and distributed software tests, also accommodating the risk of data combinations that allow private data to be observed from non-private data. Such combinations can result from the interaction between distributed services and user behavior.
A PQ model allows constant monitoring of the services’ level of privacy. It involves processing inputs (e.g., software, systems, or services), assessing privacy indicators, computing privacy levels using mathematical metrics or formal models, and producing a privacy scale (e.g., a color, number, or symbol). Both software developers and individuals benefit from the quantification outcome, as it enables them to make informed privacy decisions.
Consequently, there is a critical need for more robust methods to assess privacy levels. Beyond the subjective nature of PQ, the vast diversity of data across multiple services and contexts necessitates a standardized definition of data categories. In short, developing robust PQ requires a well-structured definition of data categories, enabling the classification of each data point associated with an entity.
A static data categorization (e.g., non-personal or personal) is not enough to accommodate data diversity and the different ways of obtaining data, including transformations that either increase or decrease data sensitivity. In addition, static categorization introduces a certain ambiguity [
2] when considering data from multiple sources, processed data (e.g., inferred data), or other data formats (e.g., metadata). Understanding user preferences and external influences is also essential. Individuals naturally categorize their personal information digitally [
3], but these categorizations vary widely.
Recognizing the need for a broader set of data categories, the possibility of inferring Personally Identifiable Information (PII) from non-PII data during service monitoring, and the need to prioritize user preferences (privacy expectations and external influences) in categorization, we define a hierarchical, multidimensional data-categorization framework that places users and their interactions at its core. Data can be assigned static categories, or dynamic categories derived from users, to cope with novel environments. The categorization is designed for decentralized architectures and multiple data sources from which PII can be inferred.
The proposed dynamic and user-centric data categorization, denoted by Privacy-sensitive Data Categorization (PsDC), is based on existing privacy literature, is supported by analysis of real datasets, and considers legal frameworks and privacy standard groups. The primary focus is on providing a dynamic data categorization that represents user behavior, preferences, and different data type representations, with applicability to enhance novel, data-centric services for the upcoming generation of telecommunication architectures [
4].
The categorization scheme presented herein is intended to serve as a cornerstone of our evolving privacy-quantification framework. That facet of our research is still under development and will be published in the near future. The privacy-quantification model itself is designed to evaluate the privacy level of continuous data streams, based on the categories defined here, observed user behavior, and the dynamically shifting context of both execution and data application. Furthermore, we are actively developing a Machine-Learning (ML)-based approach to automate the classification of data according to our categories, integrating both stream and semantic feature analysis. This undertaking is informed by, though notably expands upon, the foundational concepts explored in [
5].
The rest of the paper is organized as follows:
Section 2 presents the current background and most relevant state-of-the-art in terms of privacy and privacy-based data categories;
Section 3 describes the proposed data-categorization framework named PsDC, and its applicability is explored in
Section 4; finally, the main conclusions are in
Section 5.
2. Background on Privacy and Data Categories
The constant evolution of technology and the changes in society make the interpretation of the term “privacy” complex; it cannot be reduced to a single definition. The high subjectivity of privacy arises from the different notions of what is private and what is public that coexist in society. Subjectivity also arises from how familiar people are with the relevant discussions and from the knowledge that subjects may have regarding the potential value and risks of sharing some data. Thus, privacy encompasses what is appropriate for public view, which can be influenced by several external factors, including cultural aspects. In addition, privacy extends beyond physical spaces and experiences to encompass digital information [
6].
The immense potential of personal data collection for various sectors brings privacy concerns, since individual data extend far beyond their personal use. In an attempt to protect individuals’ privacy, regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) aim to achieve a balance between technological innovation and individual control. Fortunately, across the world we are witnessing the adoption and updating of such regulations [
7].
Generally, these regulations empower individuals with adequate rights to protect their data, such as the rights to access, rectify, and delete their information, ensuring transparency and user consent in data collection. Unfortunately, we continue to witness privacy breaches, making it imperative to combine what is known as soft privacy, from legal regulations, with hard privacy, from technical approaches [
8].
Concentrating on the hard-privacy concept and considering a data-driven world, every interaction on digital platforms generates data that identify individuals, potentially revealing intimate details about their lives if misused. These data can range from sensitive financial information to personal health records, as well as political views, religious beliefs, or sexual orientation. Digital privacy centers on the responsible handling and use of privacy-sensitive data within digital environments. Such private data comprise personal information, communication, and online behavior. The idea behind digital privacy is to protect individuals’ rights and expectations to maintain the confidentiality and security of their personal information in the digital realm.
Another dimension of the problem is related to the combination of data and the inference of private information. Revealing knowledge from anonymized personal information unlocks another dimension of its value. By analyzing vast datasets, researchers and businesses can gain valuable insights into consumer behavior, market trends, and social phenomena. This knowledge fuels innovation in areas such as personalized medicine, targeted advertising, and social network analysis. However, ensuring ethical and responsible practices is crucial, guaranteeing user privacy and adherence to data-protection regulations. Moreover, even anonymized data can be vulnerable to privacy breaches through techniques such as clustering analysis, which is a powerful technique for organizing unlabeled data into meaningful groups.
2.2. Understanding Data Privacy Categorization
The current vulnerabilities affecting personal data make it imperative to comprehend the value of data, to evaluate privacy-protection systems and treat data more adequately and effectively.
Different approaches are being proposed in the literature to understand the privacy value of some data. Some of these approaches focus on developing taxonomies, ontologies, and vocabularies that evaluate data types and their relationships based on data categories.
Several ontologies have been proposed to provide a formalized knowledge representation, acting as a shared vocabulary for entities, concepts, and their relationships within a specific domain. This unified model goes beyond simply assigning attributes to data sources. These ontologies provide a structured representation of privacy requirements, helping to differentiate them from security requirements and prevent privacy violations [
9,
10]. Privacy ontologies are crucial tools when addressing privacy concerns in various domains. The majority of the authors aim at proposing privacy ontologies to ensure legal compliance, to simplify the process of understanding privacy policies, or to analyze the usage of data processors that often neglect or even misuse personal data [
11].
However, these works are limited to specific scenarios or have a legal basis with more generic perspectives. Gharib et al. [
11] reviewed the literature and identified that, while several studies emphasize the importance of integrating privacy considerations throughout the system-design process, a comprehensive ontology that captures all the main privacy concepts and relationships is still lacking. To address this gap, the authors proposed a novel privacy ontology that incorporates the core privacy concepts and relations, considering the following dimensions: organizational, risk, treatment, and privacy. In their ontology, they refer to data as personal or public; they do not explore the concept of PII beyond personal or non-personal data.
A key challenge remains: How do we better define personal data from a technical perspective? A new level of dynamism is required to better understand the level of privacy required for different types or categories of data. Essentially, to protect digital privacy in current systems, we argue for the need to focus on dynamic data categorization, which can handle various data formats, storage methods, and processing techniques, including data inference or derivation. The risk of data inference arises when seemingly benign data elements, whether analyzed independently or through data fusion, enable the derivation of privacy-sensitive information pertaining to an entity.
By assigning privacy categories to the full extent of data collected from various sources and in various types, data can be handled appropriately. Proper personal-data categorization enhances transparency, risk assessment, and decision-making regarding appropriate measures: adjusting the required protection level, reducing data collection to only what is necessary for the purpose, and adjusting retention periods. Furthermore, privacy-based categories are essential to an agile PQ model applied in the SDLC, as demonstrated in the DevPrivOps methodology.
These benefits of a data-categorization method have generated widespread interest in the literature for exploring potential categories across different application domains (see
Table 1).
Some authors focused on specific scenarios, such as Sun and Xu [
12] in social network scenarios, but developed data-categorization models that can be generalized to multiple scenarios. One such example is the work of Oh [
3], where the focus was on how people categorize information files. The study identified three types of personal digital information categorizers, based on three types of mind structures: rigid categorizers, fuzzy categorizers, and flexible categorizers. To gather data, a questionnaire-based approach was conducted with 18 participants.
Focusing on autonomous vehicles, Mlada et al. [
15] evaluated the riskiness of personal data based on four general criteria: the nature of the controller, public availability of data, scope/amount of data, and use of data processors. The result of this evaluation affects the calculation and magnitude of the final value, denoted by
R. The main criteria used to determine this value are the evaluation areas, the possible outcomes, and additional individual evaluation coefficients. The resulting value is then mapped into four evaluation levels.
In the field of data classification, Rumbold and Pierscionek [
16] focused on big data scenarios and classified data into six different categories, and assigned them a sensitivity scale by proposing a spectrum. However, some studies prefer to generalize the categorization model rather than focusing on specific scenarios. Milne et al. [
13] proposed a classification system for personal information based on the perceived risk level associated with different data types. Researchers identified four types of risks—monetary, social, physical, and psychological—and grouped data into these categories. They also collected data that were aggregated into six clusters, allowing them to consider direct and indirect data. Moreover, Oh and Belkin [
14] focused on data that are difficult to categorize, such as ambiguous or anomalous information. Ambiguous information can be placed into multiple categories, while anomalous information does not fit into the existing organizational structure.
The studies presented in this context avoid providing a static evaluation, concentrating instead on more dynamic and broad data interpretations, to understand whether personal data can be inferred from diverse types of other data.
Oh and Belkin [
14] proposed a detailed study aimed at tackling the challenge of categorizing difficult data. Although the study did not cover all the issues related to this problem, it provided a valuable research direction to explore further. Furthermore, other studies focused on proposing a classification framework while considering various parameters to refine the categorization results.
When considering data categorization, it is important to assess how the proposed approaches were tested by the different studies mentioned. The majority of these studies focused on conducting interviews with real people to understand their preferences. This approach aligns with the concept of privacy, as individuals have the right to decide what information should be protected and what should not.
In addition, there is a lack of practical tools to categorize data. PII-Codex [
17], a significant advance in the implementation of a privacy-based categorization framework, is a comprehensive tool designed for the detection, categorization, and severity assessment of PII. It considers extensive theoretical, conceptual, and policy research on PII categorization and severity assessment, and is integrated with PII detection software.
However, the current literature lacks analytical studies that can provide a deeper understanding of the impact of data categorization.
3. Proposed Privacy-Sensitive Data Categorization
After a careful review of a representative set of data-categorization approaches, we found value in evolving the existing categories into a broader data-categorization framework, denoted PsDC, that focuses on technical details. In this section, we describe the strategy of this data-categorization approach and the clearly identified data items that are relevant for pervasive, data-centric services. We also aim to present a structure that can be enhanced with other data types, with the capability to evolve and to be adapted to novel scenarios and applications.
To better understand how to properly categorize data based on privacy aspects, we manually analyzed the features of several datasets and selected 11 representative datasets containing data that potentially reveal individuals’ private information. When referring to existing data categories from the literature, legal regulations, and standards, we found that we could not accommodate all possibilities for data categorization. This is due to the potential for data inference or derivation, even when there is no apparent linkability across multiple datasets. For this reason, we selected a set of existing data categories from the literature, regulations, and standards, and organized them to generically address the needs of such datasets when categorizing data based on privacy aspects.
Moreover, PsDC puts the entity that owns the data at the center of the process, together with its interest in classifying data as a means to both understand impact and manage risk. Note that the entity can be an individual, a group of people (e.g., a family), or a device (e.g., a car). Focusing on user-centric approaches, we consider the entity in this case to be an individual using today’s permanently online services. We consider two main top-level categories for user data: data that directly identify a user, and data from behavioral attitudes, which include data from social interactions. Moreover, we add a dynamic view by considering data tags and data attributes. This novel approach provides PsDC with an engineering dimension to analyze data and understand their relationships, including the derivation of private data from other data.
The PsDC framework utilizes a hierarchical structure for data categorization with two primary categories. Data that directly identify individuals, such as name, age, or usernames, belong to the broader identifier category. This category subdivides into different subcategories, including human, biological, and devices. Following the tree structure, each of them also subdivides into more specific data categories. Thus, we capture a wide range of data that directly identify individuals: the human category refers to data that identify an individual directly; the biological category, intuitively, refers to data from biological sources such as DNA; and the devices category refers to data from devices directly used by the individual.
The other primary category pertains to behavior analysis, which reflects individual actions and interactions. This category encompasses data from social and online interactions, transactional behavior (including shopping patterns or legal activities), digital footprints (such as social media posts), and actions that can be habitual or singular. This category is crucial for capturing data that, while not directly identifying individuals, can still compromise their privacy. Rather than analyzing behavior data and then attempting to categorize them as direct data, we proactively identify behavior data as a broader category to categorize data as they are obtained.
The “identifier” category draws from recognized standards and a thorough review of the existing literature. The “behavioral” category, in contrast, offers a novel perspective on data classification, recognizing that behavioral data are not easily captured by conventional, static categorization. Furthermore, we have incorporated tags and attributes to provide a dynamic viewpoint and enhanced control over the data-classification process.
Each category delves further into sub-levels, with the final level representing the most specific data classifications within the hierarchy (see
Table 2). This final level offers the most granular detail and allows for a wider range of possible values associated with each data type.
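As an illustration, the hierarchy described above can be sketched as a nested structure with a helper that resolves a leaf category to its full path. The category and leaf names below are an illustrative excerpt assumed for this sketch, not the complete set from Table 2.

```python
# Illustrative excerpt of the PsDC hierarchy (not the complete set of
# categories from Table 2): two primary categories, each subdivided into
# subcategories whose leaves are the most specific classifications.
PSDC_TREE = {
    "Identifier": {
        "Human": ["Subject", "Age"],       # data identifying the individual directly
        "Biological": ["DNA"],             # data from biological sources
        "Devices": ["Username"],           # data from devices directly used by the user
    },
    "Behavior": {
        "Social": ["Interaction"],         # social and online interactions
        "Transactional": ["Shopping"],     # shopping patterns, legal activities
        "Digital_Footprint": ["Post"],     # e.g., social media posts
    },
}

def category_path(tree, leaf):
    """Resolve a leaf category to its full hierarchical path,
    e.g. 'Subject' -> 'Identifier/Human/Subject'."""
    for top, subcategories in tree.items():
        for sub, leaves in subcategories.items():
            if leaf in leaves:
                return f"{top}/{sub}/{leaf}"
    return None

print(category_path(PSDC_TREE, "Subject"))  # Identifier/Human/Subject
```

The flat path notation mirrors the “Identifier/Human/Subject” form used throughout this section.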
Data itself is not the only factor influencing the potential to identify an individual. We argue for the necessity to integrate dynamic evaluations, to achieve a deeper understanding of the relationship between data points, and to facilitate the detection of potential inferences of personal data. For this reason, the framework also considers data tags (see
Table 3) and data attributes (see
Table 4). Tags provide an extra dimension to data categorization, allowing for the identification of specific contexts, for instance, health-related or sensor-generated data. Attributes, conversely, delineate data characteristics such as their format or stage of development.
Regarding data tags, our proposed data-categorization framework considers dynamic tags to categorize data. These tags can be assigned directly to data, serving as a categorization in themselves, or they can complement the categorization provided by the tree described above. Data tags result from the diverse ways of obtaining, representing, and interpreting data.
Data tags encompass health sources, which can be medical information or wellbeing data, as well as skills, payment types, frequency, sensing, spatial, temporal, logical, and statistical data.
The medical tag encompasses anything related to an individual’s diagnostics, symptoms, or disorders of public concern, such as a pandemic situation. Wellbeing tags encompass data from non-medical sources that are nevertheless related to health aspects, such as blood pressure captured by a sensing device, typically denoted well-being data.
Furthermore, the temporal and spatial tags account for the high potential of inferring private data from non-private data. The different representations of temporal and spatial data are not addressed by the remaining categorizations. Understanding these representations becomes even more critical during pattern recognition, given the very large set of possible representations for each.
Data attributes further refine the categorization process, enabling a deeper understanding of data characteristics and their potential to reveal personal information. This includes aspects such as data storage format (organization, ciphered, or plaintext), maturity, metadata, direct or indirect data, and the presence of URIs or descriptive texts.
These attributes provide an additional layer of data categorization, giving an extra understanding of data “value”. Whether data are plaintext or encrypted significantly impacts observability and inference possibilities: plaintext data are readily accessible for analysis, whereas encrypted data require decryption before analysis. Similarly, anonymized data offer a certain level of protection but still carry associated privacy risks, such as the risk of re-identification.
Furthermore, metadata associated with data can be categorized based on their origin (social, technical, operational, or business). This metadata categorization provides an additional layer of information that can contribute to inferences about personal data. These formats further refine data categorization and highlight potential privacy concerns associated with different data types.
Finally, another attribute that should be highlighted is the string attribute. While the remaining categories, tags, and attributes refer to the data themselves, this attribute subdivides into three subattributes that can be processed to obtain more data and enable subsequent categorization. A simple example is the name “Catarina”, which is categorized as “Identifier/Human/Subject”. However, if the dataset field contains “My name is Catarina”, we apply the categorization with the attribute “String/Description_Text”, since the field also reveals private data, allowing us to then identify the name within the string.
Descriptions for all categories are detailed in
Table 2 for the most specific categories of the broad categories “identifier” and “behavior”;
Table 3 for the data tags for the categorization model; and
Table 4 for the data attributes for the categorization model.
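The string-attribute example above can be sketched in code. The detection pattern and the two-step output shown here are illustrative assumptions; a real implementation would rely on dedicated PII-detection tooling rather than a single regular expression.

```python
import re

# Hypothetical sketch: a free-text field first receives the
# "String/Description_Text" attribute; a detector then extracts the
# embedded value so it can be categorized by the tree (e.g., a name
# becomes "Identifier/Human/Subject").
NAME_PATTERN = re.compile(r"[Mm]y name is (\w+)")

def categorize_field(value):
    """Return (attribute, categorized values) for a raw dataset field."""
    match = NAME_PATTERN.search(value)
    if match:
        return "String/Description_Text", [("Identifier/Human/Subject", match.group(1))]
    return None, []

attr, cats = categorize_field("My name is Catarina")
print(attr)  # String/Description_Text
print(cats)  # [('Identifier/Human/Subject', 'Catarina')]
```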
Despite the advancements of our proposed framework for data-privacy categorization, we found value in improving the approach by adding one more perspective: individuals’ privacy expectations and privacy influences. When individuals are the data-generating entity in the proposed framework, privacy preferences and influences are crucial, given the high subjectivity associated with the meaning of privacy and with privacy decisions.
Privacy preferences and influence properties within PsDC provide our framework with high dynamism through the use of different weights for different data categories. Privacy is inherently subjective, with users possessing diverse notions of what constitutes personal data. PsDC ensures that individuals can define their own privacy preferences, including expectations and influence.
This user-centric approach allows individuals to assign weights to different data categories, reflecting their personal perception of a privacy violation (perceived risk). For instance, a user might consider biometric data to be more sensitive than browsing history, and would assign a higher weight to the “biometric data” category within the framework. These weights can then be used to dynamically calculate the overall privacy level associated with data access or processing activities.
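A minimal sketch of such a weighted computation follows. The category names, weight values, and the simple averaging scheme are assumptions made for illustration; the actual quantification model is still under development.

```python
# Illustrative sketch: a user assigns higher weights to categories they
# perceive as more sensitive; the privacy level of a data access is then
# derived from the weights of the categories it touches.
user_weights = {
    "Identifier/Biological/Biometric": 0.9,      # perceived as highly sensitive
    "Behavior/Digital_Footprint/Browsing": 0.3,  # perceived as less sensitive
}

def privacy_level(accessed_categories, weights, default=0.5):
    """Average the user-assigned weights over the categories accessed;
    unknown categories fall back to a default weight."""
    if not accessed_categories:
        return 0.0
    return sum(weights.get(c, default) for c in accessed_categories) / len(accessed_categories)

score = privacy_level(
    ["Identifier/Biological/Biometric", "Behavior/Digital_Footprint/Browsing"],
    user_weights,
)
print(round(score, 2))  # 0.6
```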
Finally, the PsDC, presented as a simplified data-categorization model, is structurally hierarchical rather than a single mathematical function. However, we are developing two functional models: one for automated data classification using PsDC, employing semantic and stream features to cluster and categorize data, particularly effective in IoT contexts; and another for privacy quantification, which utilizes PsDC categories, sensitivity levels, and behavioral data to estimate privacy risks within data streams. These ongoing developments aim to provide a more dynamic and quantifiable approach to data privacy.
4. Applying PsDC in DevPrivOps Use Cases
DevPrivOps [
1] is a novel methodology for software development that treats privacy as a primary requirement rather than an afterthought. Achieving this requires developing adequate tools that help developers produce privacy-enhanced applications, but also means for users to track and understand how their data are gathered, processed, and used.
PsDC can be understood as a machine-readable and human-understandable file containing mandatory privacy fields, such as data type, tags, attributes, associated risk, and personalized fields to include user preferences. This file can serve as input for PQ, evaluating whether the software meets the required privacy standards.
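A hypothetical sketch of such a file for a single data field is shown below, serialized as JSON. The field names are illustrative assumptions, not a finalized schema.

```python
import json

# Hypothetical PsDC manifest for a single data field; the field names
# are illustrative, not a finalized schema.
manifest = {
    "field": "heart_rate",
    "category": "Identifier/Biological/Vital_Sign",
    "tags": ["Health/Wellbeing", "Sensing", "Temporal"],
    "attributes": {"format": "plaintext", "metadata_origin": "technical"},
    "associated_risk": "medium",
    "user_preferences": {"weight": 0.8},  # user-assigned sensitivity
}

# The serialized form is machine-readable (JSON) yet human-understandable,
# and can be handed to the privacy-quantification (PQ) component.
print(json.dumps(manifest, indent=2))
```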
Recognizing the inherently personal nature of privacy, we refrain from assigning fixed sensitivity levels to categories. Instead, we offer a framework, the PsDC, that empowers users or privacy specialists to define the appropriate sensitivity level for each category. We are, in fact, exploring the use of ML models to estimate the sensitivity of each category based on a user’s historical behavior. However, this aspect of our work is currently in its developmental stages.
As detailed in
Section 2, the proposed categories were designed to serve as inputs for the privacy quantification component. As previously mentioned (in
Section 1), an automated ML-based categorization approach is currently under active development. This endeavor draws inspiration from [
5] and integrates both stream and semantic feature analysis to establish a metric for data-stream similarity. Leveraging this similarity assessment, we aim to cluster previously uncategorized data against our curated, labeled data repository and then propagate categories and labels from the established data streams to the novel ones. The work that inspired this approach demonstrates promising outcomes for IoT stream categorization, and we anticipate analogous results in our domain, given the shared theoretical underpinnings of the respective models.
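The propagation step can be sketched as nearest-neighbour label propagation over stream feature vectors. The feature choices and the Euclidean distance below are illustrative assumptions; the actual model under development combines stream and semantic features in a more elaborate similarity metric.

```python
import math

# Labeled repository: stream feature vectors (e.g., message rate, mean
# payload size, periodicity) with known PsDC categories. The features
# and category assignments here are illustrative.
labeled = {
    (1.0, 64.0, 0.9): "Behavior/Social/Interaction",
    (0.1, 512.0, 0.2): "Identifier/Devices/Telemetry",
}

def propagate_label(features, repository):
    """Assign a new stream the category of its nearest labeled stream
    (Euclidean distance over the feature vector)."""
    nearest = min(repository, key=lambda ref: math.dist(features, ref))
    return repository[nearest]

print(propagate_label((0.9, 70.0, 0.8), labeled))  # Behavior/Social/Interaction
```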
It is crucial to acknowledge that while several privacy-enhancing technologies (PETs) exist, with differential privacy being the most prominent, their applicability is often limited. These methods primarily focus on minimizing fingerprinting and fall short when considering broader concerns such as profiling and indirect inference. Research by [
18,
19,
20] has demonstrated that, despite its potential to degrade ML accuracy, the effectiveness of differential privacy is highly dependent on the specific dataset and model. In fact, some models exhibit improved accuracy when trained on anonymized data compared to the raw dataset. Another significant study by [
21] highlights the substantial computational cost associated with differential privacy, particularly for large datasets, often outweighing its privacy-enhancing capabilities.
Furthermore, with the rapid advancements in Large Language Models (LLMs), and their remarkable ability to dissect and navigate complex, multi-dimensional datasets, their potential use for PII inference becomes a tangible concern. While these models are presently optimized for coherent text generation within a request-response framework, the fundamental architecture of LLMs, often built upon variations of the transformer layer, could theoretically be adapted for PII inference given access to diverse data sources. The current unknowns lie in the scale of training and the need for meticulously curated datasets to ensure a high degree of accuracy in this specific application. While it is conceivable to leverage the internal workings of LLMs to implement our categorization method, this approach presents several challenges: firstly, the computational resources and dataset sizes needed for effective performance remain uncertain; secondly, the suitability of pre-trained models for this specific application is questionable; and thirdly, our current resources do not permit such extensive exploration. Furthermore, our developing categorization model uniquely combines stream-based features with textual analysis, offering a distinct analytical perspective while requiring significantly less computational power.
Considering these observations, it is reasonable to conclude that privacy categorization remains an emerging field with immense potential to significantly enhance privacy for the average user.
This approach contributes to various aspects of privacy-conscious software development, including user-centric design interfaces, enhancing the possibility of informed consent for data collection, providing users with greater control over their personal data, autonomous mechanisms for data-breach detection, and finally, considering non-human users that collect individual data, such as smart devices.
4.2. Network Data Communication with PsDC
PsDC applies to current network topologies where data are processed and analyzed closer to the data sources. By appropriately classifying data at the network edge, data processors can prioritize security measures for the most sensitive data categories. Thus, data with a higher risk of privacy breaches receive stronger protection, mitigating potential vulnerabilities. Conversely, low-risk data can benefit from more relaxed security measures, optimizing resource allocation within the smart-home system.
Moreover, because edge data are often diverse and can be combined with external sources, it is essential to comprehend the evolving privacy implications. A dynamic approach enables continuous assessment of data sensitivity at the edge, understanding the privacy requirements for data inference or combinations and implementing appropriate security controls. For instance, with network flow streams, log analysis and data categorization based on various sensitivity levels become crucial to determine when PII is being obtained from non-critical data. By detecting such situations, software can be adapted to avoid this issue, ensuring a high level of compliance and minimizing legal risks. PsDC assists in this by offering a wide range of different PII categories in a structured hierarchy that places the user at the center of the categorization, while the additional tags and attributes introduce a dynamic nature to the categorization.
Continuous monitoring advocates for a dynamic approach to categorizing data to identify when personal data might be inferred from seemingly low-risk data. Systems should be equipped with a module to dynamically capture network traffic, analyze it, categorize data based on PsDC, and then send the completed machine-readable file to the component responsible for quantifying privacy.
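A schematic sketch of such a monitoring module follows. The function names, record format, and categorization rules are hypothetical; real traffic capture would use a dedicated traffic-analysis library, and categorization would apply the full PsDC model rather than the toy rules below.

```python
import json

def categorize_record(record):
    """Toy PsDC categorization of one captured record (hypothetical rules)."""
    if "user" in record:
        return {"field": "user", "category": "Identifier/Devices/Username"}
    return {"field": "payload", "category": "Behavior/Social/Interaction"}

def monitor(records):
    """Categorize captured records and build the machine-readable file
    handed to the component responsible for quantifying privacy."""
    manifests = [categorize_record(r) for r in records]
    return json.dumps(manifests)

# Stand-in for dynamically captured network traffic.
captured = [{"user": "alice"}, {"payload": "chat message"}]
print(monitor(captured))
```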
A concrete scenario where PsDC can be applied is in smart environments, including smart cities, homes, or factories. PsDC can be utilized when integrating a new device into an existing network of interconnected smart devices. Based on the device’s function, the data it collects should be categorized according to PsDC categories. The device should then only be enabled in cases where user privacy is not compromised, leveraging the analysis of the PQ model’s output.
In addition, the categories of data processed by the smart device help select the most appropriate device. In cases where a device has multiple functions but we only need one, for example, we can use privacy-based categories to identify and restrict data storage to only the required data.
In this context, PsDC will be applied during the onboarding of network services within the scope of a European project named RIGOUROUS (for more information, see
https://rigourous.eu/). In this project, we are considering the integration of a machine-readable and human-understandable file as a privacy manifest sent from the services to the network infrastructure. This file should specify the privacy requirements to ensure privacy-based service instantiation.