Article

Proactive Data Categorization for Privacy in DevPrivOps

by
Catarina Silva
1,2,*,†,
João P. Barraca
1,2,† and
Paulo Salvador
1,3,†
1
Departamento de Electrónica, Telecomunicações e Informática, Universidade de Aveiro, 3810-193 Aveiro, Portugal
2
Instituto de Telecomunicações, 3810-164 Aveiro, Portugal
3
Institute of Electronics and Informatics Engineering of Aveiro, Universidade de Aveiro, 3810-193 Aveiro, Portugal
*
Author to whom correspondence should be addressed.
The three authors contributed equally to this work.
Information 2025, 16(3), 185; https://doi.org/10.3390/info16030185
Submission received: 26 December 2024 / Revised: 21 February 2025 / Accepted: 25 February 2025 / Published: 28 February 2025
(This article belongs to the Section Information Security and Privacy)

Abstract:
Assessing privacy within data-driven software is challenging due to its subjective nature and the diverse array of privacy-enhancing technologies. A simplistic personal/non-personal data classification fails to capture the nuances of data specifications and potential privacy vulnerabilities. Robust, privacy-focused data categorization is vital for a deeper understanding of data characteristics and the evaluation of potential privacy risks. We introduce a framework for Privacy-sensitive Data Categorization (PsDC), which accounts for data inference from multiple sources and behavioral analysis. Our approach uses a hierarchical, multi-tiered tree structure, encompassing direct data categorization, dynamic tags, and structural attributes. PsDC is a data-categorization model designed for integration with the DevPrivOps methodology and for use in privacy-quantification models. Our analysis demonstrates its applicability in network-management infrastructure, service and application deployment, and user-centered design interfaces. We illustrate how PsDC can be implemented in these scenarios to mitigate privacy risks. We also highlight the importance of proactively reducing privacy risks by ensuring that developers and users understand the privacy “value” of data.

1. Introduction

The inherent subjectivity surrounding the definition of privacy poses significant challenges to defining and evaluating it within the contemporary digital landscape. Despite widespread privacy breaches impacting users’ lives, many individuals remain unaware of the risks associated with using data-centric, personalized services and applications. Traditional privacy actions, such as closing a door or speaking to a limited group, are inapplicable in the uncontrolled digital environment. In addition, varying levels of computational literacy among individuals further complicate private interaction in the digital world, which is plagued by uninformed consent, opaque data flows, inadequate user interfaces for comprehending privacy risks, complex privacy policies, and even outright neglect of user privacy. Legal frameworks and regulations are essential to establish a minimum level of protection, but alone they can neither guarantee individual privacy nor adapt to its inherent subjectivity.
User-centric design approaches encompass both legal and operational aspects to provide individuals with more control over their privacy. However, while software applications are presented to users as respecting their privacy, integrating robust privacy considerations into the Software Development Life Cycle (SDLC) remains a significant challenge. To address this gap, a privacy-centric software-development methodology, denoted DevPrivOps [1], focuses on using Privacy Quantification (PQ) frameworks for robust privacy compliance throughout the lifecycle. DevPrivOps considers local and distributed software tests, and also accommodates the risk of data combinations that allow private data to be observed from non-private data. Such combinations can result from the interaction between distributed services and user behavior.
A PQ model allows constant monitoring of the services’ level of privacy. It involves processing inputs (e.g., software, systems, or services), assessing privacy indicators, computing privacy levels using mathematical metrics or formal models, and producing a privacy scale (e.g., a color, number, or symbol). Both software developers and individuals benefit from the quantification outcome, as it enables them to make informed privacy decisions.
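The input-to-scale pipeline just described can be sketched as a small function. The indicator names, the averaging rule, and the color thresholds below are illustrative assumptions rather than the actual PQ model:

```python
# Hypothetical sketch of a Privacy Quantification (PQ) pipeline: privacy
# indicators are scored in [0, 1], aggregated, and mapped to a coarse scale.
# Indicator names and thresholds are illustrative assumptions only.

def quantify_privacy(indicators: dict[str, float]) -> str:
    """Aggregate indicator scores (0 = worst, 1 = best) into a privacy scale."""
    if not indicators:
        raise ValueError("at least one indicator is required")
    score = sum(indicators.values()) / len(indicators)
    if score >= 0.8:
        return "green"
    if score >= 0.5:
        return "yellow"
    return "red"

level = quantify_privacy({
    "data_minimization": 0.9,    # only necessary fields collected
    "encryption_at_rest": 1.0,   # storage is ciphered
    "third_party_sharing": 0.4,  # data leaves the service boundary
})
```

A real PQ model would weight indicators and incorporate formal privacy metrics; the sketch only illustrates the shape of the mapping from inputs to a privacy scale.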
Consequently, there is a critical need for more robust methods to assess privacy levels. Beyond the subjective nature of PQ, the vast diversity of data across multi-services and contexts necessitates a standardized definition of data categories. In short, developing robust PQ requires a well-structured definition of data categories, enabling the classification of each data point associated with an entity.
A static data categorization (e.g., personal or non-personal) is not enough to accommodate data diversity and the different ways of obtaining data, including transformations that either increase or decrease data sensitivity. It also brings a certain ambiguity [2] when considering data from multiple sources, processed data (e.g., inferred data), or other data formats (e.g., metadata). In addition, understanding user preferences and external influences is essential. Individuals naturally categorize their personal information digitally [3], but these categorizations vary widely.
Recognizing the need for a broader set of data categories, the possibility of Personally Identifiable Information (PII) inference from non-PII data during service monitoring, and the need to prioritize user preferences (privacy expectations and external influences) in categorization, we define a hierarchical, multidimensional data-categorization framework that places users and their interactions at its core. Data can be statically defined, or dynamic categories can be assigned to data from users, to cope with novel environments. The categorization considers its application in decentralized architectures and multiple data sources where PII can be inferred.
The proposed dynamic and user-centric data categorization, denoted by Privacy-sensitive Data Categorization (PsDC), is based on existing privacy literature, is supported by analysis of real datasets, and considers legal frameworks and privacy standard groups. The primary focus is on providing a dynamic data categorization that represents user behavior, preferences, and different data type representations, with applicability to enhance novel, data-centric services for the upcoming generation of telecommunication architectures [4].
The categorization scheme proposed herein is intended to serve as a cornerstone of our evolving privacy-quantification framework, which is still under development and will be published in the near future. The privacy-quantification model is designed to evaluate the privacy level of continuous data streams, based on the categories defined here, observed user behavioral patterns, and the shifting context of both execution and data application. Furthermore, we are developing a Machine-Learning (ML)-based approach to automate the classification of data according to our categories, integrating both stream and semantic feature analysis. This work is informed by, though notably expands upon, the foundational concepts explored in [5].
The rest of the paper is organized as follows: Section 2 presents the current background and most relevant state-of-the-art in terms of privacy and privacy-based data categories; Section 3 describes the proposed data-categorization framework named PsDC, and its applicability is explored in Section 4; finally, the main conclusions are in Section 5.

2. Background on Privacy and Data Categories

The constant evolution of technology and the changes in society make the interpretation of the term “privacy” complex; it cannot be reduced to a single, simple definition. The high subjectivity of privacy arises from the different notions of what is private and what is public that coexist in society. Subjectivity also arises from how familiar people are with the relevant discussions and from the knowledge that subjects may have regarding the potential value and risks of sharing some data. Thus, privacy encompasses what is appropriate for public view, which can be influenced by several external factors, including cultural aspects. In addition, privacy extends beyond physical spaces and experiences to encompass digital information [6].
The immense potential of personal data collection for various sectors brings privacy concerns, since individual data extend far beyond their personal use. In an attempt to protect individuals’ privacy, regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) aim to achieve a balance between technological innovation and individual control. Fortunately, across the world we are witnessing the adoption and updating of such regulations [7].
Generally, these regulations empower individuals with the adequate rights to protect their data, such as the rights to access, rectify, and delete their information, ensuring transparency and user consent in data collection. Unfortunately, we continue to witness privacy breaches, making it imperative to combine what is known as soft privacy from legal regulations and hard privacy from technical approaches [8].
Concentrating on the hard privacy concept and considering a data-driven world, every interaction on digital platforms generates data that identify individuals, potentially revealing intimate details about their lives if misused. This data can range from sensitive financial information to personal health records, as well as political, religious or sexual orientation. Digital privacy centers on the responsible handling and use of privacy-sensitive data within digital environments. These private data refer to personal information, communication, and online behavior. The idea behind digital privacy is to protect individuals’ rights and expectations to maintain the confidentiality and security of their personal information in the digital realm.
Another dimension of the problem is related to the combination of data and the inference of private information. Revealing knowledge from anonymized personal information unlocks another dimension of its value. By analyzing vast datasets, researchers and businesses can gain valuable insights into consumer behavior, market trends, and social phenomena. This knowledge fuels innovation in areas such as personalized medicine, targeted advertising, and social network analysis. However, ensuring ethical and responsible practices is crucial, guaranteeing user privacy and adherence to data-protection regulations. Moreover, even anonymized data can be vulnerable to privacy breaches through techniques such as clustering analysis, which is a powerful technique for organizing unlabeled data into meaningful groups.

2.1. Protecting Digital Privacy

User-centric approaches have been developed with the intention of creating design interfaces where users interact in a more privacy-conscious manner. In addition, at the software-development level, privacy engineering has emerged as a systematic approach to integrating privacy requirements from the very beginning of the SDLC [1]. The idea behind this field is to empower developers to seamlessly embed privacy-preserving solutions throughout software development. This approach includes techniques, procedures, and methodologies specifically designed to achieve privacy goals. These Privacy Engineering Methodologies (PEMs) also include privacy-based evaluations that consider the impact on various stakeholders, data identification, and annotation of data usage and communication. Privacy modelling and quantification are valid and useful methodologies for understanding privacy risks. In this way, developers have clear guidance and access to rules for conducting privacy-related engineering practices and ensuring broader data-management compliance.
Reactive approaches during the SDLC can lead to vulnerabilities and disregard for user privacy. In contrast, integrating privacy into the early stages of the SDLC offers a multitude of benefits for both end-users and developers. Proactive privacy considerations lead to system designs that are more respectful of user data and inherently less susceptible to privacy breaches. Aligned with this concept of considering privacy-based strategies from the beginning of the SDLC, DevPrivOps [1], based on a PQ model, enables privacy evaluation throughout the entire lifecycle, from local unit tests during development to comprehensive system-wide tests after deployment (see Figure 1). This strategy, often referred to as shift-left testing, ensures that privacy is not an afterthought but rather a core consideration throughout development.
While Privacy-enhancing Technologies (PETs) are actively being developed, their application during the SDLC still has a critical gap when it comes to evaluating their effectiveness. Quantifying the privacy level of services is a promising approach to bridge this gap. The DevPrivOps methodology, based on a PQ model, takes advantage of a mechanism to measure and assess the privacy risks associated with a software system. The privacy level can be quantified during various stages of the SDLC to identify potential privacy leaks, quantify the level of privacy protection or risk offered by the system, and guide developers towards privacy-preserving design choices. Furthermore, this approach considers PQ across multiple services operating together and processing data from various sources. This comprehensive evaluation is crucial in today’s increasingly interconnected software landscapes, where user data often flow across multiple services and systems.

2.2. Understanding Data Privacy Categorization

The current vulnerabilities affecting personal data make it imperative to comprehend the value of data, to evaluate privacy-protection systems and treat data more adequately and effectively.
Different approaches are being proposed in the literature to understand the privacy value of some data. Some of these approaches focus on developing taxonomies, ontologies, and vocabularies that evaluate data types and their relationships based on data categories.
Different ontologies have been proposed to provide a formalized knowledge representation. An ontology acts as a shared vocabulary for entities, concepts, and their relationships within a specific domain; this unified model goes beyond simply assigning attributes to data sources. These ontologies provide a structured representation of privacy requirements, helping to differentiate them from security requirements and prevent privacy violations [9,10]. Privacy ontologies are crucial tools when addressing privacy concerns in various domains. The majority of authors propose privacy ontologies to ensure legal compliance, to simplify the process of understanding privacy policies, or to analyze the practices of data processors that often neglect or even misuse personal data [11].
However, these works are either limited to specific scenarios or adopt a legal basis with more generic perspectives. Gharib et al. [11] reviewed the literature and identified that, while several studies emphasize the importance of integrating privacy considerations throughout the system-design process, a comprehensive ontology capturing all the main privacy concepts and relationships is still lacking. To address this gap, the authors proposed a novel privacy ontology that incorporates the core privacy concepts and relations along the following dimensions: organizational, risk, treatment, and privacy. In their ontology, they classified data as either personal or public; they do not explore the concept of PII beyond personal or non-personal data within a privacy ontology.
A key challenge remains: How do we better define personal data from a technical perspective? A new level of dynamism is required to better understand the level of privacy required for different types or categories of data. Essentially, to protect digital privacy in current systems, we argue for the need to focus on dynamic data categorization, which can handle various data formats, storage methods, and processing techniques, including data inference or derivation. The risk of data inference arises when seemingly benign data elements, whether analyzed independently or through data fusion, enable the derivation of privacy-sensitive information pertaining to an entity.
By assigning privacy categories to the extent of data collected from various sources and types, data can be handled appropriately. Proper personal data categorization enhances transparency, risk assessment, and decision-making regarding appropriate measures, adjusting the required protection level, reducing data collection to only what is necessary for the purpose, and adjusting retention periods. Furthermore, privacy-based categories are essential to an agile PQ model applied in the SDLC, as demonstrated in the DevPrivOps methodology.
These benefits of a data-categorization method have generated widespread interest in the literature for exploring potential categories across different application domains (see Table 1).
Some authors focused on specific scenarios, such as Sun and Xu [12] in social networks, but developed data-categorization models that can be generalized to multiple scenarios. One such example is the work of Oh [3], which focused on how people categorize information files. The study identified three types of personal digital information categorizers, based on three types of mind structures: rigid categorizers, fuzzy categorizers, and flexible categorizers. To gather data, the authors conducted a questionnaire with each of the 18 participants.
Focusing on autonomous vehicles, Mlada et al. [15] evaluated the riskiness of personal data based on four general criteria: the nature of the controller, public availability of data, scope/amount of data, and use of data processors. The result of this evaluation affects the calculation and magnitude of the final value, denoted by R. The main criteria used to determine this value are the evaluation areas, possible outcomes, and an additional evaluation for individual evaluation coefficients. The resulting value is then categorized into four different types of evaluation.
In the field of data classification, Rumbold and Pierscionek [16] focused on big data scenarios and classified data into six different categories, and assigned them a sensitivity scale by proposing a spectrum. However, some studies prefer to generalize the categorization model rather than focusing on specific scenarios. Milne et al. [13] proposed a classification system for personal information based on the perceived risk level associated with different data types. Researchers identified four types of risks—monetary, social, physical, and psychological—and grouped data into these categories. They also collected data that were aggregated into six clusters, allowing them to consider direct and indirect data. Moreover, Oh and Belkin [14] focused on data that are difficult to categorize, such as ambiguous or anomalous information. Ambiguous information can be placed into multiple categories, while anomalous information does not fit into the existing organizational structure.
The studies presented in this context avoid static evaluations, concentrating instead on more dynamic and broad data interpretations, to understand whether it is possible to infer personal data from other, diverse types of data.
Oh and Belkin [14] proposed a detailed study aimed at tackling the challenge of categorizing difficult data. Although the study did not cover all the issues related to this problem, it provided a valuable research direction to explore further. Furthermore, other studies focused on proposing a classification framework while considering various parameters to refine the categorization results.
When considering data categorization, it is important to assess how the proposed approaches were tested by the different studies mentioned. The majority of these studies focused on conducting interviews with real people to understand their preferences. This approach aligns with the concept of privacy, as individuals have the right to decide what information should be protected and what should not.
In addition, there is a lack of practical tools to categorize data. PII-Codex [17], a significant advance in the implementation of a privacy-based categorization framework, is a comprehensive tool designed for the detection, categorization, and severity assessment of PII. It considers extensive theoretical, conceptual, and policy research on PII categorization and severity assessment, and is integrated with PII detection software.
However, the current literature lacks analytical studies that can provide a deeper understanding of the impact of data categorization.

3. Proposed Privacy-Sensitive Data Categorization

After a careful review of a representative set of data-categorization approaches, we found value in evolving the existing categories by considering a broader data-categorization framework, denoted PsDC, that focuses on technical details. In this section, we describe the strategy of this data-categorization approach and the clearly identified data items that are relevant for pervasive, data-centric services. We also aim to present a structure that can be enhanced with other data types, with the capability to evolve and be fitted to novel scenarios and applications.
To better understand how to properly categorize data based on privacy aspects, we manually analyzed the features of several datasets and selected 11 representative ones containing data that potentially reveal individuals’ private information. Referring to existing data categories from the literature, legal regulations, and standards, we found that they could not accommodate all possibilities for data categorization. This is due to the potential for data inference or derivation, even when there is no apparent linkability across datasets. For this reason, we selected a set of existing data categories from the literature, regulations, and standards, and organized them to generically address the needs of such datasets and properly categorize data based on privacy aspects.
Moreover, PsDC puts the data-owning entity at the center of the process, together with its interest in classifying data as a means both to understand impact and to manage risk. Note that the entity can be an individual, a group of people (e.g., a family), or a device (e.g., a car). Focusing on user-centric approaches, we consider the entity here to be an individual acting as a user of today’s permanently available online services. We consider two main categories for user data: data that directly identify a user, and data derived from behavioral attitudes, which include data from social interactions. Moreover, we add a dynamic view by considering data tags and data attributes. This novel approach provides PsDC with an engineering dimension to analyze data and understand their relationships, including the derivation of private data from other data.
The PsDC framework utilizes a hierarchical structure for data categorization with two primary categories. Data that directly identify individuals, such as name, age, or usernames, belong to the broader identifier category. This category subdivides into the human, biological, and devices subcategories. Following the tree structure, each of these subdivides into more specific data categories. Thus, we capture a high-dimensional view of data that directly identify individuals: the human category refers to data that identify an individual directly; biological, intuitively, refers to data from biological sources such as DNA; and devices refers to data from devices directly used by the user.
The other primary category pertains to behavior analysis, which reflects individual actions and interactions. This category encompasses data from social and online interactions, transactional behavior (including shopping patterns or legal activities), digital footprints (such as social media posts), and actions that can be habitual or singular. This category is crucial for capturing data that, while not directly identifying individuals, can still compromise their privacy. Rather than analyzing behavior data and then attempting to categorize them as direct data, we proactively identify behavior data as a broader category to categorize data as they are obtained.
The “identifier” category draws from recognized standards and a thorough review of existing literature. The “behavioral” category, in contrast, offers a novel perspective on data classification, recognizing that behavioral data is not easily captured by conventional, static categorization. Furthermore, we have incorporated tags and attributes to provide a dynamic viewpoint and enhanced control over the data-classification process.
Each category delves further into sub-levels, with the final level representing the most specific data classifications within the hierarchy (see Table 2). This final level offers the most granular detail and allows for a wider range of possible values associated with each data type.
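As a minimal sketch, the hierarchy described above can be represented as a nested mapping. The category names below come from the text, while the node layout and the path-validation helper are illustrative assumptions:

```python
# Minimal sketch of the PsDC hierarchical categorization tree. Category paths
# such as "Identifier/Human/Subject" appear in the text; the exact node layout
# and the helper API are illustrative assumptions.

PSDC_TREE = {
    "Identifier": {
        "Human": ["Subject"],        # e.g., name, age, usernames
        "Biological": ["DNA"],       # data from biological sources
        "Devices": [],               # devices directly used by the user
    },
    "Behavior": {
        "Social": [],                # social and online interactions
        "Transactional": [],         # shopping patterns, legal activities
        "DigitalFootprint": [],      # e.g., social media posts
    },
}

def is_valid_path(path: str) -> bool:
    """Check that a 'Top/Sub/Leaf' path exists in the tree."""
    parts = path.split("/")
    if parts[0] not in PSDC_TREE:
        return False
    if len(parts) == 1:
        return True
    sub = PSDC_TREE[parts[0]]
    if parts[1] not in sub:
        return False
    if len(parts) == 2:
        return True
    return parts[2] in sub[parts[1]]
```

In practice, the final level would enumerate the full set of specific categories from Table 2; the sketch only shows how the tree structure supports path-based lookups.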
Data itself is not the only factor influencing the potential to identify an individual. We argue for the necessity to integrate dynamic evaluations, to achieve a deeper understanding of the relationship between data points, and to facilitate the detection of potential inferences of personal data. For this reason, the framework also considers data tags (see Table 3) and data attributes (see Table 4). Tags provide an extra dimension to data categorization, allowing for the identification of specific contexts, for instance, health-related or sensor-generated data. Attributes, conversely, delineate data characteristics such as their format or stage of development.
Regarding data tags, our proposed data-categorization framework considers dynamic tags for categorizing data. These tags can be assigned directly to data, acting as a categorization in themselves, or they can complement the categorization provided by the tree described above. Data tags result from the diverse ways of obtaining data, and from different data representations and interpretations.
Data tags cover health sources, which can be medical information or wellbeing data, as well as skills, payment types, frequency, sensing, spatial, temporal, logical, and statistical data.
The medical tag encompasses anything related to an individual’s diagnostics, symptoms, or public health situations such as a pandemic. Wellbeing tags encompass data from non-medical sources that are nonetheless related to health aspects, such as blood pressure captured by a sensing device; such data are typically denoted well-being data.
Furthermore, the temporal and spatial tags account for the high potential of inferring private data from non-private data. The different representations of temporal and spatial data are not addressed by the remaining categorizations. In addition, there is a critical need to understand these representations further during pattern recognition, given the huge set of possible representations for each.
Data attributes further refine the categorization process, enabling a deeper understanding of data characteristics and their potential to reveal personal information. This includes aspects such as data storage format (organization, ciphered, or plaintext), maturity, metadata, direct or indirect data, and the presence of URIs or descriptive texts.
These attributes provide an additional layer of data categorization, giving extra insight into the “value” of data. Whether data are plaintext or encrypted significantly impacts observability and inference possibilities: plaintext data are readily accessible for analysis, whereas encrypted data require decryption before analysis. Similarly, anonymized data offer a certain level of protection, but they still carry associated privacy risks, such as the risk of re-identification.
Furthermore, metadata associated with data can be categorized based on their origin (social, technical, operational, or business). This metadata categorization provides an additional layer of information that can contribute to inferences about personal data. These formats further refine data categorization and highlight potential privacy concerns associated with different data types.
Finally, another attribute that should be highlighted is the string. While the remaining categories, tags, and attributes refer to the data themselves, this attribute subdivides into three subattributes that can be processed to obtain more data and enable subsequent categorization. A simple example is when considering the name “Catarina”, which is categorized as “Identifier/Human/Subject”. However, if the dataset field contains “My name is Catarina”, we can apply the categorization with the attribute “String/Description_Text” since it also reveals private data, allowing us to then identify the name within the string.
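The “Catarina” example above can be sketched as a tiny classifier. The regular expression and the rule for bare capitalized tokens are hypothetical simplifications of the actual categorization process, not the paper’s implementation:

```python
import re

# Illustrative sketch of the "String/Description_Text" attribute: a free-text
# field is first tagged as descriptive text, then processed to extract an
# identifier that receives its own categorization. The regex and the rules
# below are assumptions for demonstration only.

def categorize(value: str) -> list[str]:
    categories = []
    if re.search(r"[Mm]y name is (\w+)", value):
        # the field reveals a name embedded in prose
        categories.append("String/Description_Text")
        categories.append("Identifier/Human/Subject")
    elif value.isalpha() and value.istitle():
        # a bare capitalized token is treated as a direct identifier
        categories.append("Identifier/Human/Subject")
    return categories
```

For instance, `categorize("Catarina")` yields only the direct identifier category, while `categorize("My name is Catarina")` additionally carries the string attribute, matching the example in the text.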
Descriptions for all categories are detailed in Table 2 for the most specific categories of the broad categories “identifier” and “behavior”; Table 3 for the data tags for the categorization model; and Table 4 for the data attributes for the categorization model.
Despite the advancements in data privacy categorization brought by our proposed framework, we found value in adding one more perspective: individuals’ privacy expectations and privacy influences. When individuals are the data-generating entity in the proposed framework, privacy preferences and influences are crucial due to the high subjectivity associated with the meaning of privacy and with privacy decisions.
Privacy preferences and influence properties within PsDC provide our framework with high dynamism through the use of different weights for different data categories. Privacy is inherently subjective, with users possessing diverse notions of what constitutes personal data. PsDC ensures that individuals can define their own privacy preferences, including expectations and influence.
This user-centric approach allows individuals to assign weights to different data categories, reflecting their personal perception of a privacy violation (perceived risk). For instance, a user might consider biometric data to be more sensitive than browsing history, and would assign a higher weight to the “biometric data” category within the framework. These weights can then be used to dynamically calculate the overall privacy level associated with data access or processing activities.
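A minimal sketch of this weighting scheme follows; the default weight for unrated categories and the take-the-maximum aggregation rule are illustrative assumptions, not the paper’s formal model:

```python
# Sketch of user-assigned category weights feeding a dynamic privacy level.
# Weight values, the default of 0.5, and the max-aggregation rule are
# illustrative assumptions for demonstration.

user_weights = {
    "Identifier/Biometric": 1.0,        # user deems biometrics most sensitive
    "Behavior/BrowsingHistory": 0.3,    # browsing history matters less to them
}

def access_risk(categories_touched: list[str]) -> float:
    """Risk of a data access = highest weight among the categories it touches."""
    return max((user_weights.get(c, 0.5) for c in categories_touched),
               default=0.0)

risk = access_risk(["Identifier/Biometric", "Behavior/BrowsingHistory"])
```

Taking the maximum reflects a conservative stance: an access is as risky as the most sensitive category it touches. Other aggregations (weighted sums, formal metrics) would fit the same interface.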
Finally, the PsDC, presented as a simplified data-categorization model, is structurally hierarchical rather than a single mathematical function. However, we are developing two functional models: one for automated data classification using PsDC, employing semantic and stream features to cluster and categorize data, particularly effective in IoT contexts; and another for privacy quantification, which utilizes PsDC categories, sensitivity levels, and behavioral data to estimate privacy risks within data streams. These ongoing developments aim to provide a more dynamic and quantifiable approach to data privacy.

4. Applying PsDC in DevPrivOps Use Cases

DevPrivOps [1] is a novel software-development methodology that treats privacy as a primary requirement rather than an afterthought. Achieving this requires tools that help developers produce privacy-enhanced applications, but also means that users can track and understand how their data are gathered, processed, and used.
PsDC can be understood as a machine-readable and human-understandable file containing mandatory privacy fields, such as data type, tags, attributes, and associated risk, plus personalized fields for user preferences. This file can serve as input for privacy quantification (PQ), evaluating whether the software meets the required privacy standards.
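Such a file could be produced as follows. The schema below (field names, category-path encoding, and the JSON serialization) is our own assumption for illustration, not a finalized PsDC format:

```python
import json

# Illustrative sketch of one machine-readable PsDC record. The fields
# mirror the ones described in the text (data type, tags, attributes,
# risk, user preferences); the exact layout is hypothetical.

def psdc_record(category_path, tags, attributes, risk, preferences=None):
    return {
        "category": "/".join(category_path),   # e.g. Identifier/Human/Online
        "tags": sorted(tags),
        "attributes": attributes,
        "risk": risk,            # sensitivity set by the user or specialist
        "preferences": preferences or {},
    }

record = psdc_record(
    ["Identifier", "Human", "Online"],
    tags={"Temporal/Timestamp"},
    attributes={"Format": "Plaintext", "Maturity": "Raw"},
    risk="high",
)
manifest = json.dumps(record, indent=2)  # file handed to the PQ component
```

A PQ component would consume many such records per application and aggregate them into a privacy assessment.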
Recognizing the inherently personal nature of privacy, we refrain from assigning fixed sensitivity levels to categories. Instead, we offer a framework, the PsDC, that empowers users or privacy specialists to define the appropriate sensitivity level for each category. We are, in fact, exploring the use of ML models to estimate the sensitivity of each category based on a user’s historical behavior. However, this aspect of our work is currently in its developmental stages.
As detailed in Section 2, the proposed categories were designed to serve as inputs for the privacy-quantification component. As mentioned in Section 1, an automated, ML-based categorization approach is under active development. It draws on [5] and combines stream and semantic feature analysis to build a similarity metric for data streams. Leveraging this similarity, we cluster previously uncategorized data against our curated, labeled data repository and then propagate categories and labels from established data streams to new ones. The work in [5] demonstrates promising results for IoT stream categorization, and we anticipate analogous results in our domain, given the shared theoretical underpinnings of the models.
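The propagation step can be sketched as a nearest-neighbour assignment over feature vectors. The feature vectors, the cosine-similarity choice, and the single-neighbour rule below are simplifying assumptions of ours; the model under development combines stream and semantic features in a richer metric:

```python
import math

# Minimal sketch: assign an uncategorized stream the category of the
# most similar labelled stream, using cosine similarity over
# (hypothetical) combined stream/semantic feature vectors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def propagate_category(new_stream, labelled_streams):
    """Return the category of the labelled stream most similar to new_stream."""
    best = max(labelled_streams, key=lambda s: cosine(new_stream, s["features"]))
    return best["category"]

labelled = [
    {"features": [1.0, 0.1, 0.0], "category": "Behavioral/Transactional"},
    {"features": [0.0, 0.2, 1.0], "category": "Identifier/Devices"},
]
cat = propagate_category([0.9, 0.2, 0.1], labelled)
```

In practice, clustering would precede propagation so that labels spread to whole clusters of similar streams rather than to single points.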
It is crucial to acknowledge that while several privacy-enhancing technologies (PETs) exist, with differential privacy being the most prominent, their applicability is often limited. These methods primarily focus on minimizing fingerprinting and fall short on broader concerns such as profiling and indirect inference. Research in [18,19,20] has shown that, although differential privacy can degrade ML accuracy, its effectiveness is highly dependent on the specific dataset and model; indeed, some models exhibit improved accuracy when trained on anonymized data rather than the raw dataset. Another significant study [21] highlights the substantial computational cost of such anonymization, particularly for large datasets, often outweighing its privacy-enhancing capabilities.
Furthermore, with the rapid advancement of Large Language Models (LLMs) and their remarkable ability to dissect complex, multi-dimensional datasets, their potential use for PII inference becomes a tangible concern. While these models are presently optimized for coherent text generation in a request-response framework, the underlying architecture, typically built on variations of the transformer layer, could in principle be adapted for PII inference given access to diverse data sources. The open question is the scale of training and the need for meticulously curated datasets to reach high accuracy in this specific application. While it is conceivable to leverage LLMs to implement our categorization method, this approach presents several challenges: first, the computational resources and dataset sizes needed for effective performance remain uncertain; second, the suitability of pre-trained models for this specific application is questionable; and third, our current resources do not permit such extensive exploration. Our developing categorization model instead combines stream-based features with textual analysis, offering a distinct analytical perspective while requiring significantly less computational power.
Considering these observations, it is reasonable to conclude that privacy categorization remains an emerging field with immense potential to significantly enhance privacy for the average user.
This approach contributes to several aspects of privacy-conscious software development: user-centric design interfaces, better-informed consent for data collection, greater user control over personal data, autonomous mechanisms for data-breach detection, and support for non-human users that collect individual data, such as smart devices.

4.1. Privacy Quantification with PsDC in DevPrivOps

Employing a data-categorization framework, researchers, software developers, and privacy engineers gain a deeper understanding of data-sensitivity levels across diverse scenarios. This enhanced understanding empowers the development of targeted privacy-preserving measures tailored to specific data categories and potential privacy risks. This targeted approach optimizes resource allocation and ensures that sensitive data receive the necessary level of protection.
PsDC is fundamental to PQ within the context of DevPrivOps. On the left (development) side of the DevPrivOps lifecycle, PsDC enables the assignment of different weights to data, which can then be considered during privacy calculations. The left side benefits further when multiple data sources are combined, since PII can be inferred from the combination. By using a dynamic, standardized, broader set of data categories, we can provide a more accurate categorization that accounts for the dynamic nature of data, allowing proactive detection of inferred PII during PQ.
In addition, on the right (operations) side of DevPrivOps, we can also take advantage of PsDC in microservice architectures, where services are distributed, deployed, and interact with one another. A PQ framework can calculate the possibility of a privacy breach when considering a service in isolation, and communication between services can then be restricted based on the quantification output. In other words, the PQ output can act as a threshold that gates service-to-service communication.
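The gating idea can be sketched as follows. The risk-combination rule and the threshold value are hypothetical placeholders standing in for a real PQ model, chosen only so that joining two services never lowers the risk:

```python
# Sketch: inter-service communication is allowed only while the
# quantified privacy risk of the combined data exchange stays below a
# policy threshold. Risk values would come from the PQ model; here they
# are plain numbers in [0, 1].

POLICY_THRESHOLD = 0.7

def combined_risk(service_a_risk, service_b_risk):
    # Naive monotone combination: risk can only grow when data sources
    # are joined, reflecting possible inference across services.
    return min(1.0, service_a_risk + service_b_risk * (1 - service_a_risk))

def may_communicate(service_a_risk, service_b_risk, threshold=POLICY_THRESHOLD):
    return combined_risk(service_a_risk, service_b_risk) < threshold

allowed = may_communicate(0.3, 0.2)   # 0.3 + 0.2 * 0.7 = 0.44 < 0.7
blocked = may_communicate(0.6, 0.5)   # 0.6 + 0.5 * 0.4 = 0.80 >= 0.7
```

An orchestrator could evaluate this check before wiring two services together, refusing the connection when the combined risk crosses the policy line.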
Research on the relationship between PsDC and potential privacy risks is crucial to better determine the PQ result. Associating each privacy-based category with its potential risks increases the likelihood of effective risk mitigation. Moreover, a clear understanding of the privacy risks tied to a privacy-based data type when a breach occurs helps entities and organizations respond effectively and minimize the impact. Note that privacy is a highly subjective concept, which makes it imperative not to assign static sensitivity levels to categories but instead to provide the framework (in this case, PsDC) for users or privacy specialists to assign the privacy level used in privacy quantification.
Finally, it is also possible to consider using PsDC during PQ, not only as a measure of privacy implications but also as a pre-evaluation for system privacy requirements. Proactively quantifying the privacy level helps developers and engineers to continue their tasks with a greater awareness of privacy.

4.2. Network Data Communication with PsDC

PsDC applies to current network topologies where data are processed and analyzed close to their sources. By appropriately classifying data at the network edge, data processors can prioritize security measures for the most sensitive data categories: where the risk of a privacy breach is higher, stronger protection mitigates potential vulnerabilities. Conversely, low-risk data can receive more relaxed security measures, optimizing resource allocation, for example within a smart-home system.
Moreover, because edge data are often diverse and can be combined with external sources, it is essential to comprehend the evolving privacy implications. A dynamic approach enables continuous assessment of data sensitivity at the edge, understanding the privacy requirements of data inference or combination and implementing appropriate security controls. For instance, with network flow streams, log analysis and data categorization across sensitivity levels become crucial to determine when PII is being obtained from non-critical data. By detecting such situations, software can be adapted to avoid the issue, ensuring a high level of compliance and minimizing legal risks. PsDC assists here by offering a wide range of PII categories in a structured hierarchy that places the user at the center of the categorization, while the additional data types and subtypes give the categorization its dynamic nature.
Continuous monitoring calls for a dynamic approach to categorizing data, so as to identify when personal data might be inferred from seemingly low-risk data. Systems should be equipped with a module that dynamically captures network traffic, analyzes it, categorizes the data based on PsDC, and then sends the completed machine-readable file to the component responsible for quantifying privacy.
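Such a module could be structured as the small pipeline below. The rule-based classifier and the flow-record fields (`dst_port`, `user_id`) are purely illustrative stand-ins for a real PsDC categorizer operating on captured traffic:

```python
import json

# Hypothetical monitoring pipeline: flow records come in, each is
# categorized with a placeholder PsDC classifier, and the resulting
# machine-readable file is handed to the PQ component.

def categorize_flow(flow):
    # Toy rule: a flow carrying a user identifier is treated as an
    # online human identifier; everything else as online behavior.
    if flow.get("dst_port") == 443 and "user_id" in flow:
        return "Identifier/Human/Online"
    return "Behavioral/DigitalFootprint/OnlineBehavior"

def build_pq_input(flows):
    records = [{"flow": f, "category": categorize_flow(f)} for f in flows]
    return json.dumps(records)  # machine-readable file forwarded to PQ

payload = build_pq_input([
    {"dst_port": 443, "user_id": "u42"},
    {"dst_port": 53},
])
```

In deployment, the capture step would feed live traffic into this loop and the PQ component would consume the emitted file continuously.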
A concrete scenario where PsDC can be applied is in smart environments, including smart cities, homes, or factories. PsDC can be utilized when integrating a new device into an existing network of interconnected smart devices. Based on the device’s function, the data it collects should be categorized according to PsDC categories. The device should then only be enabled in cases where user privacy is not compromised, leveraging the analysis of the PQ model’s output.
In addition, the categories of data processed by the smart device help select the most appropriate device. In cases where a device has multiple functions but we only need one, for example, we can use privacy-based categories to identify and restrict data storage to only the required data.
In this context, PsDC will be applied during the onboarding of network services within the scope of a European project named RIGOUROUS (for more information, see https://rigourous.eu/). In this project, we are considering the integration of a machine-readable and human-understandable file as a privacy manifest sent from the services to the network infrastructure. This file should specify the privacy requirements to ensure a privacy-based service instantiation.

4.3. Human-Centric Design Interfaces with PsDC

A data-categorization framework with easily understandable data categories can help users comprehend and select their preferences regarding data processing and analysis, based on the data categories and their sensitivity. User-friendly interfaces that allow users to intuitively understand the categories of data collected and express their preferences help address the lack of data-flow transparency. Moreover, by focusing on proper data categorization based on PsDC, users need not rely solely on extensive, hard-to-understand privacy policies, increasing the occurrence of informed consent.
To achieve this, we first need to conduct a user research study to understand how to effectively depict the information about the data collected and their corresponding categories. It is crucial to present data categories, types, and subtypes in a simple and intuitive manner that any user can comprehend to understand the privacy risk. However, this is not the sole focus. After data categorization, we intend to produce a value as a result of PQ. The user research should also explore the most suitable way to present this result. A possible option is a privacy scale (based on numbers, colors, or letters), but this should be further investigated by gathering opinions from diverse users to incorporate their feedback.
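One candidate presentation, pending the user study, is a banded scale mapping the numeric PQ result to letters and colors. The band boundaries and labels below are arbitrary placeholders of ours, not validated choices:

```python
# Sketch: map a numeric PQ result in [0, 1] onto a simple user-facing
# scale combining letters and colors, one of the options discussed.

SCALE = [
    (0.25, "A", "green"),    # low privacy risk
    (0.50, "B", "yellow"),
    (0.75, "C", "orange"),
    (1.01, "D", "red"),      # high privacy risk
]

def to_scale(pq_value):
    """Return the (letter, color) band for a PQ value."""
    for upper, letter, color in SCALE:
        if pq_value < upper:
            return letter, color
    return "D", "red"        # values at or above 1.0

label, color = to_scale(0.42)
```

User research would then test whether letters, colors, numbers, or some combination best conveys the privacy risk to diverse users.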
Furthermore, PsDC is also useful when the user is not an individual, like a car or a device. In such cases, the system should have a suitable module to process data categorization. A machine-readable file is essential for this process. Additionally, automatic mechanisms can be employed to dynamically identify data breaches, and PQ can produce output representing the privacy violation.

5. Conclusions

This work introduces a novel data-categorization model, named PsDC, designed to align with DevPrivOps principles and facilitate privacy-quantification models. Data categorization is fundamental for implementing robust data privacy measures. Our proposed categories adopt a dynamic approach, accounting for data inference from processed data as well as user expectations and preferences. Our model also offers a broader spectrum of categories than those found in legal frameworks. This avoids the static, generic categorizations common in legal contexts, which often rely on a binary division between non-personal and personal data.
Integrating PsDC into the PQ stages of the DevPrivOps methodology offers significant advantages by elucidating the “value” of data derived from user behavior, interactions, and preferences. This allows for the proactive identification of PII as it emerges from non-PII data. Furthermore, this standardized approach promotes consistent decision-making regarding data-handling practices across diverse departments and teams, minimizing the risk of data misclassification.
The PsDC framework emphasizes a simple, intuitive design and nomenclature to ensure ease of understanding and implementation for data owners, users, and software developers. Future work will involve applying PsDC to the use cases described. We are also actively developing two complementary models: one for the automated categorization of raw data using PsDC, and another for PQ that leverages these categories to accurately assess the privacy level of data streams. We anticipate publishing these models in a subsequent publication.

Author Contributions

Conceptualization, C.S.; methodology, C.S.; validation, C.S., J.P.B. and P.S.; formal analysis, C.S.; investigation, C.S.; writing—original draft preparation, C.S.; writing—review and editing, C.S., J.P.B. and P.S.; visualization, C.S.; supervision, J.P.B. and P.S.; project administration, J.P.B.; funding acquisition, J.P.B. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by FCT-Fundação para a Ciência e Tecnologia, I.P. by project reference UIDB/50008, and DOI identifier 10.54499/UIDB/50008. This work has received funding from the European Union’s Horizon Europe research and innovation programme under grant agreement No 101095933 (RIGOUROUS project).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Silva, C.; Cunha, V.A.; Barraca, J.P.; Salvador, P. Privacy-Based Deployments: The Role of DevPrivOps in 6G Mobile Networks. IEEE Commun. Mag. 2024, 62, 66–72.
2. Malkin, N. Contextual Integrity, Explained: A More Usable Privacy Definition. IEEE Secur. Priv. 2023, 21, 58–65.
3. Oh, K.E. Types of personal information categorization: Rigid, fuzzy, and flexible. J. Assoc. Inf. Sci. Technol. 2017, 68, 1491–1504.
4. RIGOUROUS. Design Plan of the Multi-Domain Automated Security Orchestration, Trust-Management, and Deployment; UWS: Paisley, Scotland, 2023.
5. Antunes, M.; Gomes, D.; Aguiar, R.L. Towards IoT data classification through semantic features. Future Gener. Comput. Syst. 2018, 86, 792–798.
6. Mason, S. Exploring Privacy from a Philosophical Perspective: Conceptual and Normative Dimensions. In Human Privacy in Virtual and Physical Worlds: Multidisciplinary Perspectives; Springer Nature: Cham, Switzerland, 2024; pp. 23–45.
7. Piper, D. Data Protection Laws of the World. 2025. Available online: https://www.dlapiperdataprotection.com/ (accessed on 6 December 2024).
8. Hoepman, J.H. Privacy Is Hard and Seven Other Myths: Achieving Privacy Through Careful Design; MIT Press: Cambridge, MA, USA, 2021.
9. Gharib, M.; Giorgini, P.; Mylopoulos, J. An ontology for privacy requirements via a systematic literature review. J. Data Semant. 2020, 9, 123–149.
10. Pandit, H.J.; Esteves, B.; Krog, G.P.; Ryan, P.; Golpayegani, D.; Flake, J. Data Privacy Vocabulary (DPV)—Version 2. In Proceedings of the 23rd International Semantic Web Conference (ISWC 2024), Baltimore, MD, USA, 11–15 November 2024; Volume 1, pp. 1–24.
11. Gharib, M.; Mylopoulos, J.; Giorgini, P. COPri-A core ontology for privacy requirements engineering. In Proceedings of the Research Challenges in Information Science: 14th International Conference, RCIS 2020, Limassol, Cyprus, 23–25 September 2020; Proceedings 14. Springer: Berlin/Heidelberg, Germany, 2020; pp. 472–489.
12. Sun, Q.; Xu, Y. Research on Privacy Concerns of Social Network Users. In Proceedings of the ICCC, Prague, Czech Republic, 16–20 September 2019; pp. 1453–1460.
13. Milne, G.R.; Pettinico, G.; Hajjat, F.M.; Markos, E. Information sensitivity typology: Mapping the degree and type of risk consumers perceive in personal data sharing. J. Consum. Aff. 2017, 51, 133–161.
14. Oh, K.E.; Belkin, N.J. Understanding what personal information items make categorization difficult. Proc. Am. Soc. Inf. Sci. Technol. 2014, 51, 1–3.
15. Mlada, M.; Holý, R.; Jirovský, J.; Kasalický, T. Protection of personal data in autonomous vehicles and its data categorization. In Proceedings of the SCSP, Prague, Czech Republic, 26–27 May 2022; pp. 1–5.
16. Rumbold, J.M.; Pierscionek, B.K. What are data? A categorization of the data sensitivity spectrum. Big Data Res. 2018, 12, 49–59.
17. Rosado, E.J. PII-Codex: A Python library for PII detection, categorization, and severity assessment. J. Open Source Softw. 2023, 8, 5402.
18. Slijepčević, D.; Henzl, M.; Klausner, L.D.; Dam, T.; Kieseberg, P.; Zeppelzauer, M. k-Anonymity in practice: How generalisation and suppression affect machine learning classifiers. Comput. Secur. 2021, 111, 102488.
19. Díaz, J.S.P.; García, A.L. Comparison of machine learning models applied on anonymized data with different techniques. In Proceedings of the 2023 IEEE International Conference on Cyber Security and Resilience (CSR), Venice, Italy, 31 July–2 August 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 618–623.
20. Silva, C.; Barraca, J.P.; Salvador, P. Evaluating the Effectiveness of Differential Privacy Against Profiling (Submitted). In Proceedings of the 2025 The International Conference on Consumer Technology (ICCT-Europe 2025), Algarve, Portugal, 28–30 April 2025.
21. Oprescu, A.; Misdorp, S.; van Elsen, K. Energy cost and accuracy impact of k-anonymity. In Proceedings of the 2022 International Conference on ICT for Sustainability (ICT4S), Plovdiv, Bulgaria, 13–17 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 65–76.
Figure 1. DevPrivOps Lifecycle (adapted from [1]).
Table 1. State-of-the-art on data privacy categories.

Ref. | Applicability | Sensitivity Labels
[12] | Social Networks | Personal attributes; Behavioral attributes; Social attributes
[13] | General | Financial information; Basic demographics; Personal preferences; Contact information; Secure identifiers; Community interactions
[14] | General | Ambiguous; Anomalous
[3] | Information files | Rigid; Fuzzy; Flexible structures
[15] | Autonomous vehicle | No significant information; Some significant information; Potential identifications of persons; Direct identification of persons
[16] | Big Data | Non-personal data; Human–machine interactions; Human demographics, behavior, thoughts and opinions; Readily apparent human characteristics (unprotected); Readily apparent human characteristics (protected); Medical or healthcare data
Table 2. Data-categorization model based on identifier and broader behavior categories.

Level 0 | Level 1 | Level 2 | Description | Example
Identifier | Human | Online | Identifies individuals using online identifiers | Usernames, email addresses
Identifier | Human | Object | Identifies individuals through objects associated with them | Vehicles, clothing
Identifier | Human | Subject | Identifies individuals based on their data | Name, age
Identifier | Human | Criminal record | Identifies individuals based on their criminal history | Arrest records, conviction history
Identifier | Human | Gender | Identifies individuals based on their gender | Male, female, non-binary
Identifier | Biological | Biometric | Identifies individuals based on unique biological traits | Fingerprints, facial recognition
Identifier | Biological | Genetic | Identifies individuals based on their genetic information | DNA sequences, genetic markers
Identifier | Devices | Address | Identifies devices based on their location | IP address, MAC address
Identifier | Devices | Characteristics | Identifies devices based on their technical specifications | Operating system, hardware configuration
Behavioral | Interactions | Connections | Identifies individuals based on their social networks and relationships | Friends, family, colleagues
Behavioral | Interactions | Communication | Identifies individuals based on their communication interactions | Messages, phone calls
Behavioral | Interactions | Collaboration | Identifies individuals based on their collaborative activities | Joint projects, shared documents
Behavioral | Interactions | Relationships | Identifies individuals based on their personal and professional relationships | Relationships, friendships
Behavioral | Interactions | Beliefs | Identifies individuals based on their beliefs and values | Political views, religious beliefs
Behavioral | Transactional | Shopping | Identifies individuals based on their purchasing behavior | Purchase history, shopping cart data
Behavioral | Transactional | Financial | Identifies individuals based on their financial transactions | Bank transactions, credit card usage
Behavioral | Transactional | Legal | Identifies individuals based on their legal activities | Contracts, legal documents
Behavioral | Digital Footprint | Content Creation | Identifies individuals based on the content they create online | Blog posts, social media posts
Behavioral | Digital Footprint | Content Consumption | Identifies individuals based on the content they consume online | Website visits, video views
Behavioral | Digital Footprint | Online Behavior | Identifies individuals based on their online activities | Likes, shares, comments
Behavioral | Actions | Single | Identifies individuals based on their individual actions | Clicking a link, making a purchase
Behavioral | Actions | Habit | Identifies individuals based on their repeated actions | Daily routines, frequent purchases
Table 3. Data tags for data-categorization model.

Level 0 | Level 1 | Level 2 | Description | Example
Medical | Diagnosis | Physical | Physical health condition | Heart disease, diabetes, broken bone
Medical | Diagnosis | Mental | Mental health condition | Depression, anxiety, bipolar disorder
Medical | Diagnosis | Healthy | General health state | Physically fit, mentally sound
Medical | Public | | Public health concerns | Infectious diseases, pandemic
Medical | Symptoms | | Physical or mental symptoms | Fever, cough, fatigue, insomnia
Wellbeing | Physical | | Overall physical health and condition | Blood pressure, weight, cholesterol levels
Wellbeing | Mental | | Cognitive and emotional well-being | Stress levels, mood swings, cognitive function
Wellbeing | Emotional | | Ability to understand, manage, and express emotions effectively | Empathy, self-awareness
Wellbeing | Social | | Quality of relationships | Social interaction frequency, community involvement
Skills | Technical | | Technical abilities | Programming, data analysis
Skills | Soft | | Interpersonal and communication skills | Leadership, teamwork, problem-solving
Skills | Domain Specific | | Industry-specific knowledge | Medical knowledge, legal knowledge, financial knowledge
Payment | Financial | | Monetary transactions | Credit card payments, bank transfers, salary
Payment | Non-Financial | | Non-monetary transactions | Vouchers, miles
Frequency | Quantitative | | Numerical frequency | Number of times per day, week, month
Frequency | Qualitative | | Descriptive frequency | Often, sometimes, rarely
Frequency | History | | Past occurrences | Previous purchases, past medical records
Related | Absolute | | Independent value | Weight, height, temperature
Related | Relative | | Comparative value | Percentage change, ratio, rate
Sensing | Human | | Human-generated data | Interviews, user feedback, exercise activity
Sensing | Environmental | | Environmentally generated data | Weather data, air quality, noise levels
Sensing | Surveillance | | Surveillance-generated data | CCTV footage, traffic cameras
Spatial | Locality | | Geographic location | City, country, GPS coordinates
Spatial | Relative | | Relative spatial relationships | Near, far, above, below
Spatial | Absolute | | Precise spatial coordinates | Latitude, longitude, altitude
Spatial | Online | | Digital location | Website URL, social media profile
Temporal | Timestamp | | Exact point in time | 22/11/2023 12:34
Temporal | Interval | | Duration of time | 1 h, 2 days, 3 weeks
Temporal | Numeric | | Numerical representation of time | Age, time in seconds, milliseconds
Temporal | Date | | Calendar date | 22/11/2023
Temporal | Year | | Calendar year | 2023, 2024
Temporal | Time | | Time of day | 12:34:56
Temporal | Time with time zone | | Time with time zone information | 2023-11-22 12:34:56 CET
Logical | Boolean | | Binary values | Yes/No, True/False, 1/0
Statistics | | | Statistical measures | Mean, median, mode, standard deviation
Table 4. Data attributes for data-categorization model.

Level 0 | Level 1 | Description | Example
Organization | Structured | Data with a predefined format | CSV, Excel, SQL databases
Organization | Semi-structured | Data with some structure but not strictly defined | XML, JSON
Organization | Unstructured | Data without a predefined format | Text documents, images, audio, video
Format | Ciphered | Encrypted data | Encrypted emails, session keys
Format | Plaintext | Unencrypted data | Text documents, CSV files
Maturity | Anonymized | Data with personal information removed | Anonymized surveys, census data
Maturity | Raw | Unprocessed data | Sensor data, log files
Maturity | Pseudonymized | Data with personal information replaced with identifiers | Pseudonymized medical records
Metadata | Social | Data about the social context of data | Author, publisher, copyright, photo data
Metadata | Technical | Data about the technical characteristics of data | File format, encoding, compression
Metadata | Operational | Data about the operational context of data | Creation date, modification date, access permissions
Metadata | Business | Data about the business context of data | Data owner, data usage policies
String | URI | Textual data representing a URI | URLs, hyperlinks
String | Description_Text | Textual data describing something | Product descriptions, customer reviews
String | Number | Numerical data | Sales figures, age, temperature

