1. Introduction
The inherent subjectivity of privacy poses significant challenges to defining and evaluating it within the contemporary digital landscape. Despite widespread privacy breaches impacting users’ lives, many individuals remain unaware of the risks associated with using data-centric, personalized services and applications. Traditional privacy actions, such as closing a door or talking to a limited group, are inapplicable in the uncontrolled digital environment. In addition, varying levels of computational literacy among individuals further complicate private interaction in the digital world. The digital world is plagued with uninformed consent, opaque data flows, inadequate user interfaces for comprehending privacy risks, complex privacy policies, and even outright neglect of user privacy. Legal frameworks and regulation are essential to create a minimum level of protection, but they alone cannot guarantee individual privacy or adapt to the inherent subjectivity of privacy.
User-centric design approaches encompass both legal and operational aspects to provide individuals with more control over their privacy. However, while software applications are presented to users as respecting their privacy, integrating robust privacy considerations into the Software Development Life Cycle (SDLC) remains a significant challenge. To address this gap, a privacy-centric software-development methodology, denoted DevPrivOps [
1], focuses on using Privacy Quantification (PQ) frameworks for robust privacy compliance throughout the lifecycle. DevPrivOps considers local and distributed software tests, also accommodating the risk of data combinations that allow private data to be observed from non-private data. Such combinations can result from the interaction between distributed services and user behavior.
A PQ model allows constant monitoring of the services’ level of privacy. It involves processing inputs (e.g., software, systems, or services), assessing privacy indicators, computing privacy levels using mathematical metrics or formal models, and producing a privacy scale (e.g., a color, number, or symbol). Both software developers and individuals benefit from the quantification outcome, as it enables them to make informed privacy decisions.
Consequently, there is a critical need for more robust methods to assess privacy levels. Beyond the subjective nature of PQ, the vast diversity of data across multiple services and contexts necessitates a standardized definition of data categories. In short, developing robust PQ requires a well-structured definition of data categories, enabling the classification of each data point associated with an entity.
A static data categorization (e.g., non-personal or personal) is not enough to accommodate data diversity and the different ways of obtaining data, including transformations that either increase or decrease data sensitivity. In addition, static categorization introduces a certain ambiguity [
2] when considering data from multiple sources, processed data (e.g., inferred data), or other data formats (e.g., metadata). Understanding user preferences and external influences is also essential. Individuals naturally categorize their personal information digitally [
3], but these categorizations vary widely.
Recognizing the need for a broader set of data categories, the possibility of inferring Personally Identifiable Information (PII) from non-PII data during service monitoring, and the need to prioritize user preferences (privacy expectations and external influences) in categorization, we define a hierarchical, multidimensional data-categorization framework that places users and their interactions at its core. Data can be assigned static categories, or dynamic categories derived from users, to cope with novel environments. The categorization is designed for decentralized architectures and multiple data sources from which PII can be inferred.
The proposed dynamic and user-centric data categorization, denoted by Privacy-sensitive Data Categorization (PsDC), is based on existing privacy literature, is supported by analysis of real datasets, and considers legal frameworks and privacy standard groups. The primary focus is on providing a dynamic data categorization that represents user behavior, preferences, and different data type representations, with applicability to enhance novel, data-centric services for the upcoming generation of telecommunication architectures [
4].
The categorization scheme presented herein is intended to serve as a cornerstone of our evolving privacy-quantification framework. That facet of our research is still under development and will be published in the near future. The privacy-quantification model itself is designed to evaluate the privacy level of continuous data streams, based on the categories defined here, observed user behavior, and the dynamically shifting context of both execution and data application. Furthermore, we are actively developing a Machine-Learning (ML)-based approach to automate the classification of data according to our categories, integrating both stream and semantic feature analysis. This undertaking is informed by, though notably expands upon, the foundational concepts explored in [
5].
The rest of the paper is organized as follows:
Section 2 presents the current background and most relevant state-of-the-art in terms of privacy and privacy-based data categories;
Section 3 describes the proposed data-categorization framework named PsDC, and its applicability is explored in
Section 4; finally, the main conclusions are in
Section 5.
2. Background on Privacy and Data Categories
The constant evolution of technology and the changes in society make the interpretation of the term “privacy” complex; it cannot be reduced to a single definition. The high subjectivity of privacy arises from the different notions of what is private and what is public that coexist in society. Subjectivity also arises from how familiar people are with the relevant discussions and from the knowledge that subjects may have regarding the potential value and risks of sharing some data. Thus, privacy encompasses what is appropriate for public view, which can be influenced by several external factors, including cultural aspects. In addition, privacy extends beyond physical spaces and experiences to encompass digital information [
6].
The immense potential of personal data collection for various sectors brings privacy concerns, since individual data extend far beyond their personal use. In an attempt to protect individuals’ privacy, regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) aim to achieve a balance between technological innovation and individual control. Fortunately, across the world we are witnessing the adoption and updating of such regulations [
7].
Generally, these regulations empower individuals with adequate rights to protect their data, such as the rights to access, rectify, and delete their information, ensuring transparency and user consent in data collection. Unfortunately, we continue to witness privacy breaches, making it imperative to combine what is known as soft privacy, from legal regulations, with hard privacy, from technical approaches [
8].
Concentrating on the hard-privacy concept and considering a data-driven world, every interaction on digital platforms generates data that identify individuals, potentially revealing intimate details about their lives if misused. These data can range from sensitive financial information to personal health records, as well as political views, religious beliefs, or sexual orientation. Digital privacy centers on the responsible handling and use of privacy-sensitive data within digital environments. Such private data comprise personal information, communication, and online behavior. The idea behind digital privacy is to protect individuals’ rights and expectations to maintain the confidentiality and security of their personal information in the digital realm.
Another dimension of the problem is related to the combination of data and the inference of private information. Revealing knowledge from anonymized personal information unlocks another dimension of its value. By analyzing vast datasets, researchers and businesses can gain valuable insights into consumer behavior, market trends, and social phenomena. This knowledge fuels innovation in areas such as personalized medicine, targeted advertising, and social network analysis. However, ensuring ethical and responsible practices is crucial, guaranteeing user privacy and adherence to data-protection regulations. Moreover, even anonymized data can be vulnerable to privacy breaches through techniques such as clustering analysis, which is a powerful technique for organizing unlabeled data into meaningful groups.
2.2. Understanding Data Privacy Categorization
The current vulnerabilities affecting personal data make it imperative to comprehend the value of data, to evaluate privacy-protection systems and treat data more adequately and effectively.
Different approaches are being proposed in the literature to understand the privacy value of some data. Some of these approaches focus on developing taxonomies, ontologies, and vocabularies that evaluate data types and their relationships based on data categories.
Several ontologies have been proposed to provide a formalized knowledge representation, acting as a shared vocabulary for entities, concepts, and their relationships within a specific domain. This unified model goes beyond simply assigning attributes to data sources. These ontologies provide a structured representation of privacy requirements, helping to differentiate them from security requirements and prevent privacy violations [
9,
10]. Privacy ontologies are crucial tools when addressing privacy concerns in various domains. The majority of the authors aim at proposing privacy ontologies to ensure legal compliance, to simplify the process of understanding privacy policies, or to analyze the usage of data processors that often neglect or even misuse personal data [
11].
However, these works are limited to specific scenarios or have a legal basis with more generic perspectives. Gharib et al. [
11] reviewed the literature and identified that, while several studies emphasize the importance of integrating privacy considerations throughout the system-design process, a comprehensive ontology that captures all the main privacy concepts and relationships is still lacking. To address this gap, the authors proposed a novel privacy ontology that incorporates the core privacy concepts and relations, considering the following dimensions: organizational, risk, treatment, and privacy. In their ontology, they refer to data as personal or public; they do not explore the concept of PII beyond personal or non-personal data.
A key challenge remains: How do we better define personal data from a technical perspective? A new level of dynamism is required to better understand the level of privacy required for different types or categories of data. Essentially, to protect digital privacy in current systems, we argue for the need to focus on dynamic data categorization, which can handle various data formats, storage methods, and processing techniques, including data inference or derivation. The risk of data inference arises when seemingly benign data elements, whether analyzed independently or through data fusion, enable the derivation of privacy-sensitive information pertaining to an entity.
By assigning privacy categories to the full extent of data collected from various sources and in various types, data can be handled appropriately. Proper personal-data categorization enhances transparency, risk assessment, and decision-making regarding appropriate measures: adjusting the required protection level, reducing data collection to only what is necessary for the purpose, and adjusting retention periods. Furthermore, privacy-based categories are essential to an agile PQ model applied in the SDLC, as demonstrated in the DevPrivOps methodology.
These benefits of a data-categorization method have generated widespread interest in the literature for exploring potential categories across different application domains (see
Table 1).
Some authors focused on specific scenarios, such as Sun and Xu [
12] in social network scenarios, but developed data-categorization models that can be generalized to multiple scenarios. One such example is the work of Oh [
3], where the focus was on how people categorize information files. The study identified three types of personal digital information categorizers, based on three types of mind structures: rigid categorizers, fuzzy categorizers, and flexible categorizers. To gather data, a questionnaire-based approach was conducted with 18 participants.
Focusing on autonomous vehicles, Mlada et al. [
15] evaluated the riskiness of personal data based on four general criteria: the nature of the controller, public availability of data, scope/amount of data, and use of data processors. The result of this evaluation affects the calculation and magnitude of the final value, denoted by
R. The main criteria used to determine this value are the evaluation areas, the possible outcomes, and additional individual evaluation coefficients. The resulting value is then mapped into four evaluation levels.
In the field of data classification, Rumbold and Pierscionek [
16] focused on big data scenarios and classified data into six different categories, and assigned them a sensitivity scale by proposing a spectrum. However, some studies prefer to generalize the categorization model rather than focusing on specific scenarios. Milne et al. [
13] proposed a classification system for personal information based on the perceived risk level associated with different data types. Researchers identified four types of risks—monetary, social, physical, and psychological—and grouped data into these categories. They also collected data that were aggregated into six clusters, allowing them to consider direct and indirect data. Moreover, Oh and Belkin [
14] focused on data that are difficult to categorize, such as ambiguous or anomalous information. Ambiguous information can be placed into multiple categories, while anomalous information does not fit into the existing organizational structure.
The studies presented in this context avoid providing a static evaluation, concentrating instead on more dynamic and broad data interpretations, to understand whether personal data can be inferred from diverse types of other data.
Oh and Belkin [
14] proposed a detailed study aimed at tackling the challenge of categorizing difficult data. Although the study did not cover all the issues related to this problem, it provided a valuable research direction to explore further. Furthermore, other studies focused on proposing a classification framework while considering various parameters to refine the categorization results.
When considering data categorization, it is important to assess how the proposed approaches were tested by the different studies mentioned. The majority of these studies focused on conducting interviews with real people to understand their preferences. This approach aligns with the concept of privacy, as individuals have the right to decide what information should be protected and what should not.
In addition, there is a lack of practical tools to categorize data. PII-Codex [
17], a significant advance in the implementation of a privacy-based categorization framework, is a comprehensive tool designed for the detection, categorization, and severity assessment of PII. It considers extensive theoretical, conceptual, and policy research on PII categorization and severity assessment, and is integrated with PII detection software.
However, the current literature lacks analytical studies that can provide a deeper understanding of the impact of data categorization.
3. Proposed Privacy-Sensitive Data Categorization
After a careful review of a representative set of data-categorization approaches, we found value in evolving the existing categories into a broader data-categorization framework, denoted PsDC, that focuses on technical details. In this section, we describe the strategy of this data-categorization approach and the clearly identified data items that are relevant for pervasive, data-centric services. We also aim to present a structure that can be enhanced with other data types, with the capability to evolve and to be adapted to novel scenarios and applications.
To better understand how to properly categorize data based on privacy aspects, we manually analyzed the features of several datasets and selected 11 representative datasets containing data that potentially reveal individuals’ private information. When referring to existing data categories from the literature, legal regulations, and standards, we found that we could not accommodate all possibilities for data categorization. This is due to the potential for data inference or derivation, even when there is no apparent linkability across multiple datasets. For this reason, we selected a set of existing data categories from the literature, regulations, and standards, and organized them to generically address the needs of such datasets when categorizing data based on privacy aspects.
Moreover, PsDC puts the entity that owns the data at the center of the process, together with its interest in classifying data as a means to both understand impact and manage risk. Note that the entity can be an individual, a group of people (e.g., a family), or a device (e.g., a car). Focusing on user-centric approaches, we consider the entity in this case to be an individual using today’s permanently online services. We consider two main top-level categories for user data: data that directly identify a user, and data from behavioral attitudes, which include data from social interactions. Moreover, we add a dynamic view by considering data tags and data attributes. This novel approach provides PsDC with an engineering dimension to analyze data and understand their relationships, including the derivation of private data from other data.
The PsDC framework utilizes a hierarchical structure for data categorization with two primary categories. Data that directly identify individuals, such as name, age, or usernames, belong to the broader identifier category. This category subdivides into different subcategories, including human, biological, and devices. Following the tree structure, each of them also subdivides into more specific data categories. Thus, we capture a wide range of data that directly identify individuals: the human category refers to data that identify an individual directly; the biological category, intuitively, refers to data from biological sources such as DNA; and the devices category refers to data from devices directly used by the individual.
The other primary category pertains to behavior analysis, which reflects individual actions and interactions. This category encompasses data from social and online interactions, transactional behavior (including shopping patterns or legal activities), digital footprints (such as social media posts), and actions that can be habitual or singular. This category is crucial for capturing data that, while not directly identifying individuals, can still compromise their privacy. Rather than analyzing behavior data and then attempting to categorize them as direct data, we proactively identify behavior data as a broader category to categorize data as they are obtained.
The “identifier” category draws from recognized standards and a thorough review of the existing literature. The “behavioral” category, in contrast, offers a novel perspective on data classification, recognizing that behavioral data are not easily captured by conventional, static categorization. Furthermore, we have incorporated tags and attributes to provide a dynamic viewpoint and enhanced control over the data-classification process.
Each category delves further into sub-levels, with the final level representing the most specific data classifications within the hierarchy (see
Table 2). This final level offers the most granular detail and allows for a wider range of possible values associated with each data type.
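As an illustration, the hierarchy described above can be sketched as a nested structure with a helper that resolves a leaf category to its full path. The category and leaf names below are an illustrative excerpt assumed for this sketch, not the complete set from Table 2.

```python
# Illustrative excerpt of the PsDC hierarchy (not the complete set of
# categories from Table 2): two primary categories, each subdivided into
# subcategories whose leaves are the most specific classifications.
PSDC_TREE = {
    "Identifier": {
        "Human": ["Subject", "Age"],       # data identifying the individual directly
        "Biological": ["DNA"],             # data from biological sources
        "Devices": ["Username"],           # data from devices directly used by the user
    },
    "Behavior": {
        "Social": ["Interaction"],         # social and online interactions
        "Transactional": ["Shopping"],     # shopping patterns, legal activities
        "Digital_Footprint": ["Post"],     # e.g., social media posts
    },
}

def category_path(tree, leaf):
    """Resolve a leaf category to its full hierarchical path,
    e.g. 'Subject' -> 'Identifier/Human/Subject'."""
    for top, subcategories in tree.items():
        for sub, leaves in subcategories.items():
            if leaf in leaves:
                return f"{top}/{sub}/{leaf}"
    return None

print(category_path(PSDC_TREE, "Subject"))  # Identifier/Human/Subject
```

The flat path notation mirrors the “Identifier/Human/Subject” form used throughout this section.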
Data itself is not the only factor influencing the potential to identify an individual. We argue for the necessity to integrate dynamic evaluations, to achieve a deeper understanding of the relationship between data points, and to facilitate the detection of potential inferences of personal data. For this reason, the framework also considers data tags (see
Table 3) and data attributes (see
Table 4). Tags provide an extra dimension to data categorization, allowing for the identification of specific contexts, for instance, health-related or sensor-generated data. Attributes, conversely, delineate data characteristics such as their format or stage of development.
Regarding data tags, our proposed data-categorization framework considers dynamic tags to categorize data. These tags can be assigned directly to data, serving as a categorization in themselves, or they can complement the categorization provided by the tree described above. Data tags result from the diverse ways of obtaining, representing, and interpreting data.
Data tags encompass health sources, which can be medical information or wellbeing data, as well as skills, payment types, frequency, sensing, spatial, temporal, logical, and statistical data.
The medical tag encompasses anything related to an individual’s diagnostics, symptoms, or disorders of public concern, such as a pandemic situation. Wellbeing tags encompass data from non-medical sources that are nevertheless related to health aspects, such as blood pressure captured by a sensing device, typically denoted well-being data.
Furthermore, the temporal and spatial tags account for the high potential of inferring private data from non-private data. The different representations of temporal and spatial data are not addressed by the remaining categorizations. Understanding these representations becomes even more critical during pattern recognition, given the very large set of possible representations for each.
Data attributes further refine the categorization process, enabling a deeper understanding of data characteristics and their potential to reveal personal information. This includes aspects such as data storage format (organization, ciphered, or plaintext), maturity, metadata, direct or indirect data, and the presence of URIs or descriptive texts.
These attributes provide an additional layer of data categorization, giving an extra understanding of data “value”. Whether data are plaintext or encrypted significantly impacts observability and inference possibilities: plaintext data are readily accessible for analysis, whereas encrypted data require decryption before analysis. Similarly, anonymized data offer a certain level of protection but still carry associated privacy risks, such as the risk of re-identification.
Furthermore, metadata associated with data can be categorized based on their origin (social, technical, operational, or business). This metadata categorization provides an additional layer of information that can contribute to inferences about personal data. These formats further refine data categorization and highlight potential privacy concerns associated with different data types.
Finally, another attribute that should be highlighted is the string attribute. While the remaining categories, tags, and attributes refer to the data themselves, this attribute subdivides into three subattributes that can be processed to obtain more data and enable subsequent categorization. A simple example is the name “Catarina”, which is categorized as “Identifier/Human/Subject”. However, if the dataset field contains “My name is Catarina”, we apply the categorization with the attribute “String/Description_Text”, since the field also reveals private data, allowing us to then identify the name within the string.
Descriptions for all categories are detailed in
Table 2 for the most specific categories of the broad categories “identifier” and “behavior”;
Table 3 for the data tags for the categorization model; and
Table 4 for the data attributes for the categorization model.
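The string-attribute example above can be sketched in code. The detection pattern and the two-step output shown here are illustrative assumptions; a real implementation would rely on dedicated PII-detection tooling rather than a single regular expression.

```python
import re

# Hypothetical sketch: a free-text field first receives the
# "String/Description_Text" attribute; a detector then extracts the
# embedded value so it can be categorized by the tree (e.g., a name
# becomes "Identifier/Human/Subject").
NAME_PATTERN = re.compile(r"[Mm]y name is (\w+)")

def categorize_field(value):
    """Return (attribute, categorized values) for a raw dataset field."""
    match = NAME_PATTERN.search(value)
    if match:
        return "String/Description_Text", [("Identifier/Human/Subject", match.group(1))]
    return None, []

attr, cats = categorize_field("My name is Catarina")
print(attr)  # String/Description_Text
print(cats)  # [('Identifier/Human/Subject', 'Catarina')]
```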
Despite the advancements of our proposed framework for data-privacy categorization, we found value in improving the approach by adding one more perspective: individuals’ privacy expectations and privacy influences. When individuals are the data-generating entity in the proposed framework, privacy preferences and influences are crucial, given the high subjectivity associated with the meaning of privacy and with privacy decisions.
Privacy preferences and influence properties within PsDC provide our framework with high dynamism through the use of different weights for different data categories. Privacy is inherently subjective, with users possessing diverse notions of what constitutes personal data. PsDC ensures that individuals can define their own privacy preferences, including expectations and influence.
This user-centric approach allows individuals to assign weights to different data categories, reflecting their personal perception of a privacy violation (perceived risk). For instance, a user might consider biometric data to be more sensitive than browsing history, and would assign a higher weight to the “biometric data” category within the framework. These weights can then be used to dynamically calculate the overall privacy level associated with data access or processing activities.
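A minimal sketch of such a weighted computation follows. The category names, weight values, and the simple averaging scheme are assumptions made for illustration; the actual quantification model is still under development.

```python
# Illustrative sketch: a user assigns higher weights to categories they
# perceive as more sensitive; the privacy level of a data access is then
# derived from the weights of the categories it touches.
user_weights = {
    "Identifier/Biological/Biometric": 0.9,      # perceived as highly sensitive
    "Behavior/Digital_Footprint/Browsing": 0.3,  # perceived as less sensitive
}

def privacy_level(accessed_categories, weights, default=0.5):
    """Average the user-assigned weights over the categories accessed;
    unknown categories fall back to a default weight."""
    if not accessed_categories:
        return 0.0
    return sum(weights.get(c, default) for c in accessed_categories) / len(accessed_categories)

score = privacy_level(
    ["Identifier/Biological/Biometric", "Behavior/Digital_Footprint/Browsing"],
    user_weights,
)
print(round(score, 2))  # 0.6
```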
Finally, the PsDC, presented as a simplified data-categorization model, is structurally hierarchical rather than a single mathematical function. However, we are developing two functional models: one for automated data classification using PsDC, employing semantic and stream features to cluster and categorize data, particularly effective in IoT contexts; and another for privacy quantification, which utilizes PsDC categories, sensitivity levels, and behavioral data to estimate privacy risks within data streams. These ongoing developments aim to provide a more dynamic and quantifiable approach to data privacy.
4. Applying PsDC in DevPrivOps Use Cases
DevPrivOps [
1] is a novel methodology for software development that treats privacy as a primary requirement rather than an afterthought. Achieving this requires developing adequate tools that help developers produce privacy-enhanced applications, but also means for users to track and understand how their data are gathered, processed, and used.
PsDC can be understood as a machine-readable and human-understandable file containing mandatory privacy fields, such as data type, tags, attributes, associated risk, and personalized fields to include user preferences. This file can serve as input for PQ, evaluating whether the software meets the required privacy standards.
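A hypothetical sketch of such a file for a single data field is shown below, serialized as JSON. The field names are illustrative assumptions, not a finalized schema.

```python
import json

# Hypothetical PsDC manifest for a single data field; the field names
# are illustrative, not a finalized schema.
manifest = {
    "field": "heart_rate",
    "category": "Identifier/Biological/Vital_Sign",
    "tags": ["Health/Wellbeing", "Sensing", "Temporal"],
    "attributes": {"format": "plaintext", "metadata_origin": "technical"},
    "associated_risk": "medium",
    "user_preferences": {"weight": 0.8},  # user-assigned sensitivity
}

# The serialized form is machine-readable (JSON) yet human-understandable,
# and can be handed to the privacy-quantification (PQ) component.
print(json.dumps(manifest, indent=2))
```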
Recognizing the inherently personal nature of privacy, we refrain from assigning fixed sensitivity levels to categories. Instead, we offer a framework, the PsDC, that empowers users or privacy specialists to define the appropriate sensitivity level for each category. We are, in fact, exploring the use of ML models to estimate the sensitivity of each category based on a user’s historical behavior. However, this aspect of our work is currently in its developmental stages.
As detailed in
Section 2, the proposed categories were designed to serve as inputs for the privacy quantification component. As previously mentioned (in
Section 1), an automated ML-based categorization approach is currently under active development. This endeavor draws inspiration from [
5] and integrates both stream and semantic feature analysis to establish a metric for data-stream similarity. Leveraging this similarity assessment, we aim to cluster previously uncategorized data against our curated, labeled data repository and then propagate categories and labels from the established data streams to the novel ones. The work that inspired this approach demonstrates promising outcomes for IoT stream categorization, and we anticipate analogous results in our domain, given the shared theoretical underpinnings of the respective models.
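The propagation step can be sketched as nearest-neighbour label propagation over stream feature vectors. The feature choices and the Euclidean distance below are illustrative assumptions; the actual model under development combines stream and semantic features in a more elaborate similarity metric.

```python
import math

# Labeled repository: stream feature vectors (e.g., message rate, mean
# payload size, periodicity) with known PsDC categories. The features
# and category assignments here are illustrative.
labeled = {
    (1.0, 64.0, 0.9): "Behavior/Social/Interaction",
    (0.1, 512.0, 0.2): "Identifier/Devices/Telemetry",
}

def propagate_label(features, repository):
    """Assign a new stream the category of its nearest labeled stream
    (Euclidean distance over the feature vector)."""
    nearest = min(repository, key=lambda ref: math.dist(features, ref))
    return repository[nearest]

print(propagate_label((0.9, 70.0, 0.8), labeled))  # Behavior/Social/Interaction
```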
It is crucial to acknowledge that while several privacy-enhancing technologies (PETs) exist, with differential privacy being the most prominent, their applicability is often limited. These methods primarily focus on minimizing fingerprinting and fall short when considering broader concerns such as profiling and indirect inference. Research by [
18,
19,
20] has demonstrated that, despite its potential to degrade ML accuracy, the effectiveness of differential privacy is highly dependent on the specific dataset and model. In fact, some models exhibit improved accuracy when trained on anonymized data compared to the raw dataset. Another significant study by [
21] highlights the substantial computational cost associated with differential privacy, particularly for large datasets, often outweighing its privacy-enhancing capabilities.
Furthermore, with the rapid advancements in Large Language Models (LLMs), and their remarkable ability to dissect and navigate complex, multi-dimensional datasets, their potential use for PII inference becomes a tangible concern. While these models are presently optimized for coherent text generation within a request-response framework, the fundamental architecture of LLMs, often built upon variations of the transformer layer, could theoretically be adapted for PII inference given access to diverse data sources. The current unknowns lie in the scale of training and the need for meticulously curated datasets to ensure a high degree of accuracy in this specific application. While it is conceivable to leverage the internal workings of LLMs to implement our categorization method, this approach presents several challenges: firstly, the computational resources and dataset sizes needed for effective performance remain uncertain; secondly, the suitability of pre-trained models for this specific application is questionable; and thirdly, our current resources do not permit such extensive exploration. Furthermore, our developing categorization model uniquely combines stream-based features with textual analysis, offering a distinct analytical perspective while requiring significantly less computational power.
Considering these observations, it is reasonable to conclude that privacy categorization remains an emerging field with immense potential to significantly enhance privacy for the average user.
This approach contributes to various aspects of privacy-conscious software development, including user-centric design interfaces, enhancing the possibility of informed consent for data collection, providing users with greater control over their personal data, autonomous mechanisms for data-breach detection, and finally, considering non-human users that collect individual data, such as smart devices.
4.2. Network Data Communication with PsDC
PsDC applies to current network topologies where data are processed and analyzed closer to the data sources. By appropriately classifying data at the network edge, data processors can prioritize security measures for the most sensitive data categories. Thus, data with a higher risk of privacy breaches receive stronger protection, mitigating potential vulnerabilities. Conversely, low-risk data can benefit from more relaxed security measures, optimizing resource allocation within the smart-home system.
Moreover, because edge data are often diverse and can be combined with external sources, it is essential to comprehend the evolving privacy implications. A dynamic approach enables continuous assessment of data sensitivity at the edge, understanding the privacy requirements for data inference or combinations and implementing appropriate security controls. For instance, with network flow streams, log analysis and data categorization based on various sensitivity levels become crucial to determine when PII is being obtained from non-critical data. By detecting such situations, software can be adapted to avoid this issue, ensuring a high level of compliance and minimizing legal risks. PsDC assists in this by offering a wide range of different PII categories in a structured hierarchy that places the user at the center of the categorization, while the additional tags and attributes introduce a dynamic nature to the categorization.
Continuous monitoring advocates for a dynamic approach to categorizing data to identify when personal data might be inferred from seemingly low-risk data. Systems should be equipped with a module to dynamically capture network traffic, analyze it, categorize data based on PsDC, and then send the completed machine-readable file to the component responsible for quantifying privacy.
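A schematic sketch of such a monitoring module follows. The function names, record format, and categorization rules are hypothetical; real traffic capture would use a dedicated traffic-analysis library, and categorization would apply the full PsDC model rather than the toy rules below.

```python
import json

def categorize_record(record):
    """Toy PsDC categorization of one captured record (hypothetical rules)."""
    if "user" in record:
        return {"field": "user", "category": "Identifier/Devices/Username"}
    return {"field": "payload", "category": "Behavior/Social/Interaction"}

def monitor(records):
    """Categorize captured records and build the machine-readable file
    handed to the component responsible for quantifying privacy."""
    manifests = [categorize_record(r) for r in records]
    return json.dumps(manifests)

# Stand-in for dynamically captured network traffic.
captured = [{"user": "alice"}, {"payload": "chat message"}]
print(monitor(captured))
```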
A concrete scenario where PsDC can be applied is in smart environments, including smart cities, homes, or factories. PsDC can be utilized when integrating a new device into an existing network of interconnected smart devices. Based on the device’s function, the data it collects should be categorized according to PsDC categories. The device should then only be enabled in cases where user privacy is not compromised, leveraging the analysis of the PQ model’s output.
In addition, the categories of data processed by the smart device help select the most appropriate device. In cases where a device has multiple functions but we only need one, for example, we can use privacy-based categories to identify and restrict data storage to only the required data.
In this context, PsDC will be applied during the onboarding of network services within the scope of a European project named RIGOUROUS (for more information, see
https://rigourous.eu/). In this project, we are considering the integration of a machine-readable and human-understandable file as a privacy manifest sent from the services to the network infrastructure. This file should specify the privacy requirements to ensure privacy-based service instantiation.