- Research
- Open access
Real-time trajectory privacy protection based on improved differential privacy method and deep learning model
Journal of Cloud Computing volume 11, Article number: 57 (2022)
Abstract
Accurate, real-time trajectory data publishing plays an important role in providing users with the latest traffic and road condition information, helping them plan travel times and routes rationally. However, improper publishing of location information, combined with reverse analysis and inference, can easily leak users' personal information and thus threaten their privacy and safety. Because the differential privacy model injects noise, privacy protection reduces the accuracy and validity of the published data. To improve the accuracy and usability of published data, we propose a publishing method based on deep learning and differential privacy models for securing spatiotemporal trajectory data publishing. The method divides the trajectory data into a two-dimensional grid, counts the trajectory density of each grid cell, performs a top-down recursive division of regions, and formulates privacy budget allocation rules from multiple perspectives as the recursion depth increases. Furthermore, the method integrates the spatiotemporal sequence data in temporal order, extracts the temporal and spatial features of the data with a temporal graph convolutional network to predict the privacy budget matrix, adds Laplace noise to the regions, and evaluates the differential privacy protection effect against the original data to protect trajectory data privacy. Experiments demonstrate that, under the premise of satisfying ε-differential privacy, the query error and Jensen–Shannon divergence are smaller, the Kendall coefficient is more consistent, and the upper and lower bounds are more stable; hence, the top-down division method achieves better results than the two traditional region division methods, the uniform grid and the adaptive grid. The proposed method allocates the privacy budget more reasonably and achieves privacy protection of trajectories, and it can be applied to large volumes of spatiotemporal trajectory data.
Introduction
Societal advancement and economic development are rapidly facilitating smart city planning. The widespread adoption of intelligent transportation systems, the internet of vehicles, and location-based services has increased the amount of data containing location information [1]. Owing to the desirable benefits of these modern applications, several "Internet+" travel methods have been proposed. Meanwhile, high-quality trajectory data publishing can deliver high-precision location-based services, provide users with accurate road condition information, and help users schedule time and plan routes. However, inappropriate publishing of location information and inference through reverse analysis can easily breach users' privacy (for example, by revealing their specific location and movement trajectory), which may endanger lives and property.
The differential privacy model achieves differential privacy protection by adding random noise to published location data [2], and the model has been widely used in the field of data publishing privacy protection. In fact, to protect users’ privacy, adding differential privacy model noise introduces a deliberate discrepancy between published location data and actual statistical values [3]. The trajectory data publishing of regional locations aims for accuracy and efficiency without compromising users’ privacy. Hence, it is essential to address the privacy protection problem of trajectory data publishing.
The meshing of 2D spatial regions and the prediction of trajectory traffic are crucial in trajectory data publishing. Several models are currently used for trajectory traffic prediction, including the autoregressive integrated moving average model [4], the support vector regression machine learning model [5], the hidden Markov model [6], and partial neural network models [7]. However, these models consider only the temporal dependence of trajectory traffic and ignore its spatial dependence, so they cannot accurately predict the trajectory traffic in a region. To incorporate spatial dependence, convolutional neural networks have been introduced [8, 9]; their drawback is that they are best suited to image processing rather than to complex spatial structures.
Recently, rapid advances in graph neural networks have yielded desirable results in the fields of transportation and meteorology [10,11,12]. Graph convolutional neural networks have a natural advantage in tasks involving complex spatial structures and can effectively capture the spatial dependence of regions while fully extracting spatial features. The grid division methods for 2D spatial regions are the uniform grid (UG) [13], adaptive grid (AG) [14], and PrivTree divisions [15]. These methods divide regions effectively but have high time complexity, making them unsuitable for timely and effective publishing of trajectory data. In different regions, under-division and over-division problems may occur, and it is difficult to equalize noise errors, resulting in an unreasonable privacy budget allocation that weakens differential privacy protection. To address these problems, we propose a differential privacy budget allocation method based on a temporal graph convolutional network (T-GCN) [16] to predict future privacy budget demands, together with a top-down division of regions for an accurate and rational allocation of the privacy budget. The main contributions of this study are summarized as follows:
-
(1)
We formulate the grid division conditions for regions, use a recursive algorithm to divide similar regions by top-down search, and design the rules for privacy budget allocation from multiple perspectives as the recursive depth increases; this approach reasonably allocates privacy budget for each location.
-
(2)
We use the T-GCN model to predict the privacy budget matrix for the future. Laplace noise is added in advance at the regional locations, and the trajectory noise data are published.
-
(3)
We evaluate our approach using Yonsei pedestrian trajectory data and Citi Bike datasets. The results show that the T-GCN prediction is more stable and better than other deep learning models. In addition, compared with other region division methods, the method of a top-down division of regions for privacy budget allocation and the addition of Laplace noise is relatively more robust for each metric in the horizontal and vertical comparisons of privacy protection effects. Moreover, the method is effective for differential privacy protection.
Related work
The nonlinear and uncertain characteristics of trajectory data make it difficult to avoid the interference of random events, emphasizing the significance of studying trajectory data publishing. Dwork [2] proposed a differential privacy model that adds Laplace noise to region locations for differential privacy protection; the method has a rigorous mathematical foundation and has been widely used in the field of data publishing. The UG division method [13] uniformly divides the 2D space into numerous grids, counts the trajectories in each grid, and adds noise to achieve differential privacy protection. This division approach is simple and efficient but easily under-divides the region, weakening the availability of the published data. Meanwhile, the AG division method [14] builds on a UG division of each region and adds noise to achieve differential privacy protection. This division method better reflects the influence of data distribution characteristics on the division structure but ignores the over-division problem caused by complex spatial structures. Furthermore, the PrivTree division publishing method [15] introduces a controlled deviation for deciding whether to perform quadtree division, eliminating the restriction of a predefined quadtree division depth. Yan et al. [3] proposed the spatiotemporal long short-term memory (ST-LSTM) model for predicting the structural hierarchy of region locations and adding Laplace noise to achieve differential privacy protection. However, the model does not allocate the differential privacy budget precisely, and the granularity of the budget assignment is not fine enough.
The traditional hidden Markov model for spatiotemporal trajectory prediction [6] is based on spatiotemporal density clustering, which analyzes the correlation between time and space to predict different distributions of spatiotemporal series data. Traditional grid region division ignores the sparsity and density of the grid, the added noise increases the query error, and traditional prediction methods are only suitable for smooth data and fail to capture the hidden nonlinear features in spatiotemporal sequences [17] (the trajectory segments of spatiotemporal trajectory data interact and correlate with one another, and deep learning methods can exploit such data more effectively to fully mine the hidden linear and nonlinear features). To publish trajectory data effectively, we combine deep learning with the differential privacy protection model, design and implement a scheme for dividing the grid region of spatiotemporal sequence trajectory data, predict the privacy budget matrix, and add Laplace noise to each region location to achieve differential privacy protection. Overall, the original trajectory data are protected effectively.
Deep learning model
Overview
Since 2006, deep learning has been an emerging area of machine learning research and, driven by developments in computing power, has become a popular research topic. With rapid advances in deep learning, deep neural networks demonstrate excellent performance in discovering intricate structures in high-dimensional data and outperform traditional methods [18]. In addition to beating records in image recognition [19, 20] and speech recognition [21], deep neural networks have beaten other machine-learning techniques at predicting the activity of potential drug molecules [22]. Deep neural networks can now effectively capture the dynamic features of data and achieve excellent prediction results. For example, convolutional neural networks can be used for image classification, image segmentation, and object detection; typical examples include AlexNet (trained on ImageNet) [23], ResNet [24], U-Net [25], and R-CNN [26]. Recurrent neural networks can identify relationships within time series data and predict future occurrences; LSTM [27] and the gated recurrent unit (GRU) [28] are popular examples. In the field of meteorology, convolutional and recurrent neural networks have been widely used to predict rainfall, wind speed, and radar echoes [29,30,31]; examples of such networks are convolutional LSTM (ConvLSTM) [29] and ST-LSTM [30]. In the field of transportation, researchers [11, 16, 32, 33] have used graph networks and recurrent neural networks to predict traffic flow, with T-GCN [16] and ST-GCN [11] as examples. These examples demonstrate that neural networks have achieved the desired progress in prediction and classification tasks across several fields.
T-GCN
The T-GCN model [16] comprises the graph convolutional network (GCN) [34] and GRU [28]. T-GCN essentially captures the spatial dependence of the topology using GCN and captures the temporal dependence of the trajectory data using GRU.
Definition 1 Given a network graph G, we use an unweighted graph G = (V, E) to describe the topological structure of the network graph. V = {v1, v2, ⋯, vN × N} is the set of nodes corresponding to the grid region locations, where N × N denotes the number of grid region locations; E is the set of edges. The adjacency matrix M represents the connections between grid region locations, \(M\in {R}^{N^2\times {N}^2}\); because the adjacency matrix only encodes whether nodes are adjacent, its elements are 0 and 1: an element is 0 if no link exists between the two nodes and 1 if a link exists.
Definition 2 Given a feature matrix X, we regard the trajectory information on the network as the attribute feature of the nodes in the network, expressed as \(X\in {R}^{N^2\times P}\), where P represents the number of node attribute features (the length of the historical time series), and \({X}_t\in {R}^{N^2\times 1}\) represents the trajectory density at each location at time t.
Thus, predicting the spatiotemporal trajectory can be regarded as learning a mapping function F given a graph G and a feature matrix X. The trajectory information for the next T moments is then obtained as shown in the following equation:
\(\left[{X}_{t+1},\cdots, {X}_{t+T}\right]=F\left(G;\left({X}_{t-n},\cdots, {X}_{t-1},{X}_t\right)\right)\) (1)
where n is the length of the historical time series, and T is the length of the predicted time series step.
The internal structure of the T-GCN model is illustrated in Fig. 1a: the historical trajectory data are fed into the model, the GCN model [34] extracts spatial features, the GRU model [35] extracts temporal features, and the data for future moments are output. Figure 1b depicts the graph convolution of the grid location information by the GCN model. Assuming that the red grid is the central location, the GCN model can obtain the topological relationship between the central location and its surrounding green grids, encode the topological structure of the grid network together with the attributes on the grids, and thereby capture spatial dependence. Figure 1c illustrates the structure of a GRU cell, which extracts the temporal feature of a grid region at time t by taking the hidden state at time t − 1 as input. In this way, the GRU model captures the current trajectory information while retaining the changing trend of historical traffic information, thereby capturing temporal dependence. The model equations are expressed as follows: Eq. (2) represents the graph convolution process, and Eqs. (3), (4), (5), and (6) represent the GRU cell process.
\(F\left(X,M\right)=\sigma \left(\hat{M}\ \mathrm{Relu}\left(\hat{M}X{W}_0\right){W}_1\right)\) (2)
\({\mu}_t=\sigma \left({W}_{\mu}\left[F\left({X}_t,M\right),{h}_{t-1}\right]+{b}_{\mu}\right)\) (3)
\({r}_t=\sigma \left({W}_r\left[F\left({X}_t,M\right),{h}_{t-1}\right]+{b}_r\right)\) (4)
\({c}_t=\tanh \left({W}_c\left[F\left({X}_t,M\right),{r}_t\ast {h}_{t-1}\right]+{b}_c\right)\) (5)
\({h}_t={\mu}_t\ast {h}_{t-1}+\left(1-{\mu}_t\right)\ast {c}_t\) (6)
where F(·) denotes the graph convolution process, X is the feature matrix, M is the adjacency matrix, \(\overset{\sim }{M}=M+{I}_{N\times N}\) is the adjacency matrix with self-loops, and \(\hat{M}={\overset{\sim }{D}}^{-\frac{1}{2}}\overset{\sim }{M}{\overset{\sim }{D}}^{-\frac{1}{2}}\) denotes its Laplacian normalization. \(\overset{\sim }{D}\) is the degree matrix, where \({\overset{\sim }{D}}_{ii}={\sum}_{j=1}^{N\times N}{\overset{\sim }{M}}_{ij},i=1,\cdots, N\times N\). W0 and W1 denote the weight matrices of the graph convolution; Wμ, Wr, and Wc and bμ, br, and bc denote the weights and biases of the corresponding gates. ht − 1 is the output at moment t − 1, μt is the update gate, rt is the reset gate, and ct is the candidate state at moment t. σ(·), Relu(·), and tanh(·) denote activation functions.
In summary, the T-GCN model captures the complex topology of the grid regions through the GCN, and the GRU cells subsequently capture the dynamic changes in trajectory information. The T-GCN model uses a two-layer GCN to learn the spatial features of the grid region locations and the GRU to learn the temporal features of the trajectory data. Ultimately, the T-GCN model can sufficiently learn the spatial and temporal dependence to achieve trajectory prediction.
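To make the computation in Eqs. (2)–(6) concrete, the following is a minimal sketch of a single T-GCN step in PyTorch, assuming a dense adjacency matrix and one density feature per node; the module and variable names are illustrative and do not come from the authors' implementation.

```python
# Minimal sketch of one T-GCN step: two-layer graph convolution feeding GRU gates.
import torch
import torch.nn as nn


def normalized_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """M_hat = D~^{-1/2} (M + I) D~^{-1/2} for an unweighted adjacency matrix M."""
    m_tilde = adj + torch.eye(adj.size(0))
    d_inv_sqrt = torch.diag(m_tilde.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ m_tilde @ d_inv_sqrt


class TGCNCell(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.w0 = nn.Linear(1, hidden_dim)            # first graph-convolution layer
        self.w1 = nn.Linear(hidden_dim, hidden_dim)   # second graph-convolution layer
        self.gates = nn.Linear(2 * hidden_dim, 2 * hidden_dim)  # update + reset gates
        self.cand = nn.Linear(2 * hidden_dim, hidden_dim)       # candidate state

    def graph_conv(self, x, m_hat):
        # F(X, M) = sigma(M_hat Relu(M_hat X W0) W1): captures spatial dependence
        return torch.sigmoid(m_hat @ self.w1(torch.relu(m_hat @ self.w0(x))))

    def forward(self, x_t, h_prev, m_hat):
        gx = self.graph_conv(x_t, m_hat)
        u_t, r_t = torch.sigmoid(self.gates(torch.cat([gx, h_prev], -1))).chunk(2, -1)
        c_t = torch.tanh(self.cand(torch.cat([gx, r_t * h_prev], -1)))
        return u_t * h_prev + (1.0 - u_t) * c_t        # h_t: captures temporal dependence


# Usage: an 8 x 8 grid gives 64 nodes; roll the cell over the historical time steps.
nodes, hidden = 64, 16
m_hat = normalized_adjacency((torch.rand(nodes, nodes) > 0.8).float())
cell, h = TGCNCell(hidden), torch.zeros(nodes, hidden)
for x_t in torch.rand(5, nodes, 1):                    # 5 historical density snapshots
    h = cell(x_t, h, m_hat)
```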
Differential privacy
The differential privacy model is effective for protecting privacy. The model primarily adds noise to the original data to prevent differential attacks, making it impossible for an attacker to identify specific samples in the dataset. If trajectory data are added to a location and the information of that location changes, the overall data distribution changes only slightly with the dynamic change of the trajectory data; thus, the technique is suitable for differential privacy protection.
Definition 3 ε-differential privacy [2]. For two adjacent datasets T1 and T2 with a Hamming distance of 1 (the two datasets differ in only one record), and for any output set S of algorithm A, if the probability Pr satisfies
\(\Pr \left[A\left({T}_1\right)\in S\right]\le {e}^{\varepsilon}\Pr \left[A\left({T}_2\right)\in S\right]\)
then algorithm A satisfies ε-differential privacy, where ε is the privacy budget; the smaller its value, the more noise algorithm A adds and the better the privacy protection.
Definition 4 Sensitivity [36, 37]. Given a query function f(·), the sensitivity is the maximum L1-norm distance between the query results on datasets T1 and T2, defined as:
\(\Delta f=\underset{T_1,{T}_2}{\max }{\left\Vert f\left({T}_1\right)-f\left({T}_2\right)\right\Vert}_1\)
Definition 5 Laplace mechanism [37]. The Laplace mechanism adds independent noise to the result of the query function f(·) to achieve differential privacy protection. f(T) represents the query result of the dataset T, and the addition of the noise φ can be denoted as A(T) = f(T) + φ, where φ is an independent identically distributed random variable obeying the Laplace distribution with a probability density function of \(\mathit{\Pr}\left[\varphi =x\right]=\frac{1}{2b}{e}^{-\frac{\left|x-\mu \right|}{b}}\);\(b=\frac{\Delta f}{\varepsilon }\), and μ is the location parameter, where μ = 0 in this study.
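As a small illustration of Definition 5, the snippet below perturbs a count query with Laplace noise of scale Δf/ε using NumPy; the numbers are made up for demonstration and do not come from the paper's experiments.

```python
import numpy as np

def laplace_mechanism(true_count, sensitivity, epsilon, rng=np.random.default_rng()):
    # A(T) = f(T) + Lap(Delta f / epsilon), with location parameter mu = 0
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a count query with sensitivity 1 protected with epsilon = 0.5
print(laplace_mechanism(true_count=128, sensitivity=1.0, epsilon=0.5))
```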
Definition 6 Serial combination property [38]. Given a set of randomized algorithms {A1, ⋯, An}, where each algorithm Ai satisfies εi-differential privacy for the dataset T, the sequence of algorithms achieves \({\sum}_{i=1}^n{\varepsilon}_i\)-differential privacy for the dataset T.
Definition 7 Parallel combination property [38]. Given a set of randomized algorithms {A1, ⋯, An}, where each algorithm Ai satisfies εi-differential privacy for the disjoint subsets {T1, ⋯, Tn} of dataset T, the algorithms together achieve max{εi}-differential privacy for dataset T.
Differential privacy data publishing based on UG and AG
Definition 8 Count matrix. Initialize the N × N 2D grid region and construct a matrix C to represent the counts of the region. The matrix element cntrc (r, c = 1, ⋯, N) indicates the cumulative number of trajectories in the grid cell. The matrix C is expressed as:
\(C=\left[\begin{array}{ccc} cnt_{11}& \cdots & cnt_{1N}\\ \vdots & \ddots & \vdots \\ cnt_{N1}& \cdots & cnt_{NN}\end{array}\right]\)
Model design by UG
UG [13] divides the region uniformly: given a differential privacy budget, Laplace noise is added uniformly and the noisy data are then released to achieve differential privacy protection. Adding noise through UG division is the most rudimentary method; it essentially ignores future changes in the trajectory data and only requires the noisy data to be published before the real data. The process is illustrated in Fig. 2 and outlined as follows (a code sketch follows the list):
-
(1)
The trajectory data Td, Td + 1 on day d and day d + 1 are counted.
-
(2)
Data Td, Td + 1 are generated into a fine-grained collection {Td, j, Td + 1, j | j = 1, ⋯, x} according to the time interval Δt.
-
(3)
The fine-grained data Td, j, Td + 1, j are divided uniformly into a 2D grid of m × m, where m is the granularity of division.
-
(4)
The count of the 2D grid region is determined to get the count matrix collection {Cd, j, Cd + 1, j | j = 1, ⋯, x}.
-
(5)
Laplace noise is distributed for each region location, i.e., \(cn{t}_{rc}^{\ast }= Lap\left(\frac{S}{\upvarepsilon}\right)\), where ε is the privacy budget, and S is max{cntrc}.
-
(6)
The count matrix Cd + 1, j of the present data is calculated, and Laplace noise is added.
-
(7)
The data are published.
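A minimal sketch of the UG publishing steps above, assuming NumPy and synthetic trajectory points in the unit square; the grid granularity and privacy budget values are illustrative, and the sampled noise is added to the true counts before release.

```python
import numpy as np

def ug_publish(points, m=8, epsilon=0.5, rng=np.random.default_rng(0)):
    """Uniform-grid publishing: count trajectories per cell, add Laplace noise, release."""
    # Steps (3)-(4): divide the unit square into an m x m grid and build the count matrix C
    rows = np.clip((points[:, 0] * m).astype(int), 0, m - 1)
    cols = np.clip((points[:, 1] * m).astype(int), 0, m - 1)
    counts = np.zeros((m, m))
    np.add.at(counts, (rows, cols), 1)
    # Steps (5)-(6): add Laplace noise with scale S / epsilon, where S = max{cnt_rc}
    sensitivity = counts.max()
    noisy = counts + rng.laplace(0.0, sensitivity / epsilon, size=counts.shape)
    return noisy  # Step (7): the noisy count matrix is what gets published

noisy_counts = ug_publish(np.random.rand(1000, 2))
```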
Model design by AG
AG [14] divides the region adaptively, and its trajectory data publishing idea has two layers. The first layer divides the 2D region into a grid of size m1 × m1 and adds Laplace noise according to the differential privacy budget αε (0 < α < 1) to obtain the noise v at each location. The second layer re-divides each grid cell according to the noise v by adaptively selecting a new granularity m2 × m2 and adds noise again to obtain the noise u at each location; reference [14] gives the method for adaptively selecting the region. Specifically, given a cell with a noisy count of N′, to minimize the errors, the cell should be partitioned into m2 × m2 cells, where m2 is computed as follows:
where [·] represents rounding down, (1 − α)ε is the remaining privacy budget for obtaining noisy grid counts, and c is set to 10.
The original noise v and the noise u of the redivided region are then weighted to obtain a more precise noise, making the deviation of the location noise error smaller. The weighted average formula is defined as:
Finally, the noise is uniformly distributed to all grids as the final noise at each region location. The uniform distribution noise formula is defined as:
The difference between the two methods is that the UG method treats all regions equally, whereas the AG method divides regions adaptively before adding Laplace noise. Because adding noise would otherwise require the trajectory data of future moments, which cannot be accessed at publishing time, the trajectory data published in the previous moment are used as the reference data for future moments to obtain the publishing noise. The process is illustrated in Fig. 3 and explained as follows (a code sketch follows the list):
-
(1)
The trajectory data Td, Td + 1 on day d and day d + 1 are counted.
-
(2)
Data Td, Td + 1 are generated into a fine-grained collection {Td, j, Td + 1, j | j = 1, ⋯, x} according to the time interval Δt.
-
(3)
The fine-grained data Td, j, Td + 1, j are divided uniformly into a 2D grid of m1 × m1; m1 is the granularity of division.
-
(4)
The count of the 2D grid region is calculated to get the count matrix collection {Cd, j, Cd + 1, j | j = 1, ⋯, x}.
-
(5)
Laplace noise is distributed for each region location, i.e., \(cn{t}_{rc}^v= Lap\left(\frac{S}{\varepsilon}\right)\), where ε is the privacy budget, and S is max{cntrc}.
-
(6)
If the adaptive division condition is satisfied, the Laplace noise is redistributed for the new range of m2 × m2, i.e., \(cn{t}_{rc}^u= Lap\left(\frac{S}{\varepsilon}\right)\).
-
(7)
\(cn{t}_{rc}^v\) and \(cn{t}_{rc}^u\) are weighed to distribute the noise uniformly to all grid locations.
-
(8)
The count matrix Cd + 1, j of the present data is calculated, and Laplace noise is added.
-
(9)
The data are published.
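The following sketch illustrates the adaptive second-level step, assuming a rounding-down form of the m2 formula consistent with the description above (m2 = ⌊√(N′(1 − α)ε / c)⌋); the constant values are illustrative, and the paper's exact formula and weighting of the two noisy estimates may differ.

```python
import math

def second_level_granularity(noisy_count, epsilon, alpha=0.5, c=10):
    """Assumed AG rule: m2 = floor(sqrt(N' * (1 - alpha) * epsilon / c)), at least 1."""
    return max(1, math.floor(math.sqrt(max(noisy_count, 0) * (1 - alpha) * epsilon / c)))

# Example: a first-level cell with noisy count N' = 400 under total budget epsilon = 0.5
m2 = second_level_granularity(noisy_count=400, epsilon=0.5)
print(m2)  # the cell would be re-divided into an m2 x m2 sub-grid and re-noised
```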
Differential privacy data publishing based on deep learning and top-down approach
Definition 9 Density matrix. Initialize the N × N 2D grid region and construct a matrix D to represent the density of the region. The matrix element ρrc (r, c = 1, ⋯, N) indicates the density of the trajectories, where \({\rho}_{rc}=\frac{cn{t}_{rc}}{are{a}_{rc}}\) and arearc represents the area of the grid cell. The matrix D is expressed as:
\(D=\left[\begin{array}{ccc} \rho_{11}& \cdots & \rho_{1N}\\ \vdots & \ddots & \vdots \\ \rho_{N1}& \cdots & \rho_{NN}\end{array}\right]\)
Definition 10 Regional division condition. In a 2D region, we design a rule such that if the trajectory densities within the region are similar, the region is divided off as a sub-region, and privacy protection is provided for each sub-region. The rule is defined as:
where \(\overline{\rho}\) is the average density of the grid region, and ξ is the division factor; the smaller the factor, the stricter the condition for division.
Definition 11 Privacy budget matrix. Initialize the N × N 2D grid region and construct a matrix E to represent the result of the privacy budget allocation. The matrix element erc (r, c = 1, ⋯, N) indicates the size of the privacy budget allocated to the grid cell; erc has an initial value of 0. The matrix E is expressed as:
\(E=\left[\begin{array}{ccc} e_{11}& \cdots & e_{1N}\\ \vdots & \ddots & \vdots \\ e_{N1}& \cdots & e_{NN}\end{array}\right]\)
Privacy budget allocation strategies
In this study, we use a top-down division of regions to allocate the privacy budget; the range to be searched is divided into two parts at each step, and the depth of recursion is h (N = 2^h). If the searched region satisfies the division condition, all grid locations in the region are treated as a single location, and the allocated privacy budget is adjusted. Otherwise, if the division condition is not satisfied, a new privacy budget is reallocated to the region.
According to the definition of differential privacy, the smaller the value of ε is, the better the privacy protection of the region, and for dense regions, a smaller value of ε is expected to be allocated. On the contrary, for sparse regions, a larger value of ε is expected to be allocated. Therefore, according to the total differential privacy budget ε, the recursion depth is i ∈ {1, ⋯, h}, and ε is divided into an increasing sequence, which is defined as:
where \({a}_i=\frac{i\left(i+1\right)}{2}\) (the sum of the first i positive integers) is a progressively increasing sequence.
Divisible regional privacy budget allocation
If the region of the top-down recursive search satisfies the division condition, the privacy budget is allocated to each grid location according to the recursion depth of the region. When the recursion depth is 1, that is, \({e}_{rc}^{<i>}={\varepsilon}_i,i=1\), the allocated privacy budget is ε1 if the initial grid region is divisible. When the recursion depth is not 1, the regions with smaller search ranges follow these rules.
Rule 1: From divisible region to divisible region. When the region at recursive depth i − 1 satisfies the division condition and the region at depth i also satisfies it, the two regions are highly similar. In this case, the strategy is to gradually increase the minimum privacy budget allocated to this region to achieve differential privacy. The privacy budget is adjusted as follows:
Rule 2: From indivisible region to divisible region. When the region at recursive depth i − 1 does not satisfy the division condition but the region at depth i does, certain regions become more similar as the search range decreases and require the allocation of a larger privacy budget. During the recursive search, we use a MemoryState flag to record whether the region searched at depth i − 1 satisfies the division condition: if it does not, MemoryState is recorded as True; otherwise, it is recorded as False. The differential privacy budgets of depths j = 1, ⋯, i − 1 are then summed, and the privacy budget of depth i is added. This strategy can rapidly adjust the privacy budget allocation when the region property changes, achieving differential privacy. The adjusted privacy budget is as follows:
Indivisible regional privacy budget allocation
If the region of the top-down recursive search does not satisfy the division condition, the privacy budget is allocated to each location according to the recursion depth of the region. When the recursion depth is 1, that is, \({e}_{rc}^{<i>}={\varepsilon}_{h+1-i},i=1\), the allocated privacy budget is εh if the initial grid region is indivisible. When the recursion depth is not 1, the regions with smaller search ranges follow these rules.
Rule 3: From divisible region to indivisible region. When the region at recursive depth i − 1 satisfies the division condition but the region at depth i does not, certain regions become less similar as the search range decreases, requiring the reallocation of a smaller privacy budget to achieve differential privacy. The privacy budget is reallocated as follows:
Rule 4: From indivisible region to indivisible region. When the region at recursive depth i − 1 does not satisfy the division condition and the region at depth i does not satisfy it either, the similarity of both regions is poor. In this case, the strategy is to gradually reduce the maximum privacy budget reallocated to this region to achieve differential privacy. The privacy budget is reallocated as follows:
Regional trackless privacy budget allocation
Rule 5: Trackless region. If a region produces no trajectory data within the time interval Δt, the region is completely sparse, and the privacy budget of the whole region is allocated as the total differential privacy budget ε.
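The recursion and the allocation strategy above can be sketched as follows; the division test, the increasing budget sequence, and the per-rule adjustments are simplified placeholders standing in for Definition 10, the budget sequence, and Rules 1–5 (whose exact formulas are given in the paper), so this is an illustrative skeleton rather than the authors' algorithm.

```python
import numpy as np

def is_divisible(density_block, xi=0.5):
    # Placeholder for the regional division condition (Definition 10): treat the block
    # as divisible when its cell densities stay close to the block's mean density.
    mean = density_block.mean()
    return mean > 0 and np.abs(density_block - mean).max() <= xi * mean

def allocate(budget_matrix, density, eps_seq, depth=1, axis=0):
    """Top-down binary split (depth h, N = 2^h); assigns a budget to each grid location."""
    h = len(eps_seq)
    if is_divisible(density):
        budget_matrix[...] += eps_seq[depth - 1]   # Rules 1-2: accumulate the depth budget
    else:
        budget_matrix[...] = eps_seq[h - depth]    # Rules 3-4: reset to eps_{h+1-depth}
    if depth == h:
        return
    half = density.shape[axis] // 2
    for sl in (slice(None, half), slice(half, None)):
        idx = (sl, slice(None)) if axis == 0 else (slice(None), sl)
        allocate(budget_matrix[idx], density[idx], eps_seq, depth + 1, 1 - axis)

h, total_eps = 3, 0.5
eps_seq = [total_eps * i / (h * (h + 1) / 2) for i in range(1, h + 1)]  # assumed increasing split
density = np.random.rand(2 ** h, 2 ** h)
budgets = np.zeros_like(density)
allocate(budgets, density, eps_seq)
```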
Differential privacy availability
Data are published primarily to provide users with a service for querying regional count statistics. Predicting the privacy budget matrix for future moments enables the noisy data to be published in the query system in advance, avoiding the delay of traditional methods. When a user submits a query, the query range is denoted Qm, n, and three query cases exist.
-
(1)
When the query range Qm, n is a complete region in the top-down division process. According to Definition 6, the region may also contain sub-regions with higher similarity, and the privacy budget is adjusted by searching sub-regions of smaller range, which amounts to a gradual accumulation of the privacy budget. The privacy budget of the sub-region locations is \(\varepsilon ={\sum}_{i=1}^h{\varepsilon}_i\); thus, the region is protected by differential privacy with strength \({\varepsilon}_{Q_{m,n}}={\sum}_{i=1}^h{\varepsilon}_i\).
-
(2)
When the query range Qm, n covers multiple sub-regions of the same depth in the top-down division, the range is the union of multiple regions. According to Definition 7, each division algorithm satisfies εi-differential privacy for the corresponding region location at depth i. In particular, the privacy budget of a divisible region is \(\varepsilon ={\sum}_{j=1}^{i-1}{\varepsilon}_j+{\varepsilon}_i\), and the privacy budget of an indivisible region is ε = εh + 1 − i. The region is protected by differential privacy with strength \({\varepsilon}_{Q_{m,n}}={\sum}_{j=1}^i{\varepsilon}_j+{\varepsilon}_{h+1-i}>\mathit{\max}\left\{{\varepsilon}_i\right\}\).
-
(3)
When the query range Qm, n is a trackless region. According to the trackless-region rule (Rule 5), the privacy budget for this region is allocated as the total differential privacy budget ε. As a trackless region does not disclose the user's privacy, adding noise there is unnecessary. Moreover, when the user queries multiple time intervals {Δt1, ⋯, Δtx} of the same range Qm, n, adding noise in this region would reduce the availability of privacy protection at the other times Δ(t + s), where s = 1, ⋯, x − t. Therefore, the region at that moment can be published without adding Laplace noise.
Model design by top-down
Combining privacy budget matrix prediction with the data publishing method, given a granularity m, the region is divided recursively from the top down, and the deep learning model predicts the privacy budget matrix for future moments. Laplace noise is then added according to the privacy budget allocation strategies for divisible and indivisible regions; to achieve differential privacy protection more effectively, no noise is added in trackless regions. Figure 4 illustrates the main process of the proposed privacy budget matrix prediction with top-down division to achieve differentially private data publishing. The process is detailed as follows:
-
1)
Integration of data processes
-
(1)
The trajectory data are processed in days to obtain the collection {Ti | i = 1, ⋯, d}.
-
(2)
Data Ti are generated into a fine-grained collection {Ti, j | i = 1, ⋯, d, j = 1, ⋯, x} according to the time interval Δt.
-
(3)
The fine-grained data Ti, j are divided uniformly into a 2D grid of m × m; m is the granularity of division.
-
(4)
The count of the 2D grid region is determined to obtain the count matrix collection {Ci, j | i = 1, ⋯, d, j = 1, ⋯, x}.
-
(5)
The density of the 2D grid region is calculated to obtain the density matrix collection {Di, j | i = 1, ⋯, d, j = 1, ⋯, x}.
-
(6)
Regions are divided by the top-down division, and privacy budgets are allocated to get the privacy budget matrix collection {Ei, j | i = 1, ⋯, d, j = 1, ⋯, x}.
-
(7)
The privacy budget matrix collection is organized into a three-dimensional matrix collection {Mi, j | i = 1, ⋯, d, j = 1, ⋯, x} in chronological order.
-
2)
Prediction of the privacy budget matrix based on deep learning
-
(1)
The spatiotemporal data are inputted into deep learning models and trained iteratively.
-
(2)
The privacy budget matrix Md + 1, j for day d + 1 is predicted.
-
3)
Trajectory data publishing based on differential privacy
-
(1)
Laplace noise is distributed for each region location, i.e., \(cn{t}_{rc}^{\ast }= Lap\left(\frac{S}{\varepsilon_k}\right)\), where k ∈ {1, ⋯, h}, εk is the privacy budget allocated to the location, and S is max{cntrc} (see the sketch after this list).
-
(2)
The count matrix Cd + 1, j of the present data is calculated, and Laplace noise is added.
-
(3)
The effectiveness of differential privacy protection is evaluated, and the data are published.
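A short sketch of step 3), assuming the predicted privacy budget matrix E holds one εk per grid location and that the sampled Laplace noise is added to the true counts before release; the names, the rounding, and the trackless-region mask are illustrative only.

```python
import numpy as np

def publish_with_predicted_budget(counts, budget_matrix, trackless=None,
                                  rng=np.random.default_rng(0)):
    """Add per-location Laplace noise Lap(S / eps_rc) to the counts and release them."""
    sensitivity = counts.max()                        # S = max{cnt_rc}, as in step (1)
    noisy = counts + rng.laplace(0.0, sensitivity / budget_matrix)
    if trackless is not None:
        noisy[trackless] = counts[trackless]          # trackless regions: publish as-is
    return np.round(noisy)

# Example: a 64 x 64 count matrix and a predicted budget matrix for the next interval
counts = np.random.poisson(3.0, size=(64, 64)).astype(float)
budgets = np.full((64, 64), 0.5)
published = publish_with_predicted_budget(counts, budgets, trackless=(counts == 0))
```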
Experiments
Data description
We evaluated the prediction performance of the deep learning model on two datasets: the Yonsei dataset and the Citi Bike dataset, as both datasets are related to trajectory density. Without loss of generality, we used the trajectory density as trajectory information in the experiments. Table 1 presents the details of both datasets.
In the experiments, the spatiotemporal trajectory data of the dimension [364, 24, 64 × 64, 1] were obtained by processing the original data, where 364 denotes the time span, 24 denotes the frequency of data publishing, 64 × 64 denotes the number of nodes, and 1 denotes the dimension of node features. The adjacency matrix was constructed according to whether an edge exists at each region location, and its dimension was [64 × 64, 64 × 64].
Evaluation metrics
Deep learning evaluation metrics
To evaluate the effectiveness of the model in privacy budget prediction, we used two practical criteria, namely regression and classification. The regression evaluation metrics were the mean absolute error (MAE) and mean squared error (MSE). The classification evaluation metric was the binary cross entropy (BCE). For the regression, let the true value of the privacy budget matrix be ytrue and the predicted value be ypred. For the classification, there are C categories, the true value of the privacy budget matrix is \({y}_c^{true}\), and the predicted value is \({y}_c^{pred}\). Let the sample size be N; then the formula for each index is expressed as:
\(MAE=\frac{1}{N}{\sum}_{i=1}^N\left|{y}_i^{true}-{y}_i^{pred}\right|\) (21)
\(MSE=\frac{1}{N}{\sum}_{i=1}^N{\left({y}_i^{true}-{y}_i^{pred}\right)}^2\) (22)
\(BCE=-\frac{1}{N}{\sum}_{i=1}^N{\sum}_{c=1}^C{y}_{i,c}^{true}\log \left({y}_{i,c}^{pred}\right)\) (23)
The smaller the values of MAE and MSE, the more accurate the prediction. Additionally, the smaller the value of BCE, the better the prediction effect.
Differential privacy evaluation metrics
To evaluate the effectiveness of differential privacy, we used three measures of published data, including query error (QE), Jensen–Shannon (JS) divergence, and Kendall coefficient.
-
(1)
Query error [37]. Given a query function f(·), f(T) denotes the correct result for a query region T, and |T| denotes the size of the query region. Let \(f\left(\overset{\sim }{T}\right)\) denote the noisy query result; the query error is then defined as follows:
-
(2)
JS divergence [39]. Given probability distribution functions P and Q for the original published data and the noise-added published data, respectively, let \(M=\frac{P+Q}{2}\), and let pi, qi be the probabilities of the distribution functions P, Q; then the JS divergence is defined as follows:
\(JS\left(P\parallel Q\right)=\frac{1}{2}{\sum}_i{p}_i\log \frac{2{p}_i}{p_i+{q}_i}+\frac{1}{2}{\sum}_i{q}_i\log \frac{2{q}_i}{p_i+{q}_i}\)
-
(3)
Kendall coefficient [40]. Given the original data X and noise-added data Y, we consider X and Y to be independently and identically distributed. When an observed sample pair satisfies (xi − xj)(yi − yj) > 0, the two pairs are concordant, and vice versa; thus, the Kendall coefficient is defined as follows:
\(\tau =\frac{1}{C_l^2}{\sum}_{i<j}\mathit{\operatorname{sgn}}\left({x}_i-{x}_j\right)\cdotp \mathit{\operatorname{sgn}}\left({y}_i-{y}_j\right)\)
where \({C}_l^2=\frac{l\left(l-1\right)}{2}\), l is the sample size, and \(\mathit{\operatorname{sgn}}(x)=\left\{\begin{array}{c}\ 1,x>0\\ {}\ 0,x=0\\ {}-1,x<0\end{array}\right.\). A code sketch of these three metrics follows this list.
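The three privacy metrics can be computed as in the sketch below, assuming NumPy/SciPy, flattened count matrices normalized into probability distributions for the JS divergence, and a plain absolute error standing in for the paper's query-error definition; the synthetic data and constants are illustrative only.

```python
import numpy as np
from scipy.stats import kendalltau

def js_divergence(p_counts, q_counts, eps=1e-12):
    """Jensen-Shannon divergence between original and noisy count distributions."""
    p = p_counts / max(p_counts.sum(), eps) + eps   # small constant avoids log(0)
    q = q_counts / max(q_counts.sum(), eps) + eps
    m = (p + q) / 2
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

original = np.random.poisson(3.0, size=64 * 64).astype(float)
noisy = original + np.random.laplace(0.0, 2.0, size=original.shape)

query_error = np.abs(noisy.sum() - original.sum())   # absolute error of one range query
js = js_divergence(original, np.clip(noisy, 0, None))
tau, _ = kendalltau(original, noisy)                 # Kendall rank correlation coefficient
print(query_error, js, tau)
```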
Model training design
There are two kinds of regions in the spatiotemporal trajectory data: divisible regions with privacy budget \(\varepsilon ={\sum}_{i=1}^h{\varepsilon}_i\) and indivisible regions with privacy budget ε = min {εi}. Influenced by geographic location, environment, and time period, trajectory data are always unevenly distributed, with dense and sparse locations in the regions. The trajectory data used in this study contained more sparse locations than dense locations, so the data were divided into a majority category (the divisible regions) and a minority category (the indivisible regions); the values of the majority category were mapped to 0 and those of the minority category to 1. With this training strategy, the spatiotemporal trajectory data were predicted on a rolling basis. The loss functions for regression and classification were MSE and BCE, respectively; the optimizer was Adam, the activation function was Elu, and both early stopping and learning rate adaptation were used to prevent overfitting. In the experiments, 80% of the data was used as the training set and the remaining 20% as the testing set. We predicted the privacy budget for the next 24 h.
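For reference, the training setup described above might be configured as follows in PyTorch; the stand-in model, data split, and hyperparameter values are placeholders, and the early-stopping loop is omitted.

```python
import torch
import torch.nn as nn

# Losses, optimizer, and learning-rate adaptation matching the training design above.
model = nn.Sequential(nn.Linear(64 * 64, 64 * 64), nn.ELU())   # stand-in for the T-GCN model
mse = nn.MSELoss()                                 # regression loss
bce = nn.BCEWithLogitsLoss()                       # classification loss on the 0/1 budget labels
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=3)

# 80/20 chronological split of the 364 x 24 sequence of budget matrices (rolling prediction)
data = torch.rand(364 * 24, 64 * 64)
split = int(0.8 * len(data))
train_set, test_set = data[:split], data[split:]
```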
Experimental results
Analysis of deep learning results
Table 2 presents the prediction results of different deep learning models on the Yonsei and Citi Bike datasets, obtained using Eqs. (21), (22), and (23). The models include the convolutional neural network (CNN) [41], LSTM [30], ConvLSTM [11], ST-LSTM [34], and T-GCN [16]. The results of the different models were analyzed [42,43,44], and we drew the following conclusions.
-
(1)
Small values of the evaluation metrics were obtained on the test sets of both datasets. Thus, the models correctly learned the temporal and spatial features of the data and achieved excellent performance.
-
(2)
Figure 5 shows that the method based on spatiotemporal features (T-GCN) achieved better prediction precision than the other models. The T-GCN model is therefore used subsequently to predict the privacy budget matrix for future moments, which facilitates obtaining the Laplace noise at each location and publishing the noisy data in advance.
Analysis of differential privacy results
To verify the privacy protection effect on the test data, we used the T-GCN model to predict the privacy budget matrix, given total privacy budgets ε = {0.1, 0.3, 0.5, 0.7, 0.9}, AG division parameter α = 0.5, and division factor ξ = 0.5. We queried the original published data and the noise-added published data to obtain the query error, JS divergence, and Kendall coefficient for the test data. The query ranges were extracted under the UG, AG, and top-down division methods. Table 3 presents the parameters for the test data and query ranges.
We evaluated the degree of protection of the dataset with different total differential privacy budgets by modifying ε using three region-division methods; the top-down method requires the prediction of data for future moments using the T-GCN model. The data for 1 day was continuously queried by time interval, and the query range was {4 × 4, 8 × 8, 16 × 16, 32 × 32, 64 × 64}; the regional privacy protection evaluation metrics were averaged for each query time to obtain the average performance of each metric. Table 4 presents the results.
Figure 6 shows that the query error and JS divergence decreased as ε increased, whereas the Kendall coefficient increased with ε. This occurs because, as ε increases, the scale of the Laplace distribution decreases and smaller noise values are sampled, reducing the added Laplace noise. The JS divergence of the added noise shows that the top-down division outperformed the other methods.
We further analyzed the degree of protection of the original data over different query ranges. Given a total differential privacy budget ε = 0.5, the average performance of each metric was obtained by gradually increasing the query range and querying one day of trajectory data using the different division methods. The results are presented in Table 5.
As shown in Fig. 7, we varied the query range to verify the performance of privacy protection, and the results show the following:
-
(1)
For the top-down division of regions, the query error, JS divergence, and Kendall coefficient are better than those of the traditional UG and AG methods because the proposed method groups similar regions when allocating the privacy budget and then adds noise accordingly. The Laplace noise added in dense regions is larger than that in sparse regions. Adding noise reasonably at each location achieves differential privacy protection and increases the usability of the dataset.
-
(2)
As the query range increases over {4 × 4, 8 × 8, 16 × 16}, dense locations are more likely to be included and more noise is added, which increases the average query error computed against the original data; the JS divergence also increases because of the growing difference between the original and noise-added data. Here, the Kendall coefficient is nearly zero because dense locations dominate the range, and the noise added to the regional locations perturbs the size-based ranking of the original data, reducing ranking consistency.
-
(3)
When the query range is the whole region, both the query error and JS divergence decrease, indicating that the region contains numerous trackless locations to which little or no Laplace noise is added. As this range is several times larger than the other query ranges, these lightly perturbed locations dilute the averages of the metrics, reducing their values. The Kendall coefficient gradually increases because, when the added noise is small, the ranking of the original data and that of the noise-added data are more consistent, which raises the indicator.
Conclusion
In this paper, we propose a new privacy budget allocation method based on deep learning and a top-down division of regions to allocate a privacy budget to each location, modeling the trajectory data with neural networks and differential privacy. First, we formulate the conditions for region division, use a recursive algorithm to divide similar regions through a top-down search, and design rules for location privacy budget allocation according to the recursion depth and division conditions. Subsequently, the spatial and temporal dependencies of the grid regions are captured using the T-GCN model, which is compared with other deep learning models; the T-GCN model, having the better prediction results, is then used to perform the spatiotemporal trajectory prediction task. Finally, Laplace noise is obtained according to the predicted privacy budget matrix and evaluated horizontally and vertically on two real datasets against the UG and AG division methods; the top-down division is more effective and practical in achieving differential privacy protection. In addition, predicting the privacy budget matrix allows the noisy data to be published in advance, preventing attackers from exploiting the publishing time gap to obtain the real trajectory data. The proposed method not only predicts the privacy budgets that should be allocated at future moments but also allocates privacy budgets more reasonably for each location while effectively achieving differential privacy protection.
Availability of data and materials
(1) The Yonsei dataset generated during this study is included in this published article (Method and system for privacy preservation of inter-trajectory correlation based on PSO [Invention]), which is available in the repository (https://download.csdn.net/download/sycong/11442172).
(2) The Citi Bike datasets generated during the current study are available in the Citi Bike System Data repository (https://www.citibikenyc.com/system-data).
Abbreviations
- CNN: Convolutional neural network
- GCN: Graph convolutional network
- GRU: Gated recurrent unit
- LSTM: Long short-term memory
- ConvLSTM: Convolutional LSTM
- ST-LSTM: Spatiotemporal LSTM
- T-GCN: Temporal graph convolutional network
- MAE: Mean absolute error
- MSE: Mean squared error
- BCE: Binary cross entropy
- UG: Uniform grid
- AG: Adaptive grid
- QE: Query error
- JS: Jensen–Shannon
References
Zhu L, Yu FR, Wang YG et al (2019) Big data analytics in intelligent transportation systems: a survey. IEEE Trans Intell Transp Syst 20(1):383–398
Dwork C (2008) Differential privacy: a survey of results. In: International conference on theory and applications of models of computation. Springer, Berlin, Heidelberg, pp 1–19
Yan Y, Cong YM, Mahmood A (2022) A deep learning-based method for statistical publishing and privacy protection of location big data. J Commun 43(01):203–216
Ahmed MS, Cook AR (1979) Analysis of freeway traffic time series data by using box-jenkins techniques. Transp Res Board 722:1–9
Wu CH, Ho JM, Lee DT (2004) Travel-time prediction with support vector regression. IEEE Trans Intell Transp Syst 5(4):276–281
Liu JJ, Yu SP (2016) A hidden Markov model-based method for spatio-temporal sequence prediction. Microcomput Appl 35(01):74–76+80. https://doi.org/10.19358/j.issn.1674-7720.2016.01.023
Huang W, Song G, Hong H, Xie K (2014) Deep architecture for traffic flow prediction: deep belief networks with multitask learning. IEEE Trans Intell Transp Syst 15(5):2191–2201
Zhou FY, Jin LP, Dong J (2017) A review of convolutional neural network research. J Comput Sci 40(06):1229–1251
Zhang J, Zheng Y, Qi D (2016) Deep spatio-temporal residual networks for citywide crowd flows prediction
Diehl F, Brunner T, Le MT et al (2019) Graph neural networks for modeling traffic participant interaction. In: 2019 IEEE intelligent vehicles symposium (IV). IEEE, p 695–701. Available: http://arxiv.org/abs/1903.01254
Yu B, Yin H, Zhu Z (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875
Jin L, Yao C, Huang XY (2008) A nonlinear artificial intelligence ensemble prediction model for typhoon intensity. Mon Weather Rev 136(12):4541–4554
Wang J, Zhu R, Liu S et al (2018) Node location privacy protection based on differentially private grids in industrial wireless sensor networks. Sensors 18(2):410
Qardaji W, Yang WN, Li NH (2013) Differentially private grids for geospatial data. In: Proceedings of the IEEE 29th international conference on data engineering, Brisbane, Australia, pp 757–768
Zhang J, Xiao X, Xie X (2016) Privtree: a differentially private algorithm for hierarchical decompositions. In: Proceedings of the 2016 international conference on management of data, pp 155–170
Zhao L, Song Y, Zhang C et al (2019) T-gcn: a temporal graph convolutional network for traffic prediction. IEEE Trans Intell Transp Syst 21(9):3848–3858
Li W, Tao W, Zhou XY, Pan ZS (2020) A review of spatio-temporal sequence prediction methods. Comput Appl Res 37(10):2881–2888. https://doi.org/10.19734/j.issn.1001-3695.2019.05.0184
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90
Tompson J, Jain A, LeCun Y, and Bregler C (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS, p 1799–1807
Hinton G, Deng L, Yu D et al (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process Mag 29(6):82–97
Ma J, Sheridan RP, Liaw A et al (2015) Deep neural nets as a method for quantitative structure–activity relationships. J Chem Inf Model 55(2):263–274
Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, p 1097–1105
He K, Zhang X, Ren S et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer, Cham, pp 234–241
Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 580–587
Begleiter R, El-Yaniv R, Yona G (2004) On prediction using variable order Markov models. J Artif Intell Res 22:385–421
Cho K, Van Merriënboer B, Gulcehre C et al (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Shi X, Chen Z, Wang H, Yeung D-Y, Wong W-k, Woo W-c (2015) Convolutional LSTM network: a machine learning approach for precipitation Nowcasting. In: NIPS, p 802–810
Wang Y, Long M, Wang J et al (2017) Predrnn: recurrent neural networks for predictive learning using spatiotemporal lstms. Adv Neural Inf Proces Syst 30:879–888
Shi X, Gao Z, Lausen L et al (2017) Deep learning for precipitation nowcasting: a benchmark and a new model. Adv Neural Inf Proces Syst 30:5617–5627
Wang X et al (2020) Traffic flow prediction via spatial-temporal graph neural network. In: Proceedings of the web conference 2020
Li Z et al (2019) A hybrid deep learning approach with GCN and LSTM for traffic flow prediction. In: 2019 IEEE intelligent transportation systems conference (ITSC), p 1929–1933
Bruna J, Zaremba W, Szlam A, Lecun Y (2013) Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203
Cho K, Merrienboer BV, Bahdanau D, Bengio Y (2014) On the properties of neural machine translation: encoder-decoder approaches. In: Proceedings of the Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, p 103–111
Dwork C, Roth A (2014) The algorithmic foundations of differential privacy. Found Trends Theor Comput Sci 9(3–4):211–407
Cormode G, Procopiuc C, Srivastava D, Shen E, Yu T (2012) Differentially private spatial decompositions. In: Proceedings of the IEEE 28th International Conference on Data Engineering, Washington, DC, USA, pp 20–31
Yu Y, Si XS, Hu CH et al (2019) A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput 31(7):1235–1270
Qiao S, Zeng Y, Zhou L, Liu Z, Ma J (2017) A secure authentication method of intelligent terminals based on Jensen-Shannon divergence. In: International Conference on Networking and Network Applications (NaNA), pp 158–163
Zhang Q (2020) Correlation test of function-based data based on Kendall’s correlation coefficient. Northeast Normal University. https://doi.org/10.27011/d.cnki.gdbsu.2020.000389
Shi Y, Tian Y, Wang Y et al (2017) Sequential deep trajectory descriptor for action recognition with three-stream CNN. IEEE Trans Multimed 19(7):1510–1520
Iwendi C, Mohan S, Khan S, Ibeke E, Ahmadian A, Ciano T (2022) Covid-19 fake news sentiment analysis. Comput Electr Eng 101:107967. https://doi.org/10.1016/j.compeleceng.2022.107967 Epub 2022 Apr 22. PMID: 35474674; PMCID: PMC9023343
Kumar RL, Khan F, Din S, Band SS, Mosavi A, Ibeke E (2021) Recurrent neural network and reinforcement learning model for COVID-19 prediction. Front Public Health 9:744100. https://doi.org/10.3389/fpubh.2021.744100 PMID: 34671588; PMCID: PMC8521000
Iwendi C, Srivastava G, Khan S et al (2020) Cyberbullying detection solutions based on deep learning architectures. Multimed Syst. https://doi.org/10.1007/s00530-020-00701-5
Funding
This research received no funding.
Author information
Authors and Affiliations
Contributions
Jing Xiong: proposed the real-time trajectory privacy protection method based on improved differential privacy and a deep learning model; formulated the grid division conditions for regions and designed the privacy budget allocation rules; justified the privacy protection; implemented the neural network code with the deep learning frameworks Torch and TensorFlow; conducted extensive experiments to verify the deep learning results and selected a suitable set of parameters as the final model parameters; and demonstrated the effectiveness of the proposed method through the differential privacy results analysis. Hong Zhu: established the experimental environment (Python 3.7, TensorFlow 2.8, and Torch 1.11); selected the two commonly used trajectory datasets, Yonsei and Citi Bike; and processed the data into the required format. The author(s) read and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no conflict of interest.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Xiong, J., Zhu, H. Real-time trajectory privacy protection based on improved differential privacy method and deep learning model. J Cloud Comp 11, 57 (2022). https://doi.org/10.1186/s13677-022-00332-3
DOI: https://doi.org/10.1186/s13677-022-00332-3