Abstract
In this paper, we tackle the problem of constructing a differentially private synopsis for classification analysis. Several state-of-the-art methods follow the structure of existing classification algorithms and are all iterative, which is suboptimal due to locally optimal choices and the division of the privacy budget among many sequentially composed steps. We propose PrivPfC, a new differentially private method for releasing data for classification. The key idea underlying PrivPfC is to privately select, in a single step, a grid that partitions the data domain into a number of cells. This selection is done using the exponential mechanism with a novel quality function, which maximizes the expected number of correctly classified records by a histogram classifier. PrivPfC supports both binary and multiclass classification. Through extensive experiments on real datasets, we demonstrate PrivPfC's superiority over the state-of-the-art methods.
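PrivPfC's single private selection step is an instance of the exponential mechanism. As a minimal illustrative sketch (not the paper's implementation — the candidate list and the `quality` function below are hypothetical stand-ins for the paper's grid pool and grid-quality function), selecting one candidate with probability proportional to \(\exp (\epsilon \cdot q / (2\varDelta ))\) can be written as:

```python
import math
import random

def exponential_mechanism(candidates, quality, epsilon, sensitivity):
    """Pick one candidate with probability proportional to
    exp(epsilon * quality(c) / (2 * sensitivity))."""
    scores = [quality(c) for c in candidates]
    m = max(scores)  # subtract the max score for numerical stability
    weights = [math.exp(epsilon * (s - m) / (2.0 * sensitivity)) for s in scores]
    total = sum(weights)
    r = random.uniform(0.0, total)
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]  # guard against floating-point round-off
```

With the grid-quality sensitivity of 1.1 derived in the appendix, `sensitivity=1.1` would be the natural setting for grid selection.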
References
Asuncion, A., Newman, D.: UCI machine learning repository (2010)
Bayardo, R.J., Agrawal, R.: Data privacy through optimal \(k\)-anonymization. In: ICDE, pp. 217–228 (2005)
Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, Secaucus (2006)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the SuLQ framework. In: PODS, pp. 128–138 (2005)
Chang, C.C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
Chaudhuri, K., Monteleoni, C.: Privacy-preserving logistic regression. In: NIPS, pp. 289–296 (2008)
Chaudhuri, K., Monteleoni, C., Sarwate, A.D.: Differentially private empirical risk minimization. J. Mach. Learn. Res. 12, 1069–1109 (2011)
Cormode, G., Srivastava, D., Li, N., Li, T.: Minimizing minimality and maximizing utility: analyzing method-based attacks on anonymized data. PVLDB 3(1–2), 1045–1056 (2010)
Devroye, L., Györfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Applications of Mathematics. Springer, Berlin (1996)
Dwork, C.: Differential privacy. In: ICALP, pp. 1–12 (2006)
Dwork, C., McSherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: TCC, pp. 265–284 (2006)
Fredrikson, M., Jha, S., Ristenpart, T.: Model inversion attacks that exploit confidence information and basic countermeasures. In: CCS, pp. 1322–1333 (2015)
Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D., Ristenpart, T.: Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In: USENIX Security Symposium, pp. 17–32 (2014)
Friedman, A., Schuster, A.: Data mining with differential privacy. In: KDD, pp. 493–502 (2010)
Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: ICDE, pp. 205–216 (2005)
Geng, X., Liu, T.Y., Qin, T., Li, H.: Feature selection for ranking. In: SIGIR, pp. 407–414 (2007)
Hay, M., Machanavajjhala, A., Miklau, G., Chen, Y., Zhang, D.: Principled evaluation of differentially private algorithms using dpbench. In: SIGMOD, pp. 139–154 (2016)
Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: KDD, pp. 279–288 (2002)
Jagannathan, G., Pillaipakkamnatt, K., Wright, R.N.: A practical differentially private random decision tree classifier. Trans. Data Priv. 5, 273–295 (2012)
Kotz, S., Kozubowski, T., Podgorski, K.: The Laplace Distribution and Generalizations: A Revisit with Applications to Communications, Economics, Engineering, and Finance. Springer, Berlin (2001)
LeFevre, K., DeWitt, D., Ramakrishnan, R.: Incognito: efficient full-domain \(k\)-anonymity. In: SIGMOD, pp. 49–60 (2005)
LeFevre, K., DeWitt, D., Ramakrishnan, R.: Mondrian multidimensional \(k\)-anonymity. In: ICDE, p. 25 (2006)
Lei, J.: Differentially private m-estimators. In: NIPS, pp. 361–369 (2011)
McSherry, F., Talwar, K.: Mechanism design via differential privacy. In: FOCS, pp. 94–103 (2007)
Minnesota Population Center: Integrated Public Use Microdata Series, International: Version 6.5 [dataset]. University of Minnesota, Minneapolis (2017). https://doi.org/10.18128/D020.V6.5
Mohammed, N., Chen, R., Fung, B.C.M., Yu, P.S.: Differentially private data release for data mining. In: KDD, pp. 493–501 (2011)
Qardaji, W., Yang, W., Li, N.: Differentially private grids for geospatial data. In: ICDE, pp. 757–768 (2013)
Qardaji, W., Yang, W., Li, N.: Understanding hierarchical methods for differentially private histograms. PVLDB 6(14), 1954–1965 (2013)
Ruggles, S., Genadek, K., Goeken, R., Grover, J., Sobek, M.: Integrated Public Use Microdata Series: Version 7.0 [dataset]. University of Minnesota, Minneapolis (2017). https://doi.org/10.18128/D010.V7.0
Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22–26, 2017, pp. 3–18 (2017)
Su, D., Cao, J., Li, N., Bertino, E., Jin, H.: Differentially private k-means clustering. In: Proceedings of the Sixth ACM on Conference on Data and Application Security and Privacy, CODASPY 2016, New Orleans, LA, USA, March 9–11, 2016, pp. 26–37 (2016)
Su, D., Cao, J., Li, N., Bertino, E., Lyu, M., Jin, H.: Differentially private k-means clustering and a hybrid approach to private optimization. ACM Trans. Priv. Secur. 20(4), 16:1–16:33 (2017)
Therneau, T.M., Atkinson, B.: Package: rpart (2014). http://cran.r-project.org/web/packages/rpart/rpart.pdf
Vinterbo, S.A.: Differentially private projected histograms: construction and use for prediction. In: ECML PKDD'12, pp. 19–34 (2012)
Wong, R.C.W., Fu, A.W.C., Wang, K., Pei, J.: Minimality attack in privacy preserving data publishing. In: VLDB, pp. 543–554 (2007)
Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: ICML, pp. 412–420 (1997)
Zhang, J., Cormode, G., Procopiuc, C.M., Srivastava, D., Xiao, X.: PrivBayes: private data release via Bayesian networks. In: SIGMOD '14, pp. 1423–1434 (2014)
Zhang, J., Xiao, X., Yang, Y., Zhang, Z., Winslett, M.: PrivGene: differentially private model fitting using genetic algorithms. In: SIGMOD '13, pp. 665–676 (2013)
Zhang, J., Zhang, Z., Xiao, X., Yang, Y., Winslett, M.: Functional mechanism: regression analysis under differential privacy. PVLDB 5(11), 1364–1375 (2012)
Acknowledgements
We thank the reviewers for their valuable comments. This paper is based upon work supported by the United States National Science Foundation under grants CNS-1116991 and CNS-1640374, by the Key Laboratory on High Performance Computing, Anhui Province, by NSFC grants 61672486, 61672480, and 11671376, and by the Key Program of NSFC under grant 71631006.
Appendix
Lemma 3 gives the distribution of the difference of two i.i.d. Laplace random variables, which will be used in the proof of Lemma 1.
Lemma 3
[20] Let \(Z_1\) and \(Z_2\) be two i.i.d. random variables that follow the Laplace distribution with mean 0 and scale \(\frac{1}{\epsilon }\). Then the density of their difference \(Y = Z_1 - Z_2\) is
\(f_Y(y) = \frac{\epsilon }{4} e^{-\epsilon |y|} \left( 1 + \epsilon |y| \right) ,\)
and the corresponding cumulative distribution function is
\(F_Y(y) = \begin{cases} \frac{e^{\epsilon y}}{2}\left( 1-\frac{\epsilon y}{2}\right) , & y \le 0, \\ 1-\frac{e^{-\epsilon y}}{2}\left( 1+\frac{\epsilon y}{2}\right) , & y \ge 0. \end{cases}\)
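These closed forms are standard for the difference of two i.i.d. Laplace variables. As an illustrative numerical sanity check (a sketch, not from the paper), the CDF \(F_Y(y) = 1-\frac{e^{-\epsilon y}}{2}(1+\frac{\epsilon y}{2})\) for \(y \ge 0\), and its mirror image for \(y \le 0\), can be cross-checked against a Monte Carlo estimate:

```python
import math
import random

def laplace_diff_cdf(y, eps):
    """Closed-form CDF of Y = Z1 - Z2, with Z1, Z2 i.i.d. Laplace(0, 1/eps):
    F(y) = (e^{eps y}/2)(1 - eps y/2)      for y <= 0,
    F(y) = 1 - (e^{-eps y}/2)(1 + eps y/2) for y >= 0."""
    if y >= 0:
        return 1.0 - 0.5 * math.exp(-eps * y) * (1.0 + eps * y / 2.0)
    return 0.5 * math.exp(eps * y) * (1.0 - eps * y / 2.0)

def empirical_diff_cdf(y, eps, n=200_000, seed=7):
    """Monte Carlo estimate of the same CDF; a Laplace(0, 1/eps) sample
    is generated as the difference of two i.i.d. Exp(eps) samples."""
    rng = random.Random(seed)
    lap = lambda: rng.expovariate(eps) - rng.expovariate(eps)
    return sum(lap() - lap() <= y for _ in range(n)) / n
```

The two estimates agree to within Monte Carlo error, which supports the piecewise form used throughout the proof.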
Proof of Lemma 1
Proof
We show that the global sensitivity of the grid quality function (Eq. 1) in the binary classification setting can be safely bounded by 1.1.
Given a dataset D, without loss of generality we assume that D has two class labels \(\{1,2\}\) and that the neighboring dataset of D is \(D^{\prime } = D - t\), where the tuple t falls into cell e and has class label 1. Given a grid g, the quality values of all its cells are the same for D and \(D'\) except for the cell e. In the cell e, let \(n_{e}^{1}\) and \({n'}_{e}^{1}\) denote the number of data points with label 1, and \(n_{e}^{2}\) and \({n'}_{e}^{2}\) the number with label 2, in D and \(D^{\prime }\), respectively, where \(n_{e}^{1}={n'}_{e}^{1}+1\) and \(n_{e}^{2}={n'}_{e}^{2}\). Each class i also has a probability \(p_{e}^{i}\) of being the dominant class in cell e, where \(i \in \{1,2\}\).
To compute the global sensitivity of the grid quality (Eq. 1), we first express the difference of the grid quality function between the neighboring datasets D and \(D'\) in terms of \(n_e^1, n_e^2, p_e^1\), and \(p_e^2\). Then we show that this difference can be bounded by 1.1.
The global sensitivity of the grid quality function for binary classification can be computed by
\(\varDelta _{\textsf {gq}} = \left| n_e^1 p_e^1 + n_e^2 p_e^2 - \left( {n'}_e^1 {p'}_e^1 + {n'}_e^2 {p'}_e^2 \right) \right| = \left| (n_e^1-n_e^2)\, p_e^1 - (n_e^1-1-n_e^2)\, {p'}_e^1 \right| ,\)
where \(p_e^1\) is the probability that Class 1 is still the dominant class after adding noise. The last equality holds because \(p_e^2=1-p_e^1\) and \(p_{e}^{'2}=1-p_{e}^{'1}\).
As for \(p_e^1\), by Lemma 3, \(p_e^1 = \Pr \left[ n_e^1 + Z_1 > n_e^2 + Z_2 \right] = \Pr \left[ Y > -(n_e^1 - n_e^2) \right] = F_Y\left( n_e^1-n_e^2\right) \), by the symmetry of Y.
Since the probability \(p_e^1\) (Eq. 10) takes different forms depending on whether \(n_e^1-n_e^2 \ge 0\), we analyze the global sensitivity \(\varDelta _{\textsf {gq}}\) in two cases: either Class 1 is the dominant class in cell e for D (\(n_e^1-n_e^2 \ge 1\)), or it is not (\(n_e^1-n_e^2 \le 0\)). We show that for \(n_e^1-n_e^2 \ge 1\) the global sensitivity is upper-bounded by 1.1, and for \(n_e^1-n_e^2 \le 0\) it is upper-bounded by 0.5. Note that the two conditions cover all cases, since \(n_e^1\) and \(n_e^{2}\) are integers. Therefore, the global sensitivity can be bounded by 1.1.
Case 1: \({\mathbf {n_e^1-n_e^2 \ge 1}}\).
In this case, \(n_e^{'1}-n_e^{'2} = n_e^1-1-n_e^2 \ge 0\). By Eqs. (9) and (10), we have
By letting \(x = n_e^1-n_e^2\), we have
where \(x \ge 1\).
For simplicity, let us consider the function \(g_{1}(x)=\frac{xe^{-\epsilon x}}{2}(1+\frac{\epsilon x}{2})\), and thus the sensitivity becomes \( \varDelta _{\textsf {gq}} = \left| 1+g_{1}(x-1)-g_{1}(x)\right| \).
Note that \(g_{1}(x)\) is differentiable and its derivative is \(g_{1}^{\prime }(x)=\frac{e^{-\epsilon x}}{4}(2-\epsilon ^2 x^2)\). Thus, by Lagrange's Mean Value Theorem, for any \(x \ge 1\), there exists some \(\xi \) between \(x-1\) and x (thus \(\xi >0\)) such that \(g_{1}(x)-g_{1}(x-1)=g_{1}^{\prime }(\xi )\), and hence \(\varDelta _{\textsf {gq}} = \left| 1-g_{1}^{\prime }(\xi )\right| \).
To bound the expression above, consider another function \(h(x)=1-g_{1}^{\prime }(x)=1-\frac{e^{-\epsilon x}}{4}\left( 2-\epsilon ^2 x^2\right) \), where \(x>0\). The function h(x) reaches its maximum value 1.1 at the point \(\frac{1+\sqrt{3}}{\epsilon }\); it increases in the interval \(\left[ 0, \frac{1+\sqrt{3}}{\epsilon }\right] \) and decreases in the interval \(\left( \frac{1+\sqrt{3}}{\epsilon }, \infty \right) \). When \(x \in \left[ 0, \frac{1+\sqrt{3}}{\epsilon }\right] \), \(h(0) \le h(x) \le h\left( \frac{1+\sqrt{3}}{\epsilon }\right) \), which means \(h(x) \in [0.5, 1.1]\). When \(x \in \left( \frac{1+\sqrt{3}}{\epsilon }, \infty \right) \), h(x) decreases and lies in (1, 1.1], because \(\lim \nolimits _{x\rightarrow +\infty } h(x)=1\). Therefore, in this case, \(\varDelta _{\textsf {gq}} = h(\xi ) \le 1.1\).
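The claimed shape of \(h(x)=1-g_{1}^{\prime }(x)\) can also be verified numerically. The following sketch (illustrative, with \(\epsilon =1\); the exact maximum is \(\approx 1.089\), which the proof rounds up to 1.1) scans h on a dense grid:

```python
import math

def h(x, eps=1.0):
    # h(x) = 1 - g1'(x), where g1(x) = (x e^{-eps x} / 2)(1 + eps x / 2),
    # so g1'(x) = (e^{-eps x} / 4)(2 - eps^2 x^2).
    return 1.0 - math.exp(-eps * x) / 4.0 * (2.0 - (eps * x) ** 2)

# Dense scan of x in [0, 20] for eps = 1 to locate the maximum.
grid = [i / 1000.0 for i in range(20001)]
x_best = max(grid, key=h)
```

The scan confirms \(h(0)=0.5\), an argmax near \((1+\sqrt{3})/\epsilon \approx 2.732\), a peak just under 1.1, and \(h(x)\rightarrow 1\) for large x.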
Case 2: \({\mathbf {n_e^1-n_e^2 \le 0}}\).
In this case,
Similarly, by letting \(x = n_e^1-n_e^2\), we have
where \(x \le 0\).
Similarly to Case 1, let \(g_2(x)=\frac{xe^{\epsilon x }}{2}\left( 1-\frac{\epsilon x }{2}\right) \), and then the sensitivity becomes \(\varDelta _{\textsf {gq}} = \left| g_{2}(x)-g_{2}(x-1)\right| \).
The function \(g_2(x)\) is differentiable and its first-order derivative is \(g_{2}^{\prime }(x)=\frac{e^{\epsilon x}}{4}\left( 2-\epsilon ^2 x^2\right) \). The derivative \(g_{2}^{\prime }(x)\) decreases when \(x \in \left( -\infty , -\frac{1+\sqrt{3}}{\epsilon }\right) \) and increases when \(x\in \left[ -\frac{1+\sqrt{3}}{\epsilon }, 0\right] \); at \(x=-\frac{1+\sqrt{3}}{\epsilon }\), it reaches its minimum value \(-0.09\). Thus, \(g_{2}^{\prime }(x) \in [-0.09, 0)\) when \(x \le -\frac{1+\sqrt{3}}{\epsilon }\), because \(\lim \limits _{x\rightarrow -\infty } g_{2}^{\prime }(x)=0\), and \(g_{2}^{\prime }(x) \in [-0.09, 0.5]\) when \(x\in \left[ -\frac{1+\sqrt{3}}{\epsilon }, 0\right] \). Applying Lagrange's Mean Value Theorem to \(g_{2}(x)\), for any \(x \le 0\), there exists some \(\eta \) between \(x-1\) and x (thus \(\eta \le 0\)) such that \(g_{2}(x)-g_{2}(x-1)=g_{2}^{\prime }(\eta )\), and hence \(\varDelta _{\textsf {gq}} = \left| g_{2}^{\prime }(\eta )\right| \le 0.5\).
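The stated range of \(g_{2}^{\prime }(x)\) on \(x \le 0\) can likewise be checked numerically (an illustrative sketch with \(\epsilon =1\); the true minimum is \(\approx -0.089\), rounded to \(-0.09\) in the text):

```python
import math

def g2_prime(x, eps=1.0):
    # Derivative of g2(x) = (x e^{eps x} / 2)(1 - eps x / 2).
    return math.exp(eps * x) / 4.0 * (2.0 - (eps * x) ** 2)

# Dense scan of x in [-20, 0] for eps = 1.
grid = [-i / 1000.0 for i in range(20001)]
vals = [g2_prime(x) for x in grid]
```

The scan confirms that \(g_{2}^{\prime }\) stays within \([-0.09, 0.5]\) on the scanned range, with the maximum 0.5 attained at \(x=0\).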
In summary, by considering the above two cases, the global sensitivity of the grid quality for binary classification, \(\varDelta _{\textsf {gq}}\), is bounded by 1.1, and it reaches its maximum when Case 1 holds, at the point \(x^{*}\) obtained by setting the derivative of \(1+g_{1}(x-1)-g_{1}(x)\) to zero, i.e., where \(g_{1}^{\prime }(x^{*}-1)=g_{1}^{\prime }(x^{*})\). This completes the proof. \(\square \)
Proof of Lemma 2
Proof
Given a dataset D, without loss of generality we assume that the neighboring dataset of D is \(D^{\prime } = D - t\), where the tuple t falls into cell e. The quality values of all cells other than cell e are the same. Denote by \(l_t\) the class label of t.
Since the approximation of the grid quality (Eq. 5) involves only the top two classes, and the quality values of all cells other than cell e are the same, whether the difference of the approximated grid quality function, \(\varDelta _{\textsf {gq}}=|\textsf {gq} (D, g)-\textsf {gq} (D', g)|\), is 0 depends on whether the count of Class \(l_t\) is among the top two class counts of cell e in D and \(D'\). It is worth pointing out that the rank of Class \(l_t\) cannot rise in \(D'= D-t\), because deleting the tuple t can only decrease the count of Class \(l_t\). Therefore, we bound \(\varDelta _{\textsf {gq}}\) by considering three separate cases: (1) Class \(l_t\) is not in the top two classes in D or \(D'\), (2) Class \(l_t\) is in the top two classes in both D and \(D'\), and (3) Class \(l_t\) is in the top two classes in D but not in \(D'\). For convenience, we use \(l_t \in \{(1), (2)\}\) to denote that the class count of \(l_t\) is in the top two, and \(l_t \notin \{(1), (2)\}\) to denote that it is not.
Case 1: \(l_t \notin \{(1),(2)\}\) in both D and \(D'\). Class \(l_t\) does not rank in the top two in either D or \(D'\). In this case, deleting the tuple t does not affect the top two classes: \(n_e^{(1)}=n_e^{'(1)}, n_e^{(2)}=n_e^{'(2)}, p_e^{(1)}=p_e^{'(1)}\), and \(p_e^{(2)}=p_e^{'(2)}\). So \(\textsf {gq} (D,g)=\textsf {gq} (D',g)\), which means \(\varDelta _{\textsf {gq}}=0\).
Case 2: \(l_t \in \{(1),(2)\}\) in both D and \(D'\). Class \(l_t\) ranks in the top two in both D and \(D'\). In this case, the rank of \(l_t\) may change, so let us consider the following subcases.
Subcase 2.1: Class \(l_t\) ranks first (resp. second) in both D and \(D'\). Then, we have \(n_e^{(1)}=n_e^{'(1)}+1\) and \(n_e^{(2)}=n_e^{'(2)}\) (resp. \(n_e^{(1)}=n_e^{'(1)}\) and \(n_e^{(2)}=n_e^{'(2)}+1\) ). Similar to the proof of Lemma 1, we obtain \(\varDelta _{\textsf {gq}} \le 1.1\).
Subcase 2.2: Class \(l_t\) ranks first in D and second in \(D'\). Deleting one tuple makes the class with the highest count become the second highest, which occurs only when there is a tie between the two highest classes in D. (This tie can be resolved by the alphabetical order of class labels.) In this case, we have
Thus,
where the third equality holds because of Eq. (11) and the fourth equality holds because \(p_{e}^{(2)}=1-p_{e}^{(1)}\) and \(p_{e}^{'(2)}=1-p_{e}^{'(1)}\).
Case 3: \(l_t \in \{(1),(2)\}\) in D and \(l_t \notin \{(1),(2)\}\) in \(D'\). Class \(l_t\) ranks in the top two in D, but not in \(D'\). Similar to Subcase 2.2, this case occurs only when there is a tie between Class \(l_t\) and \(s (\ge 1)\) other classes that are not ranked in the top two. Deleting the tuple t pushes \(l_t\) out of the top two, so, by our tie-resolving rule, one of the s classes with the same count as Class \(l_t\) moves into the top two. For example, suppose the number of classes is \(k =3\) and \(n_{e}^{1} = n_{e}^{2} = n_{e}^{3} = 10\), so the top two class counts are \(n_{e}^{(1)}=n_{e}^{(2)}=10\). After removing the tuple t with label 1, we have \(n_{e}^{\prime 2} = n_{e}^{\prime 3} = 10\) and \(n_{e}^{\prime 1} = 9\). The class containing t is ranked out of the top two and the class with label 3 is ranked into the top two, so the counts of the top two classes in \(D'\) are unchanged, i.e., \(n_{e}^{'(1)}=n_{e}^{'(2)}=10\). That is to say,
Similarly, we have
where the second equality holds because of Eq. (12).
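The tie-breaking behavior in the \(k=3\) example above can be sketched with a small helper (hypothetical code; numeric label order stands in for the alphabetical tie-resolving rule):

```python
def top_two(counts):
    """Rank classes by count (descending), breaking ties by class label,
    and return the top two (label, count) pairs."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:2]

# The Case 3 example: removing one tuple with label 1 from a three-way tie.
D_counts = {1: 10, 2: 10, 3: 10}
Dp_counts = {1: 9, 2: 10, 3: 10}
```

Running `top_two` on both dictionaries shows that Class 1 drops out of the top two while the top two counts themselves stay (10, 10), which is exactly why the approximated grid quality is unchanged in this case.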
In summary, the global sensitivity for the approximation of grid quality on multiclass classification \(\varDelta _{\textsf {gq}} \le 1.1\). \(\square \)
Su, D., Cao, J., Li, N. et al. PrivPfC: differentially private data publication for classification. The VLDB Journal 27, 201–223 (2018). https://doi.org/10.1007/s00778-017-0492-3