Abstract
It is often necessary for organizations to perform data mining tasks collaboratively without giving up their own data. This necessity has led to the development of privacy preserving distributed data mining. Several protocols exist which deal with data mining methods in a distributed scenario but most of these methods handle a single data mining task. Therefore, if the participating parties are interested in more than one classification methods they will have to go through a series of distributed protocols every time, thus increasing the overhead substantially. A second significant drawback with existing methods is that they are often quite expensive due to the use of encryption operations. In this paper a method has been proposed that addresses both these issues and provides a generic approach to efficient privacy preserving classification analysis in a distributed setting with a worst-case privacy guarantee. The experimental results demonstrate the effectiveness of this method.
Similar content being viewed by others
References
Aggarwal, C.C., Yu, P.S.: A condensation approach to privacy preserving data mining. In: 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece (2004)
Aggarwal, C.C., Yu, P.S.: Privacy-Preserving Data Mining: Models and Algorithms. Springer, Berlin (2008)
Agrawal, D., Aggarwal, C.C.: On the design and quantification of privacy preserving data mining algorithms. In: 20th ACM PODS, Santa Barbara, CA, pp. 247–255 (2001)
Agrawal, S., Haritsa, J.R.: A framework for high-accuracy privacy-preserving mining. In: ICDE (2005)
Agrawal, R., Srikant, R.: Privacy preserving data mining. In: 2000 ACM SIGMOD, Dallas, TX, May 2000, pp. 439–450 (2000)
Blum, A., Dwork, C., McSherry, F., Nissim, K.: Practical privacy: the sulq framework. In: PODS (2005)
Blum, A., Ligett, K., Roth, A.: A learning theory approach to non-interactive database privacy. In: Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC’08, pp. 609–618. ACM, New York (2008). http://doi.acm.org/10.1145/1374376.1374464
Caragea, D., Silvescu, A., Honavar, V.: Decision tree induction from distributed, heterogeneous, autonomous data sources. In: Conference on Intelligent Systems Design and Applications (2003)
Chen, K., Liu, L.: A random rotation perturbation approach to privacy-preserving data classification. In: ICDM 2005, Houston, TX, November 2005
Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., Zhu, M.: Tools for privacy preserving distributed data mining. ACM SIGKDD Explor. 4, 28–34 (2002)
Dalenius, T., Reiss, S.P.: Data-swapping: a technique for disclosure control. J. Stat. Plan. Inference 6, 73–85 (1982)
Du, W., Zhan, Z.: Building decision tree classifier on private data. In: IEEE International Conference on Privacy, Security and Data Mining, Maebashi City, Japan, December 2002, pp. 1–8 (2002)
Du, W., Zhan, Z.: Using randomized response techniques for privacy preserving data mining. In: 9th ACM SIGKDD, Washington, DC, August 2003, pp. 505–510 (2003)
Dwork, C.: Differential privacy. In: ICALP, pp. 1–12 (2006)
Dwork, C.: Differential privacy: a survey of results. In: Proceedings of the 5th International Conference on Theory and Applications of Models of Computation, TAMC’08, pp. 1–19. Springer, Berlin (2008). http://dl.acm.org/citation.cfm?id=1791834.1791836
Dwork, C., Mcsherry, F., Nissim, K., Smith, A.: Calibrating noise to sensitivity in private data analysis. In: Proceedings of the 3rd Theory of Cryptography Conference, pp. 265–284. Springer, Berlin (2006)
Dwork, C., Naor, M., Reingold, O., Rothblum, G.N., Vadhan, S.: On the complexity of differentially private data release: efficient algorithms and hardness results. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC’09, pp. 381–390. ACM, New York (2009). http://doi.acm.org/10.1145/1536414.1536467
Evfimevski, A., Gehrke, J., Srikant, R.: Limiting privacy breaches in privacy preserving data mining. In: 22nd ACM PODS, San Diego, CA, June 2003, pp. 211–222 (2003)
Feldman, D., Fiat, A., Kaplan, H., Nissim, K.: Private coresets. In: Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC’09, pp. 361–370. ACM, New York (2009). http://doi.acm.org/10.1145/1536414.1536465
Fienberg, S.E., McIntyre, J.: Data-swapping: variations on a theme by Dalenius and Reiss. Tech. rep., National Institute of Statistical Sciences (2003)
Frank, A., Asuncion, A.: UCI machine learning repository (2010). http://archive.ics.uci.edu/ml
Gal, T., Chen, Z., Gangopadhyay, A.: A privacy protection model for patient data with multiple sensitive attributes. Int. J. Inf. Secur. Priv. 2(3), 28–44 (2008)
Giannella, C., Liu, K., Olsen, T., Kargupta, H.: Communication efficient construction of decision trees over heterogeneously distributed data. In: Fourth IEEE International Conference on Data Mining (2004)
Goethals, B., Laur, S., Lipmaa, H., Mielikainen, T.: On secure scalar product computation for privacy-preserving data mining. In: The 7th Annual International Conf. in Information Security and Cryptology (2004)
Huang, Z., Du, W., Chen, B.: Deriving private information from randomized data. In: SIGMOD 2005, Baltimore, MD, June 2005, pp. 37–48 (2005)
Jagannathan, G., Wright, R.N.: Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In: SIGKDD’05, Chicago, IL, pp. 593–599 (2005)
Kantarcioglu, M., Clifton, C.: Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Trans. Knowl. Data Eng. 16(9), 1026–1037 (2004)
Kantarcioglu, M., Vaidya, J.: Privacy preserving naïve Bayes classifier for horizontally partitioned data. In: IEEE ICDM Workshop on Privacy Preserving Data Mining, Melbourne, FL, November 2003, pp. 3–9 (2003)
Kargupta, H., Park, B.H.: A Fourier spectrum-based approach to represent decision trees for mining data streams in mobile environments. IEEE Trans. Knowl. Data Eng. 16(2), 216–229 (2004)
Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: ICDM, pp. 99–106 (2003)
Kim, J.J., Winkler, W.E.: Multiplicative noise for masking continuous data. Tech. rep. 2003-01, Statistical Research Division, U.S. Bureau of the Census, April 2003
Kim, D., Chen, Z., Gangopadhyay, A.: Optimizing privacy-accuracy tradeoff for privacy preserving distance-based classification. Int. J. Inf. Secur. Priv. 6(2), 16–33 (2012)
Li, C., Hay, M., Rastogi, V., Miklau, G., McGregor, A.: Optimizing linear counting queries under differential privacy. In: Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’10, pp. 123–134. ACM, New York (2010). http://doi.acm.org/10.1145/1807085.1807104
Lin, X., Clifton, C., Zhu, Y.: Privacy preserving clustering with distributed em mixture modeling. Int. J. Knowl. Inf. Syst. 8(1), 68–81 (2005)
Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Advances in Cryptology (CRYPTO’00). Lecture Notes in Computer Science, vol. 180, pp. 36–53 (2000)
Liu, K., Giannella, C., Kargupta, H.: An attacker’s view of distance preserving maps for privacy preserving data mining. In: The 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’06) (2006)
Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Trans. Knowl. Data Eng. 18(1), 92–106 (2006)
Ma, D., Sivakumar, K., Kargupta, H.: Privacy sensitive Bayesian network parameter learning. In: 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, November 2004, pp. 487–490 (2004)
Mcsherry, F.: Mechanism design via differential privacy. In: Proceedings of the 48th Annual Symposium on Foundations of Computer Science (2007)
Mukherjee, S., Chen, Z., Gangopadhyay, A.: A privacy preserving technique for Euclidean distance-based mining algorithms using Fourier-related transforms. VLDB J. 15(4), 292–315 (2006)
Mukherjee, S., Banerjee, M., Chen, Z., Gangopadhyay, A.: A privacy preserving technique for distance-based classification with worst case privacy guarantees. Data Knowl. Eng. 66(2), 264–288 (2008)
Mukherjee, S., Chen, Z., Gangopadhyay, A.: A fuzzy programming approach for data reduction and privacy in distance based mining. Int. J. Inf. Comput. Secur. (in press)
Oliveira, S., Zaïane, O.R.: Privacy preserving clustering by data transformation. In: 18th Brazilian Symposium on Databases, pp. 304–318 (2003)
Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: EUROCRYPT, pp. 223–238. Springer, Berlin (1999)
Roth, A., Roughgarden, T.: Interactive privacy via the median mechanism. In: Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC’10, pp. 765–774. ACM, New York (2010). http://doi.acm.org/10.1145/1806689.1806794
Sweeney, L.: K-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)
Vaidya, J.: Towards a holistic approach to privacy-preserving data analysis. In: Workshop on Secure Knowledge Management (2008)
Vaidya, J.S., Clifton, C.: Privacy preserving association rule mining in vertically partitioned data. In: 8th ACM SIGKDD, Edmonton, Canada, July 2002, pp. 639–644 (2002)
Vaidya, J.S., Clifton, C.: Privacy-preserving k-means clustering over vertically partitioned data. In: 9th ACM SIGKDD, Washington, DC, August 2003, pp. 206–215 (2003)
Vaidya, J., Clifton, C.: Privacy-preserving decision trees over vertically partitioned data. In: Proceedings of the IFIP WG 11.3 International Conference on Data and Applications Security, pp. 139–152. Springer, Berlin (2005)
Vaidya, J., Clifton, C., Zhu, M.: Privacy Preserving Data Mining. Springer, Berlin (2005)
Vaidya, J., Kantarcioglu, M., Clifton, C.: Privacy-preserving naïve Bayes classification. VLDB J. 17(4), 879–898 (2008)
Warner, S.: Randomized response: a survey technique for eliminating evasive answer bias. J. Am. Stat. Assoc. 60(309), 63–69 (1965)
Wright, R., Yang, Z.: Privacy-preserving Bayesian network structure computation on distributed heterogeneous data. In: 10th ACM SIGKDD Conference (SIGKDD’04), Seattle, WA, August 2004, pp. 713–718 (2004)
Xiao, X., Bender, G., Hay, M., Gehrke, J.: Ireduct: differential privacy with reduced relative errors. In: Proceedings of the 2011 International Conference on Management of Data, SIGMOD’11, pp. 229–240. ACM, New York (2011). http://doi.acm.org/10.1145/1989323.1989348
Xiao, X., Wang, G., Gehrke, J.: Differential privacy via wavelet transforms. IEEE Trans. Knowl. Data Eng. 23, 1200–1214 (2011). doi:10.1109/TKDE.2010.247
Yao, A.C.: How to generate and exchange secrets. In: 27th IEEE Symposium on Foundations of Computer Science, pp. 162–167 (1986)
Yu, H., Jiang, X., Vaidya, J.: Privacy-preserving svm using nonlinear kernels on horizontally partitioned data. In: Proceedings of the 2006 ACM Symposium on Applied Computing, SAC’06, pp. 603–610 (2006)
Yu, H., Vaidya, J., Jiang, X.: Privacy-preserving svm classification on vertically partitioned data. In: PAKDD, pp. 647–656 (2006)
Zhu, Y., Liu, L.: Optimal randomization for privacy preserving data mining. In: KDD, pp. 761–766 (2004)
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Elena Ferrari.
Appendix
Appendix
In the appendix we give brief summary of the SMC protocols we use in this paper.
Secure sum protocol
The secure sum protocol [10] uses homomorphic encryption. More specifically, suppose we want to compute the sum of x i (1≤i≤r) where x i belongs to party i. Two parties are first randomly selected. Without loss of generality, suppose party 1 and 2 are chosen. Party 1 then generates a pair of public and private keys for a homomorphic encryption scheme. It then sends the public key to the other parties. Each party uses the public key to encrypt x i and then sends the encrypted x i to party 2. Party 2 then computes the encrypted sum without decrypting each individual x i using the property of homomorphic encryption. Party 2 then sends the result back to party 1, which will decrypt the result and distribute the sum to each party.
The secure sum protocol is used in our algorithm for horizontally partitioned case to compute number of elements in class, class mean, global mean, global S b , and global S w , all of which are essentially sum of local shares. The secure sum protocol is also used by our algorithm for vertically partitioned case to compute the result of LDA transform (\(\hat{D}_{t}\)).
The secure sum protocol just needs r encryption operations (r as the number of parties) and 1 decryption operation. So both the communication overhead and computational overhead is O(r).
Secure scalar product
We use the secure scalar product protocol that was first proposed in [24] and later used in [26]. Suppose two parties A and B each has a vector x A and x B. We want to compute the scalar product of x A⋅x B. The protocol also uses homomorphic encryption. Party A generates a pair of public and private keys and sends the public key to B. Party A then computes encrypted value for x A and sends it to Party B. Party B selects a random value s B and uses the property of homomorphic encryption to compute the encryption of x A⋅x B−s B and sends it back to Party A. Party A decrypts the result and gets s A =x A⋅x B−s B . Now each party has a random share (s A and s B ) for the scalar product.
The secure scalar product is used in the secure K-Means clustering [26] algorithm to compute distances. We also use it in our algorithm for the vertically partitioned case to compute S b and S w . Let us first consider S b . Let vector x i (1≤i≤m)be a 1 by c vector where the j-th value x ij is the value of \(\mu_{j}-\bar{x}\) on column i. So the value at i-th row and j-th column of S b is essentially scalar product of x i and x j . Similarly, for S w , suppose y lj is a 1 by |c l | vector where the i-th element is the value of the i-th row in class c l on column j minus the value of μ l on column j. The value at i-th row and j-th column of S w is the sum of a series of c (c as the number of classes) scalar products where each scalar product equals y li ⋅y lj (i.e., the scalar product computed in class c l but over the column i and j).
The protocol requires n encryption operations and one decryption operation for a length n vector. Since the encrypted value of a data element may be used in several scalar product computations (e.g., column 1 will be used to compute the scalar product with column 2, 3, …, ), we can reuse the encrypted value. Thus each data element needs to be encrypted at most once. Our algorithm requires O(mn) encryption operations to compute S b and S w in the vertically partitioned case.
Secure comparison protocol
We use the secure comparison protocol to find the closest cluster and the range of each column in our algorithms for horizontally and vertically partitioned cases. The protocol was proposed by Yao at [57].
Secure K-Means clustering
We use the secure K-Means clustering protocol proposed in [26]. This protocol also calls secure sum, secure scalar protocol, and Yao’s secure comparison protocol. The secure K-Means clustering protocol requires O(sngrr k ) encryption operations where n is number of rows in data, s is number of columns after LDA, r is number of parties, g is the number of clusters, and r k is the number of iterations in K-Means clustering.
The overhead of these SMC protocols are dominated by the cost of encryption. In the horizontally partitioned case, our algorithm requires O(mcr+sgrr k ) encryption operations. Note that this is irrelevant to the number of rows in the data set because each party can do most computations locally. So the algorithm is efficient. In the vertically partitioned case, our algorithm requires O(mn+sngrr k ) encryption operations. Encryption can be quite expensive in practice, especially for large data sets under the vertically partitioned case. However, as stated in [24], the current hardware and some optimization tricks (such as reusing encrypted values) can make the computational overhead tolerable.
Rights and permissions
About this article
Cite this article
Banerjee, M., Chen, Z. & Gangopadhyay, A. A generic and distributed privacy preserving classification method with a worst-case privacy guarantee. Distrib Parallel Databases 32, 5–35 (2014). https://doi.org/10.1007/s10619-013-7126-6
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-013-7126-6