Abstract
Generating synthetic datasets is an innovative approach to data dissemination. Values at risk of disclosure, or even the entire dataset, are replaced with multiple draws from statistical models. The quality of the released data depends strongly on how well these models capture the important relationships in the original data. Defining useful models for complex survey data can be difficult and cumbersome. One way to reduce the modeling burden for data-disseminating agencies is to rely on machine learning tools to reveal important relationships in the data.
This paper presents an initial investigation of whether support vector machines can be used to generate synthetic datasets. The application is limited to categorical data, but extensions to continuous data should be straightforward. I briefly describe the concept of support vector machines and the adjustments necessary for synthetic data generation. I evaluate the performance of the suggested algorithm using a real dataset, the IAB Establishment Panel. The results indicate that some improvements in data utility might be achievable using support vector machines. However, these improvements come at the price of increased disclosure risk compared to standard parametric modeling, and more research is needed to find ways to reduce this risk. Some ideas for achieving this goal are provided in the discussion at the end of the paper.
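The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of replacing a categorical variable with multiple draws from a predictive distribution. It assumes that per-record class probabilities are already available (for instance from a classifier with probabilistic outputs); the function name `draw_synthetic_values`, the parameter `m`, and the example probabilities are all hypothetical and not taken from the paper.

```python
import random
from typing import Dict, List, Sequence


def draw_synthetic_values(
    class_probs: Sequence[Dict[str, float]],
    m: int = 5,
    seed: int = 0,
) -> List[List[str]]:
    """Return m synthetic copies of one categorical variable.

    class_probs[i] maps each category to its estimated probability
    for record i (assumed to come from some fitted model).
    """
    rng = random.Random(seed)  # seeded so the draws are reproducible
    synthetic = []
    for _ in range(m):
        copy = []
        for probs in class_probs:
            categories = list(probs)
            weights = [probs[c] for c in categories]
            # One random draw per record, weighted by the class probabilities
            copy.append(rng.choices(categories, weights=weights, k=1)[0])
        synthetic.append(copy)
    return synthetic


# Hypothetical probabilities for three records (illustration only)
example_probs = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.2, "no": 0.8},
    {"yes": 1.0, "no": 0.0},
]
datasets = draw_synthetic_values(example_probs, m=3, seed=42)
```

Each of the `m` lists in `datasets` is one synthetic version of the variable. Drawing from the probabilities, rather than always releasing the most likely category, is what makes the replacement a random draw from a model instead of a deterministic prediction.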
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Drechsler, J. (2010). Using Support Vector Machines for Generating Synthetic Datasets. In: Domingo-Ferrer, J., Magkos, E. (eds) Privacy in Statistical Databases. PSD 2010. Lecture Notes in Computer Science, vol 6344. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15838-4_14
DOI: https://doi.org/10.1007/978-3-642-15838-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15837-7
Online ISBN: 978-3-642-15838-4
eBook Packages: Computer Science, Computer Science (R0)