Using Support Vector Machines for Generating Synthetic Datasets

Jörg Drechsler

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6344)

Included in the following conference series:

International Conference on Privacy in Statistical Databases

Abstract

Generating synthetic datasets is an innovative approach to data dissemination. Values at risk of disclosure, or even the entire dataset, are replaced with multiple draws from statistical models. The quality of the released data depends strongly on the ability of these models to capture important relationships found in the original data. Defining useful models for complex survey data can be difficult and cumbersome. One possible way to reduce the modeling burden for data-disseminating agencies is to rely on machine learning tools to reveal important relationships in the data.

This paper presents an initial investigation into whether support vector machines can be used to generate synthetic datasets. The application is limited to categorical data, but extensions to continuous data should be straightforward. I briefly describe the concept of support vector machines and the adjustments necessary for synthetic data generation. I evaluate the performance of the suggested algorithm using a real dataset, the IAB Establishment Panel. The results indicate that some data utility improvements might be achievable using support vector machines. However, these improvements come at the price of an increased disclosure risk compared to standard parametric modeling, and more research is needed to find ways of reducing this risk. Some ideas for achieving this goal are provided in the discussion at the end of the paper.
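
The idea of replacing at-risk categorical values with multiple draws from a model can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that some fitted classifier (for instance, a support vector machine with probability outputs) is available as a function returning class probabilities for a record. The function `synthesize_categorical`, the stand-in `toy_proba`, and the column names `size` and `training` are all hypothetical and chosen only for illustration.

```python
import random
from typing import Callable, Dict, List

Record = Dict[str, str]
# Hypothetical interface: maps a record to {category: probability}.
ProbaFn = Callable[[Record], Dict[str, float]]


def synthesize_categorical(records: List[Record], target: str,
                           predict_proba: ProbaFn, m: int = 5,
                           seed: int = 0) -> List[List[Record]]:
    """Return m partially synthetic copies of `records`.

    In each copy, the value of `target` is replaced by a random draw
    from the class probabilities supplied by `predict_proba`; all
    other variables are left unchanged.
    """
    rng = random.Random(seed)
    synthetic_sets = []
    for _ in range(m):
        synthetic = []
        for rec in records:
            probs = predict_proba(rec)
            categories = list(probs)
            weights = [probs[c] for c in categories]
            new_rec = dict(rec)  # copy so the original is untouched
            new_rec[target] = rng.choices(categories, weights=weights, k=1)[0]
            synthetic.append(new_rec)
        synthetic_sets.append(synthetic)
    return synthetic_sets


def toy_proba(rec: Record) -> Dict[str, float]:
    # Assumption: stands in for probabilities from a fitted classifier.
    if rec["size"] == "large":
        return {"yes": 0.8, "no": 0.2}
    return {"yes": 0.3, "no": 0.7}


original = [
    {"size": "large", "training": "yes"},
    {"size": "small", "training": "no"},
    {"size": "large", "training": "no"},
]
released = synthesize_categorical(original, "training", toy_proba, m=3, seed=42)
```

Drawing from the predicted probabilities, rather than always releasing the most likely category, is what makes each of the m copies a random draw from the model instead of a deterministic prediction.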

Author information

Authors and Affiliations

Institute for Employment Research, Regensburger Str. 104, 90478, Nuremberg, Germany
Jörg Drechsler

Editor information

Editors and Affiliations

Department of Computer Engineering and Mathematics, Universitat Rovira i Virgili, UNESCO Chair in Data Privacy, Av. Països Catalans 26, E-43007, Tarragona, Catalonia
Josep Domingo-Ferrer
Department of Informatics, Ionian University, Plateia Tsirigoti 7, 49100, Kerkyra, Greece
Emmanouil Magkos

Cite this paper

Drechsler, J. (2010). Using Support Vector Machines for Generating Synthetic Datasets. In: Domingo-Ferrer, J., Magkos, E. (eds) Privacy in Statistical Databases. PSD 2010. Lecture Notes in Computer Science, vol 6344. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15838-4_14

DOI: https://doi.org/10.1007/978-3-642-15838-4_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15837-7
Online ISBN: 978-3-642-15838-4
eBook Packages: Computer Science, Computer Science (R0)
