research-article

Open access

Preserving Missing Data Distribution in Synthetic Data

Authors:

Jaideep Vaidya

WWW '23: Proceedings of the ACM Web Conference 2023

Pages 2110 - 2121

https://doi.org/10.1145/3543507.3583297

Published: 30 April 2023

Abstract

Data from Web artifacts and from the Web is often sensitive and cannot be directly shared for data analysis. Therefore, synthetic data generated from the real data is increasingly used as a privacy-preserving substitute. In many cases, real data from the web has missing values where the missingness itself possesses important informational content, which domain experts leverage to improve their analysis. However, this information content is lost if either imputation or deletion is used before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observable and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real world datasets demonstrates the effectiveness of our approach.
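
The abstract's claim that imputation or deletion discards the information carried by missingness can be illustrated with a small sketch. This is not the paper's method; it is a toy example under stated assumptions: the field names `label` and `lab_value`, the missingness probabilities, and all helper functions are hypothetical and chosen only for illustration. The sketch builds records in which a value is missing more often for one class, then shows that the per-class missing rate (the signal a domain expert might use) collapses to zero after mean imputation or listwise deletion.

```python
import random


def make_toy_rows(n=1000, seed=0):
    """Build toy records where 'lab_value' is missing more often when label == 1.

    The probabilities 0.6 and 0.1 are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        value = rng.gauss(50.0, 10.0)
        p_missing = 0.6 if label == 1 else 0.1
        if rng.random() < p_missing:
            value = None  # None marks a missing entry
        rows.append({"label": label, "lab_value": value})
    return rows


def missing_rate_by_label(rows):
    """Return the fraction of missing 'lab_value' entries within each label group."""
    rates = {}
    for label in (0, 1):
        group = [r for r in rows if r["label"] == label]
        missing = sum(r["lab_value"] is None for r in group)
        rates[label] = missing / len(group) if group else 0.0
    return rates


def mean_impute(rows):
    """Replace every missing 'lab_value' with the mean of the observed values."""
    observed = [r["lab_value"] for r in rows if r["lab_value"] is not None]
    mean = sum(observed) / len(observed)
    return [
        {**r, "lab_value": mean if r["lab_value"] is None else r["lab_value"]}
        for r in rows
    ]


def listwise_delete(rows):
    """Drop every record that has a missing 'lab_value'."""
    return [r for r in rows if r["lab_value"] is not None]


if __name__ == "__main__":
    rows = make_toy_rows()
    print("original:", missing_rate_by_label(rows))
    print("after mean imputation:", missing_rate_by_label(mean_impute(rows)))
    print("after listwise deletion:", missing_rate_by_label(listwise_delete(rows)))
```

In the original rows the missing rate differs clearly between the two classes, so missingness alone is predictive of the label. After either preprocessing step both rates are zero, which means a generator trained on the preprocessed data has no missingness pattern left to reproduce.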

Cited By

Wang Z, Draghi B, Rotalinti Y, Lunn D, Myles P. (2024). High-Fidelity Synthetic Data Applications for Data Augmentation. Deep Learning - Recent Findings and Research. Online publication date: 29-May-2024. https://doi.org/10.5772/intechopen.113884

Pang S, Ma R, Li B, Zhou Y, Yao Y. (2024). Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs. Computer Vision – ECCV 2024, 280-297. Online publication date: 10-Nov-2024. https://doi.org/10.1007/978-3-031-73010-8_17

Index Terms

Preserving Missing Data Distribution in Synthetic Data
1. Security and privacy
  1. Database and storage security
    1. Data anonymization and sanitization

Published In

WWW '23: Proceedings of the ACM Web Conference 2023

April 2023

4293 pages

ISBN:9781450394161

DOI:10.1145/3543507

Copyright © 2023 Owner/Author.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Institutes of Health

Conference

WWW '23

Sponsor: SIGWEB

WWW '23: The ACM Web Conference 2023

April 30 - May 4, 2023

Austin, TX, USA

Acceptance Rates

Overall Acceptance Rate 1,068 of 6,946 submissions, 15%
