DOI: 10.1145/3543507.3583297
research-article
Open access

Preserving Missing Data Distribution in Synthetic Data

Published: 30 April 2023

Abstract

Data from Web artifacts and from the Web is often sensitive and cannot be directly shared for data analysis. Therefore, synthetic data generated from the real data is increasingly used as a privacy-preserving substitute. In many cases, real data from the Web has missing values where the missingness itself carries important information, which domain experts leverage to improve their analysis. However, this information is lost if either imputation or deletion is applied before synthetic data generation. In this paper, we propose several methods to generate synthetic data that preserve both the observable and the missing data distributions. An extensive empirical evaluation over a range of carefully fabricated and real-world datasets demonstrates the effectiveness of our approach.
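The abstract's central claim, that deletion or imputation erases the information carried by missingness, can be illustrated with a small toy sketch. This is not one of the paper's methods. The columns (`group`, `income`), the missingness probabilities, and all function names are hypothetical, and the row-level bootstrap below is only a stand-in for a generator; it offers no privacy protection at all.

```python
import random


def make_toy_rows(n, seed=0):
    """Toy data (hypothetical): 'income' is missing far more often for the 'high' group."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["low", "high"])
        income = rng.gauss(30.0 if group == "low" else 90.0, 5.0)
        # Informative missingness: the chance of a missing value depends on the group.
        p_missing = 0.6 if group == "high" else 0.1
        if rng.random() < p_missing:
            income = None
        rows.append({"group": group, "income": income})
    return rows


def missing_rate_by_group(rows):
    """Fraction of rows whose income is missing, computed per group."""
    counts, missing = {}, {}
    for r in rows:
        counts[r["group"]] = counts.get(r["group"], 0) + 1
        missing[r["group"]] = missing.get(r["group"], 0) + (r["income"] is None)
    return {g: missing[g] / counts[g] for g in counts}


def delete_incomplete(rows):
    """Deletion: drop every row that has a missing income."""
    return [r for r in rows if r["income"] is not None]


def impute_mean(rows):
    """Imputation: replace each missing income with the observed mean."""
    observed = [r["income"] for r in rows if r["income"] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, "income": mean if r["income"] is None else r["income"]} for r in rows]


def resample_with_missingness(rows, n, seed=1):
    """Stand-in 'generator': bootstrap whole rows, missing values included."""
    rng = random.Random(seed)
    return [dict(rng.choice(rows)) for _ in range(n)]


real = make_toy_rows(5000)
print("real:     ", missing_rate_by_group(real))
print("deleted:  ", missing_rate_by_group(delete_incomplete(real)))
print("imputed:  ", missing_rate_by_group(impute_mean(real)))
print("resampled:", missing_rate_by_group(resample_with_missingness(real, 5000)))
```

After deletion or mean imputation, the per-group missing rate is zero for both groups, so the link between group and missingness can no longer be recovered from the data. The resampled data keeps missing values as first-class outcomes, so the per-group rates stay close to those of the real data.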

Cited By

  • (2024) High-Fidelity Synthetic Data Applications for Data Augmentation. Deep Learning - Recent Findings and Research. DOI: 10.5772/intechopen.113884. Online publication date: 29-May-2024.
  • (2024) Veil Privacy on Visual Data: Concealing Privacy for Humans, Unveiling for DNNs. Computer Vision – ECCV 2024, pp. 280-297. DOI: 10.1007/978-3-031-73010-8_17. Online publication date: 10-Nov-2024.

    Published In

    WWW '23: Proceedings of the ACM Web Conference 2023
    April 2023
    4293 pages
    ISBN: 9781450394161
    DOI: 10.1145/3543507
    This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GAN
    2. Missing Data
    3. Privacy
    4. Synthetic Data Generation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    WWW '23: The ACM Web Conference 2023
    April 30 - May 4, 2023
    Austin, TX, USA

    Acceptance Rates

    Overall Acceptance Rate 1,068 of 6,946 submissions, 15%

    Article Metrics

    • Downloads (Last 12 months): 614
    • Downloads (Last 6 weeks): 65
    Reflects downloads up to 12 Dec 2024
