Impacts of Data Synthesis: A Metric for Quantifiable Data Standards and Performances
Figure 1. Data synthesis, sharing, and testing use case.
Figure 2. Pearson correlation for the original data set in the "lower" triangle and the SynD1 data set in the "upper" triangle.
Figure 3. Pearson correlation for the original data set in the "lower" triangle and the SynW1 data set in the "upper" triangle.
Figure 4. Relative frequency distributions of a few original (observed) and SynD1 (synthetic) data set variables.
Figure 5. Relative frequency distributions of a few original (observed) and SynW1 (synthetic) data set variables.
Figure 6. Uniform Manifold Approximation and Projection for the original DIPP data set.
Figure 7. Uniform Manifold Approximation and Projection for the SynD1 data set.
Figure 8. Uniform Manifold Approximation and Projection for the original WDBC data set.
Figure 9. Uniform Manifold Approximation and Projection for the SynW1 data set.
Figure A1. Entropy per bit for original and SynD1 data variables.
Figure A2. Entropy per bit for original and synthetic data variables.
Abstract
1. Introduction
2. Related Work
3. Methodologies
3.1. Synthpop
3.1.1. Methods for Synthesis
3.1.2. Controlling the Sequence and Prediction
3.1.3. Handling Data with Restricted and Missing Values
3.2. Utility Measures of Data
3.2.1. General and Specific Utility Measures
3.3. Quality of Information
4. Experiments and Results
4.1. Specific and General Utility
4.2. Quality of Information
5. Conclusions and Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
CART | Classification And Regression Tree |
DIPP | Finnish Type 1 Diabetes Prediction and Prevention study database |
EHR | Electronic Health Records |
FN | False Negative |
FP | False Positive |
GAN | Generative Adversarial Network |
GBM | Gradient Boosting Machine |
GDPR | General Data Protection Regulation |
k-NN | k-Nearest Neighbor |
MI | Mutual Information |
ML | Machine Learning |
PPMCC | Pearson product-moment correlation coefficient |
RF | Random Forest |
ROC | Receiver Operating Characteristic |
SYLLS | Synthetic Data Estimation for UK Longitudinal Studies |
TN | True Negative |
TP | True Positive |
t-SNE | t-distributed Stochastic Neighbor Embedding |
UMAP | Uniform Manifold Approximation and Projection |
WDBC | Wisconsin Diagnostic Breast Cancer data set |
Appendix A
Appendix A.1. Wisconsin Diagnostic Breast Cancer Data Set
- radius (mean of distances from the center to points on the perimeter)
- texture (standard deviation of grey-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter² / area − 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" − 1)
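Several of the features listed above are simple geometric quantities. As a minimal sketch, assuming the nucleus boundary is available as a closed polygon of (x, y) vertices and that the centroid can be approximated by the mean of those vertices, the radius, perimeter, area, and compactness (taken here as perimeter² / area − 1.0, following the public description of the data set) could be computed as follows. The function name `contour_features` and the square example are hypothetical and are not part of the original feature extraction.

```python
import math


def contour_features(points):
    """Shape features for a closed contour given as a list of (x, y) vertices."""
    n = len(points)
    # Assumption: the centroid is approximated by the mean of the vertices.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # Radius: mean distance from the centroid to the contour points.
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the contour
        perimeter += math.hypot(x2 - x1, y2 - y1)
        twice_area += x1 * y2 - x2 * y1  # shoelace formula term
    area = abs(twice_area) / 2.0

    compactness = perimeter ** 2 / area - 1.0
    return {
        "radius": radius,
        "perimeter": perimeter,
        "area": area,
        "compactness": compactness,
    }


# Hypothetical contour: a square with side length 2.
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
features = contour_features(square)
```

Because compactness divides a squared length by an area, it does not depend on the scale of the contour, only on its shape.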
Type 1 Diabetes Prediction and Prevention Data Set
Attributes | Description |
---|---|
POS_antibodies | Response variable— 1 the child had two or more consecutive positive samples in any of the auto-antibodies, 0 otherwise |
length | Length at birth (cm) |
weight | Weight at birth (g) |
circle_of_head | Head circumference measured at birth (cm) |
ratio_head_length | Ratio between head circumference and length measured at birth (cm) |
Mom_birth_age | Age of mother at the time of birth (years) |
height_growth | Growth rate calculated by: (height measured at the last visit − length at birth)/age in months |
weight_growth | Growth rate calculated by: (weight measured at the last visit − birth weight/1000)/age in months |
GADA.UUSI | Maximum value of GADA antibody that occurred before 12 months old (negative value) |
mIAA.3.470 | Maximum value of IAA antibody that occurred before 12 months old (negative value) |
IA2A.0.42 | Maximum value of IA2A antibody that occurred before 12 months old (negative value) |
s.gender | Gender 1—male, 2—female |
duration | Pregnancy duration: 0—pre term 0 to 37 weeks, 1—normal 37 to 42 weeks, 2—post-term > 42 weeks |
month | Month of birth—from 1 to 12 |
mother_antib | 1—if the child’s mother had positive autoantibodies, 0 otherwise |
sibling_antib | 1—if the child’s sibling had positive autoantibodies, 0 otherwise |
has_sibling | 1—if the child has siblings, 0 otherwise |
is_mom_t1d | Does mom have t1d 1—yes, 0—no |
is_dad_t1d | Does dad have t1d 1—yes, 0—no |
v.breastfeeding_only | Age when exclusive breastfeeding has ended (months) |
v.breastfeeding_ended | Age when any breastfeeding has ended (months)—maximum is 12, which means currently still breastfeeding. |
i.infections_ear | 0—no ear infections, 1—1 infection, 2—more than 2 infections |
i.infections_eye | 0—no eye infections, 1—more than 1 infections |
i.infections_hospital_care | 0—no infections requiring a hospital stay, 1—more than 1 infections |
i.infections_airway | 0—no airway infections, 1—1 infection, 2—more than 2 infections |
i.infections_gastric | 0—no infections, 1—1 or more infections |
i.infections_other | |
i.infections_fever | |
i.infections_roseola | |
i.infections_chickenpox | |
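The derived attributes ratio_head_length, height_growth, and weight_growth in the table above follow directly from the birth and last-visit measurements. The sketch below restates those formulas in Python. It assumes that the last-visit weight is recorded in kilograms (the division of the birth weight by 1000 suggests a conversion from grams) and that the age is strictly positive; the function name and the example values are hypothetical.

```python
def derived_growth_features(birth_length_cm, birth_weight_g, head_circumference_cm,
                            last_height_cm, last_weight_kg, age_months):
    """Derived DIPP-style attributes computed from raw measurements."""
    if age_months <= 0:
        raise ValueError("age_months must be positive")
    return {
        # Head circumference divided by length, both measured at birth.
        "ratio_head_length": head_circumference_cm / birth_length_cm,
        # (height at last visit - length at birth) / age in months
        "height_growth": (last_height_cm - birth_length_cm) / age_months,
        # Assumption: last-visit weight is in kg, so birth weight is converted from g.
        "weight_growth": (last_weight_kg - birth_weight_g / 1000) / age_months,
    }


# Hypothetical child: 50 cm and 3500 g at birth, 80 cm and 10 kg at 12 months.
example = derived_growth_features(
    birth_length_cm=50, birth_weight_g=3500, head_circumference_cm=35,
    last_height_cm=80, last_weight_kg=10, age_months=12,
)
```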
Appendix A.2. Miscellaneous Results
Attribute | KS p-Value | Cucconi p-Value |
---|---|---|
length | 0.7170990 | 0.603 |
weight | 0.7924978 | 0.403 |
circle_of_head | 1.0000000 | 0.914 |
ratio_head_length | 0.9937073 | 0.495 |
Mom_birth_age | 0.8930451 | 0.437 |
height_growth | 0.9438003 | 0.629 |
weight_growth | 0.7464065 | 0.472 |
GADA.UUSI | 0.8380866 | 0.784 |
mIAA.3.470 | 0.5239224 | 0.965 |
IA2A.0.42 | 0.8097315 | 0.383 |
month | 0.4346488 | 0.167 |
v.breastfeeding_only | 0.9999954 | 0.946 |
v.breastfeeding_ended | 0.9916316 | 0.981 |
Attribute | KS p-Value |
---|---|
radius_mean | 0.9613699 |
texture_mean | 0.9089228 |
perimeter_mean | 0.9613699 |
area_mean | 0.9383389 |
smoothness_mean | 0.5924107 |
compactness_mean | 0.9999932 |
concavity_mean | 0.9780573 |
concave.point_mean | 0.9613699 |
symmetry_mean | 0.8735816 |
fractal_dimension_mean | 0.9890057 |
radius_se | 0.8735816 |
texture_se | 0.9890057 |
perimeter_se | 0.9613699 |
area_se | 0.8735816 |
smoothness_se | 0.2048226 |
compactness_se | 0.9983954 |
concavity_se | 0.9780573 |
concave.point_se | 0.4076697 |
symmetry_se | 0.6921113 |
fractal_dimension_se | 0.9613699 |
radius_worst | 0.9089228 |
texture_worst | 0.9995891 |
perimeter_worst | 0.6421872 |
area_worst | 0.8735816 |
smoothness_worst | 0.4507638 |
compactness_worst | 0.9383389 |
concavity_worst | 0.9953208 |
concave.point_worst | 0.9983954 |
symmetry_worst | 0.7412813 |
fractal_dimension_worst | 0.9953208 |
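The KS p-values in the tables above come from a two-sample Kolmogorov-Smirnov test applied per attribute to the original and synthetic values. A large p-value means the test finds no evidence that the two distributions differ. The following is a minimal pure-Python sketch of such a test, not the implementation used for the reported values: it computes the KS statistic exactly but uses an asymptotic approximation with a small-sample correction for the p-value, so results may differ slightly from those of a statistics package. The function name `ks_two_sample` and the sample values are hypothetical.

```python
import math
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return (D, approximate p-value) for the two-sample Kolmogorov-Smirnov test."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)

    # D is the largest gap between the two empirical CDFs,
    # evaluated at every observed value.
    d = max(
        abs(bisect_right(a_sorted, x) / n - bisect_right(b_sorted, x) / m)
        for x in a_sorted + b_sorted
    )

    # Asymptotic Kolmogorov distribution with a small-sample correction
    # (an approximation, not an exact p-value).
    n_eff = n * m / (n + m)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.2:
        # The survival function is numerically 1 for such small arguments.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2) for j in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Hypothetical original and synthetic values for a single attribute.
original = [4.1, 5.0, 5.2, 6.3, 7.7, 8.0]
synthetic = [4.0, 5.1, 5.5, 6.0, 7.9, 8.2]
d_stat, p_value = ks_two_sample(original, synthetic)
```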
Method | Description | Data Type |
---|---|---|
Non-parametric | | |
ctree, cart | Classification and regression trees | Any |
surv.ctree | Classification and regression trees | Duration |
Parametric | | |
norm | Normal linear regression | Numeric |
normrank * | Normal linear regression preserving the marginal distribution | Numeric |
logreg * | Logistic regression | Binary |
polyreg * | Polytomous logistic regression | Factor, >2 levels |
polr * | Ordered polytomous logistic regression | Ordered factor, >2 levels |
pmm | Predictive mean matching | Numeric |
Other | | |
sample | Random sample from the observed data | Any |
passive | Function of the other synthesized data | Any |
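To illustrate how two of the methods in the table above can be chained, the sketch below synthesizes a pair of numeric variables in sequence: the first is drawn by random sampling from the observed values (in the spirit of `sample`), and the second is generated from a normal linear regression on the first (in the spirit of `norm`). This is a simplified Python illustration, not Synthpop's R implementation. In particular, it keeps the fitted regression coefficients fixed and adds only residual noise, whereas a full implementation may also account for parameter uncertainty. The function name `synthesize_pair` and the data are hypothetical.

```python
import math
import random


def synthesize_pair(x_obs, y_obs, n, seed=0):
    """Synthesize n (x, y) pairs; requires more than two observed pairs."""
    rng = random.Random(seed)

    # Step 1 ("sample"-like): draw x from the observed values with replacement.
    x_syn = rng.choices(x_obs, k=n)

    # Step 2 ("norm"-like): fit y = a + b * x by ordinary least squares.
    k = len(x_obs)
    mean_x = sum(x_obs) / k
    mean_y = sum(y_obs) / k
    sxx = sum((x - mean_x) ** 2 for x in x_obs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_obs, y_obs))
    b = sxy / sxx
    a = mean_y - b * mean_x

    # Residual standard deviation with k - 2 degrees of freedom.
    rss = sum((y - (a + b * x)) ** 2 for x, y in zip(x_obs, y_obs))
    sigma = math.sqrt(rss / (k - 2))

    # Simplification: coefficients are held fixed; only residual noise is drawn.
    y_syn = [a + b * x + rng.gauss(0.0, sigma) for x in x_syn]
    return x_syn, y_syn


# Hypothetical observed data: birth length (cm) and birth weight (g).
length = [48.0, 49.5, 50.0, 51.0, 52.5, 53.0]
weight = [3100.0, 3300.0, 3450.0, 3600.0, 3900.0, 3950.0]
syn_length, syn_weight = synthesize_pair(length, weight, n=10, seed=42)
```

The order of synthesis matters in such a scheme, because each variable can only be conditioned on variables that have already been synthesized.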
Test Set | Training Set | Class | Predicted Negative | Predicted Positive | F1 Score | ROC Curve (AUC) |
---|---|---|---|---|---|---|
Original DIPP | Original DIPP | Negative | 89 | 16 | 0.85 | |
Original DIPP | Original DIPP | Positive | 5 | 56 | 0.82 | 0.95 |
SynD1 | SynD1 | Negative | 83 | 19 | 0.88 | |
SynD1 | SynD1 | Positive | 1 | 63 | 0.85 | 0.93 |
SynD2 | SynD2 | Negative | 82 | 20 | 0.87 | |
SynD2 | SynD2 | Positive | 3 | 61 | 0.82 | 0.93 |
SynD3 | SynD3 | Negative | 98 | 4 | 0.85 | |
SynD3 | SynD3 | Positive | 14 | 49 | 0.78 | 0.92 |
SynD4 | SynD4 | Negative | 83 | 18 | 0.86 | |
SynD4 | SynD4 | Positive | 0 | 65 | 0.89 | 0.95 |
Original WDBC | Original WDBC | Negative | 113 | 2 | 0.98 | |
Original WDBC | Original WDBC | Positive | 3 | 53 | 0.95 | 0.96 |
SynW1 | SynW1 | Negative | 111 | 2 | 0.98 | |
SynW1 | SynW1 | Positive | 2 | 56 | 0.97 | 0.97 |
Original WDBC | SynW2 | Negative | 102 | 5 | 0.94 | |
Original WDBC | SynW2 | Positive | 7 | 57 | 0.90 | 0.92 |
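The table above reports one F1 score per class alongside the confusion-matrix counts. As a reminder of how a per-class F1 score follows from such counts, the sketch below uses the standard definition F1 = 2TP / (2TP + FP + FN), treating each class in turn as the class of interest. The counts in the example are hypothetical and are not taken from the table, and the reported scores may have been computed under a different protocol (for example, averaged over several runs).

```python
def f1_scores(tn, fp, fn, tp):
    """Per-class F1 scores from binary confusion-matrix counts."""

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    return {
        # Positive class: TP are hits, FP and FN are the two error types.
        "positive": f1(tp, fp, fn),
        # Negative class: the roles swap, so TN are hits and FN/FP trade places.
        "negative": f1(tn, fn, fp),
    }


# Hypothetical counts for illustration only.
scores = f1_scores(tn=100, fp=10, fn=5, tp=50)
```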
Data Set | p-Value |
---|---|
SynD1 | 0.0965496 |
SynD2 | 0.0485093 |
SynD3 | 0.1755973 |
SynD4 | 0.0006553 |
SynW1 | 0.0837001 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Chandra, G.; Siirtola, P.; Tamminen, S.; Knip, M.J.; Veijola, R.; Röning, J. Impacts of Data Synthesis: A Metric for Quantifiable Data Standards and Performances. Data 2022, 7, 178. https://doi.org/10.3390/data7120178