GAN-Based Approaches for Generating Structured Data in the Medical Domain
Figure 1. The architecture of a GAN model: two adversarial networks are trained together. The generator is trained to produce new, realistic data that are indistinguishable from real data, while the discriminator determines whether the data are real or generated.
Figure 2. Schematic overview of the evaluation of data generated by the GAN variants: the data are divided into Train and Test sets, and the GAN models generate synthetic data based on the Train data. Train and generated data are combined into an extended dataset. The classifier is trained once with only the original Train data (Silver Standard) and then once with the extended dataset (including generated data) corresponding to each GAN variant. Finally, the classifiers are evaluated with the Test data.
Figure 3. The mean accuracy of classifiers trained on data generated by GAN variants versus the size of the training data (as a percentage of the Train data) using the BCW dataset. Rows (top and bottom) show the RS and RE sampling, while columns (left and right) indicate the SVM and MLP classifiers. The Silver Standard only considers Train data (no generated data) for training the classifier. Error bars show the standard error of the mean over 10 samples for each point.
Figure 4. Mean tumor perimeter, a key feature, in synthetic data from the GAN variants over 10 different samples, versus the size of the training data (in percent) using the BCW dataset.
Figure 5. The mean accuracy of classifiers trained on data generated by GAN variants versus the size of the training data (as a percentage of the Train data) using the BCC dataset. Rows (top and bottom) show the RS and RE sampling, while columns (left and right) indicate the SVM and MLP classifiers. The Silver Standard only considers Train data (no generated data) for training the classifier. Error bars show the standard error of the mean over 10 samples for each point.
Figure 6. Time usage (seconds) of the GAN variants for the generation of synthetic data over different numbers of epochs using the BCW dataset. For illustrative purposes, the GAN and CGAN data points are represented by squares and triangles in the main panel. The variation in time usage of GAN and CGAN for epochs 700–900 is shown in the inset.
Figure 7. Memory usage (megabytes) of the GAN variants for the generation of synthetic data using the BCW dataset.
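The evaluation scheme described in the captions (train a classifier on Train data only, train it again on Train plus generated data, score both on Test, and report the mean and standard error over 10 samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy two-class data, the jitter-based `toy_generator` (standing in for a trained GAN variant), and the nearest-centroid classifier (standing in for SVM or MLP) are all hypothetical placeholders, as are the split sizes and the `frac` parameter.

```python
import math
import random
import statistics


def make_toy_data(n, rng):
    """Two Gaussian classes in 2-D; a hypothetical stand-in for a tabular dataset."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 3.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def toy_generator(train, n_new, rng):
    """Placeholder for a trained GAN: adds Gaussian jitter to random Train rows."""
    out = []
    for _ in range(n_new):
        x, y = rng.choice(train)
        out.append(([v + rng.gauss(0.0, 0.3) for v in x], y))
    return out


def fit_centroids(rows):
    """Placeholder classifier (nearest centroid), used here instead of SVM or MLP."""
    centroids = {}
    for label in {y for _, y in rows}:
        pts = [x for x, y in rows if y == label]
        centroids[label] = [sum(col) / len(pts) for col in zip(*pts)]
    return centroids


def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid carries the correct label."""
    hits = 0
    for x, y in rows:
        pred = min(centroids, key=lambda c: math.dist(x, centroids[c]))
        hits += pred == y
    return hits / len(rows)


def run_once(seed, frac=0.5):
    """One sample: returns (Silver Standard accuracy, extended-dataset accuracy)."""
    rng = random.Random(seed)
    data = make_toy_data(200, rng)
    train, test = data[:140], data[140:]                 # assumed 70/30 split
    subset = rng.sample(train, int(frac * len(train)))   # fraction of Train data
    synthetic = toy_generator(subset, len(subset), rng)
    silver = accuracy(fit_centroids(subset), test)
    extended = accuracy(fit_centroids(subset + synthetic), test)
    return silver, extended


def mean_and_sem(values):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


results = [run_once(seed) for seed in range(10)]  # 10 samples per point
silver_mean, silver_sem = mean_and_sem([s for s, _ in results])
ext_mean, ext_sem = mean_and_sem([e for _, e in results])
print(f"Silver Standard: {silver_mean:.3f} +/- {silver_sem:.3f}")
print(f"Extended:        {ext_mean:.3f} +/- {ext_sem:.3f}")
```

Varying `frac` and repeating the run for each value would give one point per training-data size, which is the layout the accuracy figures use.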
Abstract
1. Introduction
2. Methods
2.1. Selected GAN Variants
2.1.1. Conditional Generative Adversarial Networks
2.1.2. Specific GANs for Tabular Data
2.1.3. Wasserstein Generative Adversarial Networks
2.2. Data, Experimental Setup, and Evaluation Framework
2.2.1. Data
2.2.2. Experiments and Evaluation Framework
3. Results
3.1. Breast Cancer Wisconsin—BCW
3.2. Breast Cancer Coimbra—BCC
3.3. Time Usage of GAN Variants
3.4. Memory Usage
4. Discussion
4.1. Evaluation of Synthetic Data
4.2. Data Scarcity
5. Conclusions
6. Patents
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
GANs | Generative Adversarial Networks |
CGANs | Conditional Generative Adversarial Networks |
CTGANs | Conditional Tabular Generative Adversarial Networks |
CopulaGANs | Copula Generative Adversarial Networks |
WGANs | Wasserstein Generative Adversarial Networks |
WGANGP | Wasserstein Generative Adversarial Networks with Gradient Penalty |
SVM | Support Vector Machines |
MLP | Multi-Layer Perceptron |
BCW | Breast Cancer Wisconsin |
BCC | Breast Cancer Coimbra |
Dataset | GAN | CGAN | CTGAN | CopulaGAN | WGANGP |
---|---|---|---|---|---|
BCW | 900 | 100 | 1000 | 800 | 300 |
BCC | 300 | 300 | 300 | 900 | 100 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abedi, M.; Hempel, L.; Sadeghi, S.; Kirsten, T. GAN-Based Approaches for Generating Structured Data in the Medical Domain. Appl. Sci. 2022, 12, 7075. https://doi.org/10.3390/app12147075