Generation of synthetic data sets for evaluating the accuracy of knowledge discovery systems

Published: 21 August 2005
DOI: 10.1145/1081870.1081969

Abstract

Information Discovery and Analysis Systems (IDAS) are designed to correlate multiple sources of data and use data mining techniques to identify potentially significant events. Application domains for IDAS are numerous and include the emerging area of homeland security.

Developing test cases for an IDAS requires background data sets onto which hypothetical future scenarios can be overlaid. The IDAS can then be measured in terms of false positive and false negative error rates. Obtaining the test data sets can be an obstacle because of both privacy issues and the time and cost associated with collecting a diverse set of data sources.

In this paper, we give an overview of the design and architecture of an IDAS Data Set Generator (IDSG) that enables a fast and comprehensive test of an IDAS. The IDSG generates data using statistical and rule-based algorithms as well as semantic graphs that represent interdependencies between attributes. A credit card transaction application is used to illustrate the approach.
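
The abstract does not specify how the IDSG works internally, so the following Python sketch is only a rough illustration of the general idea, not the authors' design. It samples credit card transaction attributes in the order given by a small dependency graph, applies a rule that restricts which merchant types each income band may use, overlays labeled scenario events on the background data, and computes false positive and false negative rates for a detector. Every attribute name, value range, rule, and function name here (`DEPENDS_ON`, `generate_background`, `overlay_scenario`, `error_rates`, `rule_detector`) is a hypothetical choice made for the example.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency graph: attribute -> attributes it depends on.
DEPENDS_ON = {
    "income_band": [],
    "merchant_type": ["income_band"],
    "amount": ["merchant_type"],
}

# Hypothetical rules and value ranges (assumptions, not taken from the paper).
MERCHANTS_BY_INCOME = {
    "low": ["grocery", "fuel"],
    "high": ["grocery", "fuel", "jewelry"],
}
AMOUNT_RANGE = {"grocery": (5, 150), "fuel": (10, 80), "jewelry": (200, 5000)}

# One sampler per attribute; each may read attributes already set on the record.
SAMPLERS = {
    "income_band": lambda rng, rec: rng.choice(["low", "high"]),
    "merchant_type": lambda rng, rec: rng.choice(MERCHANTS_BY_INCOME[rec["income_band"]]),
    "amount": lambda rng, rec: round(rng.uniform(*AMOUNT_RANGE[rec["merchant_type"]]), 2),
}

# Sample each attribute only after the attributes it depends on.
ORDER = list(TopologicalSorter(DEPENDS_ON).static_order())


def generate_background(n, seed=0):
    """Generate n background transactions that respect the rules above."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        rec = {}
        for attr in ORDER:
            rec[attr] = SAMPLERS[attr](rng, rec)
        rec["is_scenario"] = False
        records.append(rec)
    return records


def overlay_scenario(background, n_events, seed=1):
    """Insert labeled scenario events at random positions in a copy of the data."""
    rng = random.Random(seed)
    data = list(background)
    for _ in range(n_events):
        event = {
            "income_band": "low",
            "merchant_type": "jewelry",  # violates the background rule on purpose
            "amount": round(rng.uniform(3000, 5000), 2),
            "is_scenario": True,
        }
        data.insert(rng.randrange(len(data) + 1), event)
    return data


def error_rates(data, detector):
    """Return (false positive rate, false negative rate) for a detector function."""
    negatives = [t for t in data if not t["is_scenario"]]
    positives = [t for t in data if t["is_scenario"]]
    fp = sum(1 for t in negatives if detector(t))
    fn = sum(1 for t in positives if not detector(t))
    fp_rate = fp / len(negatives) if negatives else 0.0
    fn_rate = fn / len(positives) if positives else 0.0
    return fp_rate, fn_rate


def rule_detector(t):
    """Flag transactions that break the income/merchant rule."""
    return t["income_band"] == "low" and t["merchant_type"] == "jewelry"


if __name__ == "__main__":
    data = overlay_scenario(generate_background(1000), n_events=10)
    print("rule detector:", error_rates(data, rule_detector))
    print("amount > 1000:", error_rates(data, lambda t: t["amount"] > 1000))
```

Because the scenario events are labeled when they are inserted, the ground truth is known exactly, which is what makes the error rates computable without access to real, privacy-sensitive data.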

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
August 2005
844 pages
ISBN: 159593135X
DOI: 10.1145/1081870
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data generation
  2. data mining
  3. information discovery

Qualifiers

  • Article

Conference

KDD05

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2024) Enhancing Machine Learning Model Accuracy through Novel SDNIoT Dataset Generation. 2024 International Conference on Intelligent Systems for Cybersecurity (ISCS), pp. 01-06. DOI: 10.1109/ISCS61804.2024.10581240. Online publication date: 3-May-2024
  • (2024) Trading Off Scalability, Privacy, and Performance in Data Synthesis. IEEE Access 12, pp. 26642-26654. DOI: 10.1109/ACCESS.2024.3366556. Online publication date: 2024
  • (2020) Large-Scale Generation and Validation of Synthetic PMU Data. IEEE Transactions on Smart Grid 11:5, pp. 4290-4298. DOI: 10.1109/TSG.2020.2977349. Online publication date: Sep-2020
  • (2019) PMU Data Feature Considerations for Realistic, Synthetic Data Generation. 2019 North American Power Symposium (NAPS), pp. 1-6. DOI: 10.1109/NAPS46351.2019.9000335. Online publication date: Oct-2019
  • (2019) Data generators: a short survey of techniques and use cases with focus on testing. 2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin), pp. 189-194. DOI: 10.1109/ICCE-Berlin47944.2019.8966202. Online publication date: Sep-2019
  • (2018) A tool for generating synthetic data. Proceedings of the First International Conference on Data Science, E-learning and Information Systems, pp. 1-6. DOI: 10.1145/3279996.3280018. Online publication date: 1-Oct-2018
  • (2018) Statistical Methods for Generating Synthetic Email Data Sets. 2018 IEEE International Conference on Big Data (Big Data), pp. 3986-3990. DOI: 10.1109/BigData.2018.8622601. Online publication date: Dec-2018
  • (2017) Applying Combinatorial Testing to Data Mining Algorithms. 2017 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pp. 253-261. DOI: 10.1109/ICSTW.2017.46. Online publication date: Mar-2017
  • (2016) Generative Data Models for Validation and Evaluation of Visualization Techniques. Proceedings of the Sixth Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization, pp. 112-124. DOI: 10.1145/2993901.2993907. Online publication date: 24-Oct-2016
  • (2016) Automatic Artificial Data Generator: Framework and implementation. 2016 International Conference on Information and Communication Technology (ICICTM), pp. 56-60. DOI: 10.1109/ICICTM.2016.7890777. Online publication date: 2016
