Enhancements to the data mining process
Publisher:
  • Stanford University
  • 408 Panama Mall, Suite 217
  • Stanford
  • CA
  • United States
Order Number: UMI Order No. GAX97-23376
Abstract

Data mining is the emerging science and industry of applying modern statistical and computational technologies to the problem of finding useful patterns hidden within large databases. This thesis describes the data mining process and presents advances and novel methods for the six steps in the data mining process: extracting data from a database or data warehouse, cleaning the data, data engineering, algorithm engineering, data mining, and analyzing the results.

We show how the standard data extraction process can be improved by building a direct interface between a data-mining algorithm and a relational database management system. Next, in data cleaning, we show how automatically iterating through the data mining process can identify records that can be profitably ignored during data mining. For data engineering, we develop an automated way to iterate through the data mining process to choose the subset of attributes that yields the best estimated results. In algorithm engineering, a similar process is used to automatically set the parameters of a mining algorithm.
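The attribute-selection idea in this paragraph, re-running the mining step on candidate attribute subsets and keeping the subset with the best estimated result, can be illustrated with a small greedy search. This is a minimal sketch under stated assumptions, not the thesis's implementation: the function names `loo_accuracy_1nn` and `forward_select` are hypothetical, a 1-nearest-neighbour classifier stands in for the mining algorithm, and leave-one-out accuracy stands in for the "estimated results".

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def loo_accuracy_1nn(X: List[Row], y: List[int], features: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that sees
    only the given attribute indices (a stand-in for any mining algorithm)."""
    if not features:
        return 0.0
    correct = 0
    for i, xi in enumerate(X):
        best_dist, best_label = float("inf"), None
        for j, xj in enumerate(X):
            if i == j:
                continue  # hold out the record being classified
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += best_label == y[i]
    return correct / len(X)


def forward_select(
    X: List[Row],
    y: List[int],
    estimate: Callable[[List[Row], List[int], List[int]], float] = loo_accuracy_1nn,
) -> Tuple[List[int], float]:
    """Greedy forward selection: repeatedly add the attribute that most
    improves the estimate, and stop when no addition helps."""
    n_features = len(X[0])
    selected: List[int] = []
    best_score = 0.0
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(estimate(X, y, selected + [f]), f) for f in candidates]
        if not scored:
            break  # every attribute is already selected
        score, feature = max(scored)
        if score <= best_score:
            break  # no candidate improves the estimate
        selected.append(feature)
        best_score = score
    return selected, best_score
```

The same loop could, in principle, search over algorithm parameter settings instead of attribute subsets by changing what `estimate` evaluates; that reading of the algorithm-engineering step is an assumption based on the abstract's phrase "a similar process".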

For the data mining algorithms, we study enhancements to classification tree induction methods and Bayesian methods. Our new flexible Bayes data-mining algorithm is fast, understandable, and more accurate than the standard Bayesian classifier in most situations. In classification tree induction we study various univariate splitting criteria and multivariate partitions.
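The abstract does not define the flexible Bayes algorithm. One plausible reading, stated here as an assumption, is a naive Bayesian classifier that models each continuous attribute with a kernel density estimate per class rather than a single Gaussian. The sketch below illustrates that reading only; the class name `KernelNaiveBayes` and the bandwidth choice of 1/sqrt(n) are hypothetical and not taken from the thesis.

```python
import math
from collections import defaultdict


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


class KernelNaiveBayes:
    """Naive Bayes with a per-class Gaussian kernel density estimate for each
    continuous attribute (an assumed reading of 'flexible Bayes')."""

    def fit(self, X, y):
        # Store the training rows of each class; the kernels are centred on them.
        self.by_class = defaultdict(list)
        for row, label in zip(X, y):
            self.by_class[label].append(row)
        self.n = len(X)
        return self

    def _log_score(self, x, label):
        rows = self.by_class[label]
        bandwidth = 1.0 / math.sqrt(len(rows))  # assumed bandwidth rule
        log_p = math.log(len(rows) / self.n)    # class prior
        for f, value in enumerate(x):
            # Average one Gaussian kernel per training value of this attribute.
            density = sum(gaussian_pdf(value, r[f], bandwidth) for r in rows) / len(rows)
            log_p += math.log(max(density, 1e-300))  # guard against log(0)
        return log_p

    def predict(self, x):
        # Attributes are treated as conditionally independent given the class.
        return max(self.by_class, key=lambda c: self._log_score(x, c))
```

A kernel estimate of this kind can represent a class whose attribute values fall into several separate clusters, which a single Gaussian per class cannot.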

The analysis of results is necessarily domain-dependent. In an example applying data mining to stock selection, we discuss a key requirement in real-world applications: using appropriate domain-dependent methods to evaluate the proposed solution.

Cited By

  1. Taylor P, Griffiths N, Bhalerao A, Xu Z, Gelencser A and Popham T (2017). Investigating the Feasibility of Vehicle Telemetry Data as a Means of Predicting Driver Workload, International Journal of Mobile Human Computer Interaction, 9:3, (54-72), Online publication date: 1-Jul-2017.
  2. Vidulin V, Bohanec M and Gams M (2014). Combining human analysis and machine data mining to obtain credible data relations, Information Sciences: an International Journal, 288:C, (254-278), Online publication date: 20-Dec-2014.
  3. Bergholz A, De Beer J, Glahn S, Moens M, Paaß G and Strobel S (2010). New filtering approaches for phishing email, Journal of Computer Security, 18:1, (7-35), Online publication date: 1-Jan-2010.
  4. El-Mouadib F, Zubi Z and Alhouni A New implementation of unsupervised ID3 algorithm (NIU-ID3) using Visual Basic.net Proceedings of the 8th WSEAS international conference on Data networks, communications, computers, (95-108)
  5. Hilas C and Sahalos J An Application of Decision Trees for Rule Extraction Towards Telecommunications Fraud Detection Knowledge-Based Intelligent Information and Engineering Systems and the XVII Italian Workshop on Neural Networks on Proceedings of the 11th International Conference, (1112-1121)
  6. Abe H and Yamaguchi T Constructive meta-level feature selection method based on method repositories Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, (70-80)
  7. Abdullah A and Hussain A Using biclustering for automatic attribute selection to enhance global visualization Proceedings of the 1st first visual information expert conference on Pixelization paradigm, (35-47)
  8. Liu H, Yin X and Han J An efficient multi-relational Naïve Bayesian classifier based on semantic relationship graph Proceedings of the 4th international workshop on Multi-relational mining, (39-48)
  9. Li X (2005). A scalable decision tree system and its application in pattern recognition and intrusion detection, Decision Support Systems, 41:1, (112-130), Online publication date: 1-Nov-2005.
  10. Amor N, Benferhat S and Elouedi Z Naive Bayes vs decision trees in intrusion detection systems Proceedings of the 2004 ACM symposium on Applied computing, (420-424)
  11. Kwak N and Choi C (2003). Feature Extraction Based on ICA for Binary Classification Problems, IEEE Transactions on Knowledge and Data Engineering, 15:6, (1374-1388), Online publication date: 1-Nov-2003.
  12. Graefe G, Fayyad U and Chaudhuri S On the efficient gathering of sufficient statistics for classification from large SQL databases Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, (204-208)
Contributors
  • IBM Research - Almaden