DOI: 10.1145/1150402.1150531
Article

YALE: rapid prototyping for complex data mining tasks

Published: 20 August 2006

Abstract

KDD is a complex and demanding task. While a large number of methods have been established for numerous problems, many challenges remain to be solved. New tasks emerge, requiring the development of new methods or processing schemes. As in software development, developing such solutions demands careful analysis, specification, implementation, and testing. Rapid prototyping is an approach that allows crucial design decisions to be made as early as possible. A rapid prototyping system should support maximal re-use and innovative combinations of existing methods, as well as simple and quick integration of new ones.

This paper describes Yale, a free open-source environment for KDD and machine learning. Yale provides a rich variety of methods, which allows rapid prototyping for new applications and makes costly re-implementations unnecessary. Additionally, Yale offers extensive functionality for process evaluation and optimization, which is a crucial property for any KDD rapid prototyping tool. Following the paradigm of visual programming eases the design of processing schemes. While the graphical user interface supports interactive design, the underlying XML representation enables automated applications after the prototyping phase.

After a discussion of the key concepts of Yale, we illustrate the advantages of rapid prototyping for KDD on case studies ranging from data pre-processing to result visualization. These case studies cover tasks such as feature engineering, text mining, data stream mining and tracking drifting concepts, ensemble methods, and distributed data mining. This variety of applications is also reflected in a broad user base; we counted more than 40,000 downloads during the last twelve months.

    Information

    Published In

    KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2006
    986 pages
    ISBN:1595933395
    DOI:10.1145/1150402
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. KDD system
    2. audio and text mining
    3. data pre-processing
    4. data stream mining
    5. distributed data mining
    6. feature construction
    7. multimedia mining
    8. rapid prototyping
    9. result visualization

    Qualifiers

    • Article

    Conference

    KDD06

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
