Article

Systematic data selection to mine concept-drifting data streams

Published: 22 August 2004

Abstract

One major problem with existing methods for mining data streams is that they make ad hoc choices about how to combine the most recent data with some amount of old data when searching for a new hypothesis. The assumption is that the additional old data always helps produce a more accurate hypothesis than using the most recent data alone. We first criticize this notion and point out that using old data blindly is no better than "gambling"; in other words, it increases accuracy only if we are "lucky." We then discuss and analyze the situations in which old data helps and what kind of old data helps. The practical difficulty in choosing the right examples from old data stems from the formidable cost of comparing different possibilities and models. This problem goes away if we have an algorithm efficient enough to compare all sensible choices at little extra cost. Based on this observation, we propose a simple, efficient, and accurate cross-validation decision tree ensemble method.
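The idea of comparing data-selection choices instead of guessing can be illustrated with a minimal sketch. This is not the paper's algorithm: it swaps the cross-validation decision tree ensemble for a single one-feature decision stump and plain k-fold cross-validation, and every name in it (`train_stump`, `cv_accuracy`, `select_training_data`, the toy data) is hypothetical. Each candidate adds a different set of old examples to the recent data, and the candidate whose model best predicts held-out recent examples is chosen.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Example = Tuple[float, int]  # (single numeric feature, binary label)


def train_stump(data: Sequence[Example]) -> Callable[[float], int]:
    """Fit a decision stump: the threshold and polarity with the fewest training errors."""
    best = (len(data) + 1, 0.0, 1)  # (errors, threshold, label predicted when x > threshold)
    for threshold in sorted({x for x, _ in data}):
        for high_label in (0, 1):
            errors = sum(
                (high_label if x > threshold else 1 - high_label) != y
                for x, y in data
            )
            if errors < best[0]:
                best = (errors, threshold, high_label)
    _, threshold, high_label = best
    return lambda x: high_label if x > threshold else 1 - high_label


def cv_accuracy(recent: Sequence[Example], extra: Sequence[Example], folds: int = 5) -> float:
    """Cross-validate on the recent data only; `extra` old data is added to every training fold."""
    correct = 0
    for k in range(folds):
        held_out = recent[k::folds]
        kept = [ex for i, ex in enumerate(recent) if i % folds != k]
        model = train_stump(kept + list(extra))
        correct += sum(model(x) == y for x, y in held_out)
    return correct / len(recent)


def select_training_data(
    recent: Sequence[Example], candidates: Dict[str, List[Example]]
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate choice of old data and return the best one with all scores."""
    scores = {name: cv_accuracy(recent, extra) for name, extra in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy stream (assumed for illustration): the current concept is "label 1 when x > 0.5".
recent = [(i / 20, int(i > 10)) for i in range(20)]
# Old data drawn from the same concept, sampled more densely.
consistent_old = [(j / 100, int(j > 50)) for j in range(100)]
# Old data from a drifted concept: "label 1 when x > 0.2".
stale_old = [(j / 100, int(j > 20)) for j in range(100)]

candidates = {
    "recent only": [],
    "recent + consistent old": consistent_old,
    "recent + stale old": stale_old,
}
best, scores = select_training_data(recent, candidates)
print(best, scores)
```

In this toy setting, old data that agrees with the current concept improves on the recent-only model, while old data from the drifted concept hurts it, which is the sense in which blindly adding old data is a gamble. The sketch retrains a model for every candidate and fold; the abstract's point is that such a comparison is only practical when the underlying learner makes it cheap.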

    Information

    Published In

    KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
    August 2004
    874 pages
    ISBN:1581138881
    DOI:10.1145/1014052
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. concept-drift
    2. data streams
    3. decision trees

    Qualifiers

    • Article

    Conference

    KDD04

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Article Metrics

    • Downloads (last 12 months): 45
    • Downloads (last 6 weeks): 1
    Reflects downloads up to 04 Jan 2025.

    Cited By

    • (2024) Incremental Data Drifting: Evaluation Metrics, Data Generation, and Approach Comparison. ACM Transactions on Intelligent Systems and Technology 15:4 (1-26). DOI: 10.1145/3655630. Online publication date: 24-May-2024
    • (2024) Dynamic Selection of Machine Learning Models for Time-Series Data. Information Sciences (120360). DOI: 10.1016/j.ins.2024.120360. Online publication date: Feb-2024
    • (2024) An experimental review of the ensemble-based data stream classification algorithms in non-stationary environments. Computers and Electrical Engineering 118 (109420). DOI: 10.1016/j.compeleceng.2024.109420. Online publication date: Sep-2024
    • (2024) A novel framework for concept drift detection using autoencoders for classification problems in data streams. International Journal of Machine Learning and Cybernetics. DOI: 10.1007/s13042-024-02223-2. Online publication date: 3-Jun-2024
    • (2023) Exploring nonlinear spatiotemporal effects for personalized next point-of-interest recommendation (一种基于非线性时空效应的个性化下一个兴趣点推荐方法). Frontiers of Information Technology & Electronic Engineering 24:9 (1273-1286). DOI: 10.1631/FITEE.2200304. Online publication date: 22-Sep-2023
    • (2023) Semi-supervised One-pass Learning under Distribution Shift. Proceedings of the 2023 6th International Conference on Big Data Technologies (306-313). DOI: 10.1145/3627377.3627446. Online publication date: 22-Sep-2023
    • (2023) Cost-Effective Incremental Deep Model: Matching Model Capacity With the Least Sampling. IEEE Transactions on Knowledge and Data Engineering 35:4 (3575-3588). DOI: 10.1109/TKDE.2021.3132622. Online publication date: 1-Apr-2023
    • (2021) Deep Session Interest Network Based on the Time Interval Encoding for the Click-through Rate Prediction. 2021 IEEE International Conference on Computer Science, Artificial Intelligence and Electronic Engineering (CSAIEE) (206-212). DOI: 10.1109/CSAIEE54046.2021.9543196. Online publication date: 20-Aug-2021
    • (2021) Recurring Drift Detection and Model Selection-Based Ensemble Classification for Data Streams with Unlabeled Data. New Generation Computing. DOI: 10.1007/s00354-021-00126-2. Online publication date: 20-Apr-2021
    • (2020) TADILOF: Time Aware Density-Based Incremental Local Outlier Detection in Data Streams. Sensors 20:20 (5829). DOI: 10.3390/s20205829. Online publication date: 15-Oct-2020
