Article

Systematic data selection to mine concept-drifting data streams

Author:

Wei Fan

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 128 - 137

https://doi.org/10.1145/1014052.1014069

Published: 22 August 2004

Abstract

One major problem of existing methods for mining data streams is that they make ad hoc choices when combining the most recent data with some amount of old data to search for a new hypothesis. The assumption is that additional old data always helps produce a more accurate hypothesis than using the most recent data alone. We first criticize this notion and point out that using old data blindly is no better than "gambling"; in other words, it helps increase accuracy only if we are "lucky." We discuss and analyze the situations in which old data will help and what kind of old data will help. The practical problem of choosing the right examples from old data is due to the formidable cost of comparing different possibilities and models. This problem goes away if we have an algorithm that is efficient enough to compare all sensible choices at little extra cost. Based on this observation, we propose a simple, efficient, and accurate cross-validation decision tree ensemble method.
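The idea of comparing "all sensible choices" by cross-validation can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract describes a decision tree ensemble, while the sketch below substitutes a one-feature threshold stump to stay short. The three candidate training sets, the function names, and the toy data are all assumptions made for illustration. Each candidate is scored by k-fold cross-validation on the most recent data only, so old data is never used for testing.

```python
def train_stump(examples):
    """Fit a threshold classifier (stand-in for a tree): predict 1 when x >= threshold."""
    best_threshold, best_errors = 0, len(examples) + 1
    for threshold in sorted({x for x, _ in examples}):
        errors = sum((x >= threshold) != bool(y) for x, y in examples)
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def accuracy(threshold, examples):
    """Fraction of examples the threshold classifier labels correctly."""
    return sum((x >= threshold) == bool(y) for x, y in examples) / len(examples)


def cross_validated_accuracy(recent, extra_old, folds=5):
    """k-fold CV over recent data; old data is only ever added to the training side."""
    scores = []
    for i in range(folds):
        held_out = recent[i::folds]
        train = [ex for j, ex in enumerate(recent) if j % folds != i] + extra_old
        scores.append(accuracy(train_stump(train), held_out))
    return sum(scores) / folds


def select_training_data(recent, old, folds=5):
    """Score each hypothetical candidate and return the best name plus all scores."""
    recent_model = train_stump(recent)
    # Assumed heuristic: keep old examples whose label agrees with a recent-only model.
    consistent_old = [(x, y) for x, y in old if (x >= recent_model) == bool(y)]
    candidates = {
        "recent_only": [],
        "recent_plus_all_old": old,
        "recent_plus_consistent_old": consistent_old,
    }
    scores = {name: cross_validated_accuracy(recent, extra, folds)
              for name, extra in candidates.items()}
    return max(scores, key=scores.get), scores


# Toy stream with concept drift: the decision boundary moved from x >= 6 to x >= 14.
old = [(i % 20, int(i % 20 >= 6)) for i in range(200)]
recent = [(x, int(x >= 14)) for x in range(20)]

best, scores = select_training_data(recent, old)
print(best, scores)
```

On this toy stream, adding all old data blindly lowers the cross-validated accuracy relative to using recent data alone, while the filtered old data does not hurt. With a different drift pattern the ranking could change, which is why the choice is made by measurement rather than by a fixed rule.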

Index Terms

Systematic data selection to mine concept-drifting data streams
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining

August 2004, 874 pages

ISBN: 1581138881

DOI: 10.1145/1014052

General Chairs: Won Kim (Cyber Database Solutions), Ronny Kohavi (Amazon.com)

Program Chairs: Johannes Gehrke (Cornell University), William DuMouchel (AT&T Labs Research)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD04: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 22 - 25, 2004

Seattle, WA, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
