[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/775047.775124acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

Mining complex models from arbitrarily large databases in constant time

Published: 23 July 2002 Publication History

Abstract

In this paper we propose a scaling-up method that is applicable to essentially any induction algorithm based on discrete search. The result of applying the method to an algorithm is that its running time becomes independent of the size of the database, while the decisions made are essentially identical to those that would be made given infinite data. The method works within pre-specified memory limits and, as long as the data is iid, only requires accessing it sequentially. It gives anytime results, and can be used to produce batch, stream, time-changing and active-learning versions of an algorithm. We apply the method to learning Bayesian networks, developing an algorithm that is faster than previous ones by orders of magnitude, while achieving essentially the same predictive performance. We observe these gains on a series of large databases "generated from benchmark networks, on the KDD Cup 2000 e-commerce data, and on a Web log containing 100 million requests.

References

[1]
D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201--221, 1994.
[2]
C. Domingo, R. Gavalda, and O. Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6:131--152, 2002.
[3]
P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. 6th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 71--80, Boston, MA, 2000.
[4]
P. Domingos and G. Hulten. Learning from infinite data in finite time. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.
[5]
N. Friedman, I. Nachman, and D. Peér. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence, pp. 206--215, Stockholm, Sweden, 1999.
[6]
N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pp. 274--282, Portland, OR, 1996.
[7]
R. Greiner. PALO: A probabilistic hill-climbing algorithm. Artificial Intelligence, 84:177--208, 1996.
[8]
D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197--243, 1995.
[9]
W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13--30, 1963.
[10]
G. Hulten and P. Domingos. A general method for scaling up learning algorithms and its application to Bayesian networks. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2002.
[11]
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. 7th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 97--106, San Francisco, CA, 2001.
[12]
R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86--98, 2000.
[13]
O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann, San Mateo, CA, 1994.
[14]
C. Meek, B. Thiesson, and D. Heckerman. The learning-curve method applied to model-based clustering. Journal of Machine Learning Research, 2:397--418, 2002.
[15]
F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proc. 5th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 23--32, San Diego, CA, 1999.
[16]
T. Scheffer and S. Wrobel. Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems. In Proc. 18th International Conf. on Machine Learning, pp. 481--488, Williamstown, MA, 2001.
[17]
A. Wald. Sequential analysis. Wiley, New York, 1947.

Cited By

View all
  • (2023)FPGA-Accelerated Causal Discovery with Conditional Independence Test Prioritization2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL60245.2023.00033(182-188)Online publication date: 4-Sep-2023
  • (2021)Fast Corotated Elastic SPH Solids with Implicit Zero-Energy Mode ControlProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801424:3(1-21)Online publication date: 27-Sep-2021
  • (2021)Visual Simulation of Soil-Structure Destruction with Seepage FlowsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801414:3(1-18)Online publication date: 27-Sep-2021
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
July 2002
719 pages
ISBN:158113567X
DOI:10.1145/775047
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2002

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Bayesian networks
  2. Hoeffding bounds
  3. discrete search
  4. scalable learning algorithms
  5. subsampling

Qualifiers

  • Article

Conference

KDD02
Sponsor:

Acceptance Rates

KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2
  • Downloads (Last 6 weeks)0
Reflects downloads up to 30 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2023)FPGA-Accelerated Causal Discovery with Conditional Independence Test Prioritization2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL60245.2023.00033(182-188)Online publication date: 4-Sep-2023
  • (2021)Fast Corotated Elastic SPH Solids with Implicit Zero-Energy Mode ControlProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801424:3(1-21)Online publication date: 27-Sep-2021
  • (2021)Visual Simulation of Soil-Structure Destruction with Seepage FlowsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801414:3(1-18)Online publication date: 27-Sep-2021
  • (2016)Mining Decision Trees from StreamsData Stream Management10.1007/978-3-540-28608-0_9(189-208)Online publication date: 12-Jul-2016
  • (2015)A real time processing and streaming of wireless network data using Storm2015 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC)10.1109/ICCPEIC.2015.7259471(0244-0249)Online publication date: Apr-2015
  • (2015)Adaptive Online Neural Network for Face Identification with Concept DriftIntelligent Systems'201410.1007/978-3-319-11310-4_61(703-712)Online publication date: 2015
  • (2012)Monte Carlo MCMCProceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning10.5555/2390948.2391072(1104-1113)Online publication date: 12-Jul-2012
  • (2012)Spherical cubesCommunications of the ACM10.1145/2347736.234775755:10(90-97)Online publication date: 1-Oct-2012
  • (2012)A few useful things to know about machine learningCommunications of the ACM10.1145/2347736.234775555:10(78-87)Online publication date: 1-Oct-2012
  • (2012)The foresight saga, reduxCommunications of the ACM10.1145/2347736.234774655:10(26-29)Online publication date: 1-Oct-2012
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media