More Web Proxy on the site http://driver.im/

Article

Mining complex models from arbitrarily large databases in constant time

Authors:

Pedro DomingosAuthors Info & Claims

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 525 - 531

https://doi.org/10.1145/775047.775124

Published: 23 July 2002 Publication History

Abstract

In this paper we propose a scaling-up method that is applicable to essentially any induction algorithm based on discrete search. The result of applying the method to an algorithm is that its running time becomes independent of the size of the database, while the decisions made are essentially identical to those that would be made given infinite data. The method works within pre-specified memory limits and, as long as the data is iid, only requires accessing it sequentially. It gives anytime results, and can be used to produce batch, stream, time-changing and active-learning versions of an algorithm. We apply the method to learning Bayesian networks, developing an algorithm that is faster than previous ones by orders of magnitude, while achieving essentially the same predictive performance. We observe these gains on a series of large databases "generated from benchmark networks, on the KDD Cup 2000 e-commerce data, and on a Web log containing 100 million requests.

References

[1]

D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15:201--221, 1994.

[2]

C. Domingo, R. Gavalda, and O. Watanabe. Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6:131--152, 2002.

Digital Library

[3]

P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. 6th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 71--80, Boston, MA, 2000.

Digital Library

[4]

P. Domingos and G. Hulten. Learning from infinite data in finite time. In Advances in Neural Information Processing Systems 14. MIT Press, Cambridge, MA, 2002.

[5]

N. Friedman, I. Nachman, and D. Peér. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. In Proc. 15th Conf. on Uncertainty in Artificial Intelligence, pp. 206--215, Stockholm, Sweden, 1999.

Digital Library

[6]

N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In Proc. 12th Conf. on Uncertainty in Artificial Intelligence, pp. 274--282, Portland, OR, 1996.

Digital Library

[7]

R. Greiner. PALO: A probabilistic hill-climbing algorithm. Artificial Intelligence, 84:177--208, 1996.

Digital Library

[8]

D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197--243, 1995.

[9]

W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13--30, 1963.

[10]

G. Hulten and P. Domingos. A general method for scaling up learning algorithms and its application to Bayesian networks. Technical report, Department of Computer Science and Engineering, University of Washington, Seattle, WA, 2002.

[11]

G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proc. 7th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 97--106, San Francisco, CA, 2001.

Digital Library

[12]

R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86--98, 2000.

Digital Library

[13]

O. Maron and A. Moore. Hoeffding races: Accelerating model selection search for classification and function approximation. In Advances in Neural Information Processing Systems 6. Morgan Kaufmann, San Mateo, CA, 1994.

[14]

C. Meek, B. Thiesson, and D. Heckerman. The learning-curve method applied to model-based clustering. Journal of Machine Learning Research, 2:397--418, 2002.

Digital Library

[15]

F. Provost, D. Jensen, and T. Oates. Efficient progressive sampling. In Proc. 5th ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pp. 23--32, San Diego, CA, 1999.

Digital Library

[16]

T. Scheffer and S. Wrobel. Incremental maximization of non-instance-averaging utility functions with applications to knowledge discovery problems. In Proc. 18th International Conf. on Machine Learning, pp. 481--488, Williamstown, MA, 2001.

Digital Library

[17]

A. Wald. Sequential analysis. Wiley, New York, 1947.

Cited By

Guo CCupello DLuk WLevine JWarren ABrookes P(2023)FPGA-Accelerated Causal Discovery with Conditional Independence Test Prioritization2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL60245.2023.00033(182-188)Online publication date: 4-Sep-2023
https://doi.org/10.1109/FPL60245.2023.00033
Kugelstadt TBender JFernández-Fernández JJeske SLöschner FLongva A(2021)Fast Corotated Elastic SPH Solids with Implicit Zero-Energy Mode ControlProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801424:3(1-21)Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3480142
Wang XFujisawa MMikawa M(2021)Visual Simulation of Soil-Structure Destruction with Seepage FlowsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801414:3(1-18)Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3480141
Show More Cited By

Index Terms

Mining complex models from arbitrarily large databases in constant time
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
      2. Logical and relational learning
        Inductive logic learning
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Mining time-changing data streams
KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining

Most statistical and machine-learning algorithms assume that the data is a random sample drawn from a stationary distribution. Unfortunately, most of the large databases available for mining today violate this assumption. They were gathered over months ...
Constant-Time Randomized Parallel String Matching

Given a pattern string of length m for the string-matching problem, we design an algorithm that computes deterministic samples of a sufficiently long substring of the pattern in constant time. This problem used to be the bottleneck in the pattern ...
Predecessor queries in constant time?
ESA'05: Proceedings of the 13th annual European conference on Algorithms

In this paper we design a new static data structure for batched predecessor queries. In particular, our data structure supports $O(\sqrt{{\rm log}n})$ queries in O(1) time per query and requires $O(n^{\epsilon\sqrt{{\rm log}n}})$ space for any ε > 0. ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining

July 2002

719 pages

ISBN:158113567X

DOI:10.1145/775047

Conference Chair:
Osmar R. Zaïane
University of Alberta, Canada
,
General Chair:
Randy Goebel
University of Alberta, Canada
,
Program Chairs:
David Hand
Imperial College, UK
,
Daniel Keim
AT&T
,
Raymond Ng
University of British Columbia, Canada

Copyright © 2002 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 July 2002

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

KDD02

Sponsor:

KDD02: The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

July 23 - 26, 2002

Alberta, Edmonton, Canada

Acceptance Rates

KDD '02 Paper Acceptance Rate 44 of 307 submissions, 14%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Sponsor:
sigkdd
sigkdd

The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 3 - 7, 2025

Toronto , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

33
Total Citations
View Citations
593
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 30 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Guo CCupello DLuk WLevine JWarren ABrookes P(2023)FPGA-Accelerated Causal Discovery with Conditional Independence Test Prioritization2023 33rd International Conference on Field-Programmable Logic and Applications (FPL)10.1109/FPL60245.2023.00033(182-188)Online publication date: 4-Sep-2023
https://doi.org/10.1109/FPL60245.2023.00033
Kugelstadt TBender JFernández-Fernández JJeske SLöschner FLongva A(2021)Fast Corotated Elastic SPH Solids with Implicit Zero-Energy Mode ControlProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801424:3(1-21)Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3480142
Wang XFujisawa MMikawa M(2021)Visual Simulation of Soil-Structure Destruction with Seepage FlowsProceedings of the ACM on Computer Graphics and Interactive Techniques10.1145/34801414:3(1-18)Online publication date: 27-Sep-2021
https://dl.acm.org/doi/10.1145/3480141
Hulten GDomingos P(2016)Mining Decision Trees from StreamsData Stream Management10.1007/978-3-540-28608-0_9(189-208)Online publication date: 12-Jul-2016
https://doi.org/10.1007/978-3-540-28608-0_9
Maske MPrasad P(2015)A real time processing and streaming of wireless network data using Storm2015 International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC)10.1109/ICCPEIC.2015.7259471(0244-0249)Online publication date: Apr-2015
https://doi.org/10.1109/ICCPEIC.2015.7259471
Żarkowski M(2015)Adaptive Online Neural Network for Face Identification with Concept DriftIntelligent Systems'201410.1007/978-3-319-11310-4_61(703-712)Online publication date: 2015
https://doi.org/10.1007/978-3-319-11310-4_61
Singh SWick MMcCallum ATsujii JHenderson JPasca M(2012)Monte Carlo MCMCProceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning10.5555/2390948.2391072(1104-1113)Online publication date: 12-Jul-2012
https://dl.acm.org/doi/10.5555/2390948.2391072
Kindler GRao AO'Donnell RWigderson A(2012)Spherical cubesCommunications of the ACM10.1145/2347736.234775755:10(90-97)Online publication date: 1-Oct-2012
https://dl.acm.org/doi/10.1145/2347736.2347757
Domingos P(2012)A few useful things to know about machine learningCommunications of the ACM10.1145/2347736.234775555:10(78-87)Online publication date: 1-Oct-2012
https://dl.acm.org/doi/10.1145/2347736.2347755
Neumann P(2012)The foresight saga, reduxCommunications of the ACM10.1145/2347736.234774655:10(26-29)Online publication date: 1-Oct-2012
https://dl.acm.org/doi/10.1145/2347736.2347746
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten