[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/233269.233324acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article
Free access

BIRCH: an efficient data clustering method for very large databases

Published: 01 June 1996 Publication History

Abstract

Finding useful patterns in large datasets has attracted considerable interest recently, and one of the most widely studied problems in this area is the identification of clusters, or densely populated regions, in a multi-dimensional dataset. Prior work does not adequately address the problem of large datasets and minimization of I/O costs.This paper presents a data clustering method named BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), and demonstrates that it is especially suitable for very large databases. BIRCH incrementally and dynamically clusters incoming multi-dimensional metric data points to try to produce the best quality clustering with the available resources (i.e., available memory and time constraints). BIRCH can typically find a good clustering with a single scan of the data, and improve the quality further with a few additional scans. BIRCH is also the first clustering algorithm proposed in the database area to handle "noise" (data points that are not part of the underlying pattern) effectively.We evaluate BIRCH's time/space efficiency, data input order sensitivity, and clustering quality through several experiments. We also present a performance comparisons of BIRCH versus CLARANS, a clustering method proposed recently for large datasets, and show that BIRCH is consistently superior.

References

[1]
Peter C, heeselnan, James Kelly, Matthew Self, et al., Auto CIass : A Bayesian Classification System, Proc. of the 5th Int'l Conf. on Machine Learning, Morgan Kaufman, 3un. 1988.
[2]
Richard Duds, and Peter E. Hart, Pattern Classification and Scene Analysis, Wiley, 1973.
[3]
R. Dubes, and A.K. Jain, Clustering Methodologies in Exploratory Data Analysis Advances in C, omputers, Edited by M.C. Yovits, Vol. 19, Academic Press, New York, 1980.
[4]
Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, A Database Interface .for Clustering in Large Spatial Databases, Proc. of 1st {nt'l Conf. on Knowledge Discovery and Data Mining, 1995.
[5]
Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu, Knowledge Discovery in Large Spatial Databases: Focusing Techniques for Efficient Class Identification, Proc. of 4th Int'l Symposium on Large Spatial Databases, Portland, Maine, U.S.A., 1995.
[6]
Douglas H. Fisher, Knowledge Acquisition via Incremental Conceptual Clustering, Machine Learning, 2(2), 1987
[7]
Douglas H. Fisher, Iterative Optimization and Simplification of Hierarchical CIuaterings, Technical Report CS-95-01, Dept. of Computer Science, Vanderbilt l lniversity, Nashville, TN 37235.
[8]
A. Gersho and P~. Gray, Vector quantization and signal compression, Boston, Ms.: Kluwer Academic Publishers, 1992.
[9]
Leonard Kaufman, and Peter J. Rousseeuw, Finding Groups in Data - An Introduction to Cluster Analysis, Wiley Series in Probability and Mathematical Statistics, 1990.
[10]
Michael Lebowitz, Experiments with Incremental Concept Formation : UNIMEM, Machine Learning, 1987.
[11]
R.C.T.Lee, Clustering analysis and its applications, Advances in Information Systems Science, Edited by J .T.Toum, Vol. 8, pp. 169-292, Plenum Press, New York, 1981.
[12]
F. Murtagh, A Survey of _Recent Advances in Hier'archical Clustering Algorithms, The Computer Journal, 1933.
[13]
Raymond T. Ng and Jiawei Hart, Efficient and Effective Clustering Methods for,Spatial Data Mining, Proc. of VLDB, 1994.
[14]
(:lark F. Olson, Parallel Algorithms for Hierarchical Clustering, Technical Report, Computer Science Division, l.lniv, of California at Berkeley, Dec.,1993.
[15]
Tian Zhang, Raghu Ramakrishnan, and Miron Livl~y, BIRCH: An Efficient Data Clustering Method .for Very Large Databases, Technical Report, Computer Sciences Dept., Univ. of Wisconsin-Madison, 1995.

Cited By

View all
  • (2024)Two-Step Clustering for Mineral Prospectivity Mapping: A Case Study from the Northeastern Edge of the Jiaolai Basin, ChinaMinerals10.3390/min1411108914:11(1089)Online publication date: 28-Oct-2024
  • (2024)Enhancing LS-PIE’s Optimal Latent Dimensional Identification: Latent Expansion and Latent CondensationMathematical and Computational Applications10.3390/mca2904006529:4(65)Online publication date: 16-Aug-2024
  • (2024)Machine Learning-Based Structural Health Monitoring Technique for Crack Detection and Localisation Using Bluetooth Strain Gauge Sensor NetworkJournal of Sensor and Actuator Networks10.3390/jsan1306007913:6(79)Online publication date: 23-Nov-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data
June 1996
560 pages
ISBN:0897917944
DOI:10.1145/233269
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 1996

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS96

Acceptance Rates

SIGMOD '96 Paper Acceptance Rate 47 of 290 submissions, 16%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2,418
  • Downloads (Last 6 weeks)270
Reflects downloads up to 16 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Two-Step Clustering for Mineral Prospectivity Mapping: A Case Study from the Northeastern Edge of the Jiaolai Basin, ChinaMinerals10.3390/min1411108914:11(1089)Online publication date: 28-Oct-2024
  • (2024)Enhancing LS-PIE’s Optimal Latent Dimensional Identification: Latent Expansion and Latent CondensationMathematical and Computational Applications10.3390/mca2904006529:4(65)Online publication date: 16-Aug-2024
  • (2024)Machine Learning-Based Structural Health Monitoring Technique for Crack Detection and Localisation Using Bluetooth Strain Gauge Sensor NetworkJournal of Sensor and Actuator Networks10.3390/jsan1306007913:6(79)Online publication date: 23-Nov-2024
  • (2024)Relieving the burden of intensive labeling for stress monitoring in the wild by using semi-supervised learningFrontiers in Psychology10.3389/fpsyg.2023.129351314Online publication date: 5-Jan-2024
  • (2024)The Future of Behavioral Economics: AI Tools in the Digital Space10.15407/akademperiodyka.515.170Online publication date: 2024
  • (2024)IntelliMD: A Hybrid Approach for Local Misbehaviour Detection in Cooperative Intelligent Transport SystemsProceedings of the Sixth Workshop on CPS&IoT Security and Privacy10.1145/3690134.3694817(2-13)Online publication date: 19-Nov-2024
  • (2024)Exploring User Willingness towards Mobile Sensing and Intervention: A Case Study on Mental Health of Undergraduate College StudentsCompanion of the 2024 on ACM International Joint Conference on Pervasive and Ubiquitous Computing10.1145/3675094.3678422(721-728)Online publication date: 5-Oct-2024
  • (2024)BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point DetectionProceedings of the ACM on Software Engineering10.1145/36608051:FSE(2214-2237)Online publication date: 12-Jul-2024
  • (2024)Settling Time vs. Accuracy Tradeoffs for Clustering Big DataProceedings of the ACM on Management of Data10.1145/36549762:3(1-25)Online publication date: 30-May-2024
  • (2024)Efficient High-Quality Clustering for Large Bipartite GraphsProceedings of the ACM on Management of Data10.1145/36392782:1(1-27)Online publication date: 26-Mar-2024
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media