[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.5555/779232.779247guidebooksArticle/Chapter ViewAbstractPublication PagesBookacm-pubtype
chapter

Clustering in massive data sets

January 2002
Pages 501 - 543
Published: 01 January 2002 Publication History

Abstract

We review the time and storage costs of search and clustering algorithms. We exemplify these, based on case-studies in astronomy, information retrieval, visual user interfaces, chemical databases, and other areas. Theoretical results developed as far back as the 1960s still very often remain topical. More recent work is also covered in this article. This includes a solution for the statistical question of how many clusters there are in a dataset. We also look at one line of inquiry in the use of clustering for human-computer user interfaces. Finally, the visualization of data leads to the consideration of data arrays as images, and we speculate on future results to be expected here.

References

[1]
D. Allard and C. Fraley. Non-parametric maximum likelihood estimation of features in spatial point processes using Voronoi tessellation. Journal of the American Statistical Association, 92:1485-1493, 1997.]]
[2]
P. Arabie and L. J. Hubert. An overview of combinatorial data analysis, pages 5-63. In, Arabie et al. (1996), 1996.]]
[3]
P. Arabie, L. J. Hubert, and G. De Soete, editors. Clustering and Classification . Singapore: World Scientific, 1996.]]
[4]
S. Banerjee and A. Rosenfeld. Model-based cluster analysis. Pattern Recognition, 26:963-974, 1993.]]
[5]
J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803-821, 1993.]]
[6]
K. P. Bennett, U. Fayyad, and D. Geiger. Density-based indexing for approximate nearest neighbor queries. Technical report, Microsoft, 1999. Microsoft Research Technical Report MSR-TR-98-58.]]
[7]
J. L. Bentley and J. H. Friedman. Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Transactions on Computers, C-27:97-105, 1978.]]
[8]
J. L. Bentley, B. W. Weide, and A. C. Yao. Optimal expected time algorithms for closest point problems. ACM Transactions on Mathematical Software, 6:563-580, 1980.]]
[9]
M. W. Berry, Z. Drmac, and E. R. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41:335-362, 1999.]]
[10]
M. W. Berry, B. Hendrickson, and P. Raghavan. Sparse matrix reordering schemes for browsing hypertext. In J. Renegar, M. Shub, and S. Smale, editors, Lectures in Applied Mathematics (LAM) Vol. 32: The Mathematics of Numerical Analysis, pages 99-123. American Mathematical Society, 1996.]]
[11]
K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT), Jerusalem, Israel, 1999. in press.]]
[12]
A. Borodin, R. Ostrovsky, and Y. Rabani. Subquadratic approximation algorithms for clustering problems in high dimensional spaces. In Proc. 31st ACM Symposium on Theory of Computing (STOC-99), 1999.]]
[13]
A. J. Broder. Strategies for efficient incremental nearest neighbor search. Pattern Recognition, 23:171-178, 1990.]]
[14]
A. Z. Broder. On the resemblance and containment of documents, pages 21-29. IEEE Computer Society, 1998.]]
[15]
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proc. Sixth International World Wide Web Conference, pages 391-404, 1997.]]
[16]
M. Bruynooghe. Méthodes nouvelles en classification automatique des données taxinomiques nombreuses. Statistique et Analyse des Données, (3):24-42, 1977.]]
[17]
W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16:230-236, 1973.]]
[18]
S. D. Byers and A. E. Raftery. Nearest neighbor clutter removal for estimating features in spatial point processes. Journal of the American Statistical Association, 93:577-584, 1998.]]
[19]
J. G. Campbell, C. Fraley, D. Stanford, F. Murtagh, and A. E. Raftery. Model-based methods for textile fault detection. International Journal of Imaging Science and Technology, 10:339-346, 1999.]]
[20]
Cartia. Mapping the information landscape, client-server software system, 1999. Caria, Inc.]]
[21]
G. Celeux and G. Covaert. Gaussian parsimonious clustering models. Pattern Recognition, 28:781-793, 1995.]]
[22]
D. Cheriton and D. E. Tarjan. Finding minimum spanning trees. SIAM Journal on Computing, 5:724-742, 1976.]]
[23]
K. W. Church and J. I. Helfman. Dotplot: a program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics, 2:153-174, 1993.]]
[24]
W. B. Croft. Clustering large files of documents using the single-link method. Journal of the American Society for Information Science, 28:341-344, 1977.]]
[25]
C. Darken, J. Chang, and J. Moody. Learning rate schedules for faster stochastic gradient search. In Neural Networks for Signal Processing 2, Proceedings of the 1992 IEEE Workshop, Piscataway, 1992. IEEE Press.]]
[26]
C. Darken and J. Moody. Note on learning rate schedules for stochastic optimization. In Lippmann, Moody, and Touretzky, editors, Advances in Neural Information Processing Systems 3. Morgan Kaufmann, Palo Alto, 1991.]]
[27]
C. Darken and J. Moody. Towards faster stochastic gradient search. In Hanson Moody and Lippmann, editors, Advances in Neural Information Processing Systems 4, San Mateo, 1992. Morgan Kaufman.]]
[28]
B. V. Dasarathy. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. IEEE Computer Society Press, New York, 1991.]]
[29]
A. Dasgupta and A. E. Raftery. Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 93:294-302, 1998.]]
[30]
W. H. E. Day and H. Edelsbrunner. Efficient algorithms for agglomerative hierarchical clustering methods. Journal of Classification, 1:7-24, 1984.]]
[31]
C. de Rham. La classification hiérarchique ascendante selon la méthode des voisins réciproques. Les Cahiers de l'Analyse des Données, V: 135-144, 1980.]]
[32]
D. Defays. An efficient algorithm for a complete link method. Computer Journal, 20:364-366, 1977.]]
[33]
C. Delannoy. Un algorithme rapide de recherche de plus proches voisins. RAIRO Informatique/Computer Science, 14:275-286, 1980.]]
[34]
A. R. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-22, 1977.]]
[35]
A. Dobrzycki, H. Ebeling, K. Glotfelty, P. Freeman, F. Damiani, M. Elvis, and T. Calderwood. Chandra Detect 1.0 User Guide. Chandra X-Ray Center, 1999. Smithsonian Astrophysical Observatory, Version 0.9.]]
[36]
L. B. Doyle. Semantic road maps for literature searchers. Journal of the ACM, 8:553-578, 1961.]]
[37]
C. M. Eastman and S. F. Weiss. Tree structures for high dimensionality nearest neighbor searching. Information Systems, 7:115-122, 1982.]]
[38]
H. Ebeling and G. Wiedenmann. Detecting structure in two dimensions combining voronoi tessellation and percolation. Physical Review E, 47:704-714, 1993.]]
[39]
E. Forgy. Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics, 21:768, 1965.]]
[40]
C. Fraley. Algorithms for model-based Gaussian hierarchical clustering. SIAM Journal of Scientific Computing, 20:270-281, 1999.]]
[41]
C. Fraley and A. E. Raftery. How many clusters? which clustering method? answers via model-based cluster analysis. The Computer Journal, 41:578-588, 1999.]]
[42]
J. H. Friedman, F. Baskett, and L. J. Shustek. An algorithm for finding nearest neighbors. IEEE Transactions on Computers, C-24:1000-1006, 1975.]]
[43]
J. H. Friedman, J. L. Bentley, and R. A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3:209-226, 1977.]]
[44]
B. Fritzke. Some competitive learning methods, 1997.]]
[45]
K. Fukunaga and P. M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, C-24:750-753, 1975.]]
[46]
V. J. Gillet, D. J. Wild, P. Willett, and J. Bradshaw. Similarity and dissimilarity methods for processing chemical structure databases. The Computer Journal, 41:547-558, 1998.]]
[47]
A. D. Gordon. Classification. Champman and Hall, 2 edition, 1999.]]
[48]
A. Griffiths, L. A. Robinson, and P. Willett. Hierarchic agglomerative clustering methods for automatic document classification. Journal of Documentation, 40:175-205, 1984.]]
[49]
D. Guillaume and F. Murtagh. Clustering of XML documents. Computer Physics Communications, 1999. submitted.]]
[50]
M. E. Hodgson. Reducing the computational requirements of the minimum-distance classifier. Remote Sensing of Environment, 25:117-128, 1988.]]
[51]
E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. London: Pitman, 1979.]]
[52]
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Englewood Cliffs: Prentice-Hall, 1988.]]
[53]
G. Jammal and A. Bijaoui. Multiscale image restoration for photon imaging systems. In SPIE Conference on Signal and Image Processing: Wavelet Applications in Signal and Image Processing, VII, July 1999.]]
[54]
J. Juan. Programme de classification hiérarchique par l'algorithme de la recherche en chaîne des voisins réciproques. Les Cahiers de l'Analyse des Données, VII:219-225, 1982.]]
[55]
R. E. Kass and A. E. Raftery. Bayes factors. Journal of the American Statistical Association, 90:773-795, 1995.]]
[56]
J. Kittler. A method for determining k-nearest neighbors. Kybernetes, 7:313-315, 1978.]]
[57]
E. D. Kolaczyk. Nonparametric estimation of gamma-ray burst intensities using haar wavelets. Astrophysical Journal, 483:340-349, 1997.]]
[58]
E. Kushilevitz, R. Ostrovsky, and Y. Rabani. Efficient search for approximate nearest neighbors in high-dimensional spaces. In Proc. of 30th ACM Symposium on Theory of Computing (STOC-30), 1998.]]
[59]
P. Lloyd. Least squares quantization in pcm. IEEE Transactions on Information Theory, 1982. Technical note, Bell Laboratories, 1957.]]
[60]
J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Berkeley, 1976. University of California Press.]]
[61]
R. B. Marimont and M. B. Shapiro. Nearest neighbor searches and the curse of dimensionality. Journal of the Institute of Mathematics and Its Applications, 24:59-70, 1979.]]
[62]
L. Micó, J. Oncina, and E. Vidal. An algorithm for finding nearest neighbors in constant average time with a linear space complexity. In The 11th International Conference on Pattern Recognition, volume II, pages 557-560, New York, 1992. IEEE Computer Science Press.]]
[63]
A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. Neural Information Processing Systems, December 1998.]]
[64]
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.]]
[65]
S. Mukherjee, E. D. Feigelson, G. J. Babu, F. Murtagh, C. Fraley, and A. Raftery. Three types of gamma-ray bursts. The Astrophysical Journal, 508:314-327, 1998.]]
[66]
F. Murtagh. A very fast, exact nearest neighbor algorithm for use in information retrieval. Information Technology, 1:275-283, 1982.]]
[67]
F. Murtagh. Expected time complexity results for hierarchic clustering algorithms which use cluster centers. Information Processing Letters, 16:237-241, 1983.]]
[68]
F. Murtagh. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics Quarterly, 1:101-113, 1984.]]
[69]
F. Murtagh. Multidimensional Clustering Algorithms. Würzburg: Physica-Verlag, 1985.]]
[70]
F. Murtagh. Comments on 'parallel algorithms for hierarchical clustering and cluster validity'. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14:1056-1057, 1992.]]
[71]
F. Murtagh. Multivariate methods for data analysis. In A. Sandqvist and T.P. Ray, editors, Central Activity in Galaxies: From Observational Data to Astrophysical Diagnostics, pages 209-235, Berlin, 1993a. Springer-Verlag.]]
[72]
F. Murtagh. Search algorithms for numeric and quantitative data. In A. Heck and F. Murtagh, editors, Intelligent Information Retrieval: The Case of Astronomy and Related Space Sciences, pages 49-80, Dordrecht, 1993b. Kluwer Academic.]]
[73]
F. Murtagh. Foreword to the special issue on clustering and classification. The Computer Journal, 41:517, 1998.]]
[74]
F. Murtagh and A. Heck. Multivariate Data Analysis. Dordrecht: Kluwer Academic, 1987.]]
[75]
F. Murtagh and M. H. Pajares. The Kohonen self-organizing map method: an assessment. Journal of Classification, 12:165-190, 1995.]]
[76]
F. Murtagh and A.E. Raftery. Fitting straight lines to point patterns. Pattern Recognition, 17:479-483, 1984.]]
[77]
F. Murtagh and J. L. Starck. Pattern clustering based on noise modeling in wavelet space. Pattern Recognition, 31:847-855, 1998.]]
[78]
F. Murtagh, J. L. Starck, and M. Berry. Overcoming the curse of dimensionality in clustering by means of the wavelet transform. The Computer Journal, 43:107-120, 2000.]]
[79]
R. Neal and G. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. Jordan, editor, Learning in Graphical Models, pages 355-371, Dordrecht, 1998. Kluwer Academic Publisher.]]
[80]
H. Niemann and R. Goppert. An efficient branch-and-bound nearest neighbor classifier. Pattern Recognition Letters, 7:67-72, 1988.]]
[81]
B. K. Parsi and L. N. Kanal. An improved branch and bound algorithm for computing k-nearest neighbors. Pattern Recognition Letters, 3: 7-12, 1985.]]
[82]
D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proceedings KDD-99, Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, August 1999.]]
[83]
S. A. Perry and P. Willett. A review of the use of inverted files for best match searching in information retrieval systems. Journal of Information Science, 6:59-66, 1983.]]
[84]
P. Poinçot, S. Lesteven, and F. Murtagh. A spatial user interface to the astronomical literature. Astronomy and Astrophysics Supplement, 130:183-191, 1998.]]
[85]
P. Poinçot, S. Lesteven, and F. Murtagh. Maps of information spaces: assessments from astronomy. Journal of the American Society for Information Science, 1999. submitted.]]
[86]
V. Ramasubramanian and K. K. Paliwal. An efficient approximation-algorithm for fast nearest-neighbor search based on a spherical distance coordinate formulation. Pattern Recognition Letters, 13:471-480, 1992.]]
[87]
M. Richetin, G. Rives, and M. Naranjo. Algorithme rapide pour la détérmination des k plus proches voisins. RAIRO Informatique/Computer Science, 14:369-378, 1980.]]
[88]
F. J. Rohlf. Algorithm 76: Hierarchical clustering using the minimum spanning tree. The Computer Journal, 16:93-95, 1973.]]
[89]
F. J. Rohlf. A probabilistic minimum spanning tree algorithm. Information Processing Letters, 7:44-48, 1978.]]
[90]
F. J. Rohlf. Single link clustering algorithms. In P. R. Krishnaiah and L.N. Kanal, editors, Handbook of Statistics, volume 2, pages 267-284, Amsterdam, 1982. North-Holland.]]
[91]
E. V. Ruiz. An algorithm for finding nearest neighbors in (approximately) constant average time. Pattern Recognition Letters, 4:145-157, 1986.]]
[92]
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval . New York: McGraw-Hill, 1983.]]
[93]
M. Sato and S. Ishii. Reinforcement learning based on on-line EM algorithm. In M.S. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances m Neural Information Processing Systems 11, pages 1052-1058, Cambridge, 1999. MIT Press.]]
[94]
T. Schreiber. Efficient search for nearest neighbors. In A.S. Weigend and N.A Gershenfeld, editors, Predicting the Future and Understanding the Past: A Comparison of Approaches, New York, 1993. Addison-Wesley.]]
[95]
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6:461-464, 1978.]]
[96]
SDSS. Sloan digital sky survey, 1999.]]
[97]
M. Shapiro. The choice of reference points in best-match file searching. Communications of the ACM, 20:339-343, 1977.]]
[98]
R. Sibson. Slink: an optimally efficient algorithm for the single link cluster method. The Computer Journal, 16:30-34, 1973.]]
[99]
A. F. Smeaton and C. J. van Rijsbergen. The nearest neighbor problem in information retrieval: an algorithm using upperbounds. ACM SIGIR Forum, 16:83-87, 1981.]]
[100]
P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy. San Francisco: W. H. Freeman, 1973.]]
[101]
H. Späth. Cluster Dissection and Analysis: Theory, Fortran Programs, Examples. Chichester: Ellis Horwood, 1985.]]
[102]
J. L. Starck, F. Murtagh, and A. Bijaoui. Image and Data Analysis: The Multiscale Approach. New York: Cambridge University Press, 1998.]]
[103]
R. E. Tarjan. An improved algorithm for hierarchical clustering using strong components. Information Processing Letters, 17:37-41, 1983.]]
[104]
B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Technical report, Microsoft, 1999. Microsoft Research Technical Report MST-TR-99-31.]]
[105]
S. F. Weiss. A probabilistic algorithm for nearest neighbor searching. In R. N. Oddy and et al., editors, Information Retrieval Research, pages 325-333, London, 1981. Butterworths.]]
[106]
H. D. White and K. W. McCain. Visualization of literatures. Annual Review of Information Science and Technology (ARIST), 32:99-168, 1997.]]
[107]
P. Willett. Efficiency of hierarchic agglomerative clustering using the icl distributed array processor. Journal of Documentation, 45:1-45, 1989.]]
[108]
A. C. Yao. An o(|e| log log |υ|) algorithm for finding minimum spanning trees. Information Processing Letters, 4:21-23, 1975.]]
[109]
T. P. Yunck. A technique to identify nearest neighbors. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6:678-683, 1976.]]
[110]
C. T. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20:68-86, 1971.]]
[111]
G. Zheng, J. L. Starck, J.G. Campbell, and F. Murtagh. Multiscale transforms for filtering financial data streams. Journal of Computational Intelligence in Finance, 7, 1999.]]

Cited By

View all
  • (2014)Descriptive and prescriptive data cleaningProceedings of the 2014 ACM SIGMOD International Conference on Management of Data10.1145/2588555.2610520(445-456)Online publication date: 18-Jun-2014
  • (2009)Parallel Clustering Algorithm for Large Data Sets with Applications in BioinformaticsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2007.702726:2(344-352)Online publication date: 1-Apr-2009
  • (2008)Modular neuroevolution for multilegged locomotionProceedings of the 10th annual conference on Genetic and evolutionary computation10.1145/1389095.1389136(265-272)Online publication date: 13-Jul-2008
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Guide books
Handbook of massive data sets
January 2002
1252 pages
ISBN:1402004893

Publisher

Kluwer Academic Publishers

United States

Publication History

Published: 01 January 2002

Author Tags

  1. clustering
  2. massive data sets

Qualifiers

  • Chapter

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2014)Descriptive and prescriptive data cleaningProceedings of the 2014 ACM SIGMOD International Conference on Management of Data10.1145/2588555.2610520(445-456)Online publication date: 18-Jun-2014
  • (2009)Parallel Clustering Algorithm for Large Data Sets with Applications in BioinformaticsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2007.702726:2(344-352)Online publication date: 1-Apr-2009
  • (2008)Modular neuroevolution for multilegged locomotionProceedings of the 10th annual conference on Genetic and evolutionary computation10.1145/1389095.1389136(265-272)Online publication date: 13-Jul-2008
  • (2006)Information preserving multi-objective feature selection for unsupervised learningProceedings of the 8th annual conference on Genetic and evolutionary computation10.1145/1143997.1144248(1545-1552)Online publication date: 8-Jul-2006
  • (2001)Bit-sliced index arithmeticACM SIGMOD Record10.1145/376284.37566930:2(47-57)Online publication date: 1-May-2001
  • (2001)Bit-sliced index arithmeticProceedings of the 2001 ACM SIGMOD international conference on Management of data10.1145/375663.375669(47-57)Online publication date: 1-May-2001

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media