Clustering | Guide books | ACM Digital Library

October 2009

Authors: Rui Xu, Don Wunsch

Publisher: Wiley-IEEE Press

ISBN: 978-0-470-27680-8

Published: 24 October 2009

Pages: 358

Abstract

This is the first book to take a truly comprehensive look at clustering. It begins with an introduction to cluster analysis and goes on to explore: proximity measures; hierarchical clustering; partition clustering; neural network-based clustering; kernel-based clustering; sequential data clustering; large-scale data clustering; data visualization and high-dimensional data clustering; and cluster validation. The authors assume no previous background in clustering, and their generous inclusion of examples and references helps make the subject matter comprehensible for readers of varying levels and backgrounds.

Contributors

Rui Xu
GE Global Research
- Publication Years: 2005 - 2012
- Publication counts: 13
- Citation count: 1,251
- Available for Download: 2
- Downloads (cumulative): 1,552
- Downloads (12 months): 7
- Downloads (6 weeks): 0
- Average Downloads per Article: 776
- Average Citation per Article: 96

Donald Coolidge Wunsch
Missouri University of Science and Technology
- Publication Years: 1991 - 2024
- Publication counts: 80
- Citation count: 1,741
- Available for Download: 6
- Downloads (cumulative): 1,883
- Downloads (12 months): 21
- Downloads (6 weeks): 3
- Average Downloads per Article: 314
- Average Citation per Article: 22

Index Terms

Clustering
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Theory of computation
  1. Design and analysis of algorithms

Reviews

Reviewer: Aris Gkoulalas-Divanis

Clustering algorithms are important to a wide spectrum of scientific disciplines, ranging from computer science (CS) and engineering to medical, earth, and social sciences. Applications of clustering are numerous: speech recognition, organization of document collections, disease diagnosis and treatment, star and planet classification, analysis of social networks, and criminal physiology, to name just a few. As a result, it is not surprising that more than 12,000 scientific papers related to clustering have been published since 1996. This book provides a comprehensive and thorough presentation of this research area, describing some of the most important clustering algorithms proposed in the research literature.

The book is organized into 11 chapters that highlight the various aspects of the clustering process. Chapter 1 is a brief introduction that discusses cluster analysis in general, defines the notion of clusters, and presents some interesting clustering applications. In the second chapter, Xu and Wunsch shed light on the different proximity measures that have been established to quantify the similarity between data records. The definition of similarity between two data records is one of the most important factors in the clustering process, as it provides the basis for the identification of high-quality clusters. After discussing the basic properties that a proximity measure must satisfy, the authors present a collection of measures that are suitable for continuous, discrete, and mixed variables.

Chapters 3 to 9 are dedicated to specific clustering algorithms, technologies, and theories that have been proposed to facilitate clustering in different data domains and application environments. The last section of each of these chapters is devoted to the presentation of real-world applications, where the corresponding approaches are commonly adopted.

In particular, chapter 3 collects clustering algorithms that organize the data records into a hierarchical structure; each level of this structure corresponds to a clustering solution with a different number of clusters. The clustering hierarchy can be built either in a bottom-up (agglomerative algorithm) or in a top-down (divisive algorithm) fashion. After presenting the classical hierarchical clustering schemes, the authors concentrate on a set of recent hierarchical approaches that are more robust to noise and outliers. Chapter 4 presents a set of partitional clustering solutions, where the data records are directly partitioned into a prespecified number of clusters. Xu and Wunsch present in detail the popular k-means algorithm and its advancements, as well as some graph theory, fuzzy, and search technique clustering methodologies. The use of neural networks in clustering is highlighted in chapter 5, where the authors discuss existing clustering approaches that are suitable for either hard or soft competitive learning. Chapter 6 is dedicated to kernel-based clustering solutions that map a set of nonlinearly separable patterns into a higher dimensional feature space, where they are linearly separable. After presenting the theory behind kernel-based clustering approaches, Xu and Wunsch discuss nonlinear principal component analysis, squared error-based clustering, and support vector kernel-based clustering.

Chapters 7 to 9 are devoted to more recent applications involving the clustering of sequential, large-scale, or high-dimensional data. Specifically, chapter 7 focuses on the clustering of sequential data, commonly encountered in the medical sciences. In this chapter, the authors present formulas to quantify sequence similarity, as well as three clustering algorithms that are suitable for sequential data. Following this, chapter 8 deals with the clustering of large-scale data, where the scalability of the clustering algorithm is a top priority. The existing methodologies are divided into six categories: random sampling, data condensation, density-based, grid-based, divide and conquer, and incremental learning. Then, in chapter 9, the authors present a set of methods for the clustering of high-dimensional data. As part of this chapter, both linear and nonlinear projection algorithms are investigated, along with projected and subspace clustering approaches. The role of data visualization is also emphasized.

Chapter 10 presents metrics for the validation of the clustering results. The authors divide the existing metrics into three categories: external indices, internal indices, and relative indices. Finally, the last chapter of the book summarizes research challenges and presents trends in the area.

The book targets researchers and graduate students in the clustering field. However, the book is easy to follow even for nonexperts, as it does not require significant background knowledge. On the positive side, the book covers a wide spectrum of real-world applications and provides rich references for further reading. On the negative side, although the book presents the workings of the algorithms with a reasonable degree of detail, it provides no specific examples of their operation. Furthermore, for some clustering algorithms, the authors do not discuss their bias toward aspects such as the shape of the identified clusters or their robustness to outliers.

Online Computing Reviews Service
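As an illustration of the partitional approach the review attributes to chapter 4, where records are partitioned directly into a prespecified number of clusters, the following is a minimal sketch of the standard k-means iteration (Lloyd's algorithm). It is not taken from the book; the function names `k_means`, `squared_distance`, and `mean_point`, the random initialization, and the choice of squared Euclidean distance as the proximity measure are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Partition `points` into `k` clusters (hypothetical helper, Lloyd's algorithm).

    Returns (labels, centroids), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct records chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each record to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # Centroids stopped moving, so assignments are stable.
        centroids = new_centroids
    return labels, centroids
```

Calling `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` should group the first three and the last three records separately. In general the result depends on the initial centroids, and the mean-based update favors compact, roughly spherical clusters and is sensitive to outliers, which is the kind of bias the reviewer says the book does not always discuss.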

Reviewer: Raphael M. Malyankar

The classification of tangible and intangible entities according to measurements or estimations of their characteristics is an old problem in science. Clustering techniques are used in many fields, including social sciences, natural sciences ranging from astrophysics to biology, information sciences ranging from machine learning to information retrieval, and commercial market research. Accordingly, there are many varieties of approaches. The corpus of clustering literature is very large and sometimes confusing, by virtue of the number of publications and diversity of descriptions in the science, mathematics, and computer science literature. The importance of this problem and the need to improve performance mean that research into clustering algorithms continues to be valuable. This book attempts to describe the basic concepts and algorithms, as well as the state of the art in this field.

The book begins with two introductory chapters: the first contains an introduction to the fundamental concepts of clustering and examples of its use; the second presents considerations and types of distance metrics for different kinds of data. Chapters 3 and 4 describe well-known algorithms for hierarchical clustering and partitional clustering, supplemented by descriptions of more recent developments in these types of algorithms. Approaches based on neural networks (chapter 5) and kernel methods (chapter 6) are covered next. Sequential data, ranging from time series to DNA sequences, presents its own special problems and requirements for clustering algorithms; chapter 7 describes the issues and solutions for this category of data, including a discussion of sequence alignment. Large data volumes present their own problems for classic and recent algorithms, caused by performance requirements; chapter 8 describes several adaptations and techniques for working with large collections of data. Algorithms for high-dimensional data are covered in chapter 9.

Chapters 10 and 11 round out the book with a description of criteria for cluster validation (deciding whether the application of clustering methods detects an actually existing structure) and a discussion of general methods, requirements, and tradeoffs in the application of cluster analysis algorithms. Discussions of significant and interesting applications for each family of algorithms are included. Exercises in the theory are included, as are citations for the techniques described.

The exposition is suitable for a graduate-level course or self-study by a professional. It is relatively easy to understand, given its subject matter and mathematical sophistication. Given the available space, the content and discussion are sometimes necessarily terse; the reader who is looking for details will need to refer to the primary literature cited. A degree of mathematical sophistication and a relevant background in statistics or related areas of engineering or computer science are necessary, preferably including an understanding of the preliminary concepts used in the text, such as neural networks and eigenvectors. The book covers a lot of ground in a relatively small number of pages, and should work well as a learning tool and reference.

Online Computing Reviews Service

Recommendations

Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global Informatization

In this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithms. Our method uses bisect K-means for the divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...

Ant clustering algorithm with K-harmonic means clustering

Clustering is an unsupervised learning procedure in which there is no a priori knowledge of the data distribution. It organizes a set of objects/data into similar groups called clusters, and the objects within one cluster are highly similar and dissimilar with ...

Self-Organizing-Map Based Clustering Using a Local Clustering Validity Index

Classical clustering methods, such as partitioning and hierarchical clustering algorithms, often fail to deliver satisfactory results, given clusters of arbitrary shapes. Motivated by a clustering validity index based on inter-cluster and intra-cluster ...
