[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Improved search strategies and extensions to k-medoids-based clustering algorithms

Published: 01 September 2008 Publication History

Abstract

In this paper two categories of improvements are suggested that can be applied to most k-medoids-based algorithms - conceptual/algorithmic improvements, and implementational improvements. These include the revisiting of the accepted cases for swap comparison and the application of partial distance searching and previous medoid indexing to clustering. Various hybrids are then applied to a number of k-medoids-based algorithms and the method is shown to be generally applicable. Experimental results on both artificial and real datasets demonstrate that when applied to CLARANS the number of distance calculations can be reduced by up to 98%.

References

[1]
Babu, T.R. and Murty, M.N. (2001) 'Comparison of genetic algorithm based prototype selection schemes', Pattern Recognition, Vol. 34, pp. 523-525.
[2]
Baek, S.J., Jeon, B.K. and Sung, K.M. (1997) 'A fast encoding algorithm for vector quantization', IEEE Signal Processing Letters, Vol. 4, No. 2, pp. 325-327.
[3]
Bei, C.D. and Gray, R.M. (1985) 'A improvement of the minimum distortion encoding algorithm for vector quantization', IEEE Transactions on Communication COM, Vol. 33, No. 10, pp. 1132-1133.
[4]
Berkhin, R. (2002) Survey of Clustering Data Mining Techniques, Technical Report, Accrue Software Inc.
[5]
Ceglar, A., Morrall, R. and Roddick, J. (2006) 'Mining medical administrative data - the PKB system', in Ackermann, M., Soares, C. and Guidemann, B. (Eds.): ECML PKDD 2006 Workshop on Practical Data Mining: Applications, Experiences and Challenges, Berlin, pp. 59-66.
[6]
Cetin, A. and Weerackody, V. (1988) 'Design of vector quantizers using simulated annealing', IEEE Transactions on Circuits and Systems, Vol. 35, No. 12, pp. 1550-1555.
[7]
Chen, S.H. and Pan, J.S. (1989) 'Fast search algorithm for vq-based recognition of isolated word', IEE Proc. I, Vol. 136, No. 6, pp. 391-396.
[8]
Chu, S.C., Roddick, J.F. and Pan, J.S. (2001) 'A comparative study and extensions to k-medoids algorithms', in Li, D. (Ed.): Fifth International Conference on Optimization: Techniques and Applications (ICOTA 2001), Vol. 4, Hong Kong, pp. 1708-1717.
[9]
Chu, S.C., Roddick, J.F. and Pan, J.S. (2002) 'An efficient k-medoids-based algorithm using previous medoid index, triangular inequality elimination criteria and partial distance search', in Kambayashi, Y., Winiwarter, W. and Arikawa, M. (Eds.): 4th International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2002), Springer, Aix-en-Provence, France, pp. 63-72.
[10]
Enright, A.J. and Ouzounis, C.A. (2000) 'Generage: a robust algorithm for sequence clustering and domain detection', Bioinformatics, Vol. 16, No. 5, pp. 451-457.
[11]
Ester, M., Kriegel, H-P., Sander, J. and Xu, X. (1996) 'A density-based algorithm for discovering clusters in large spatial databases with noise', in Simoudis, E., Han, J. and Fayyad, U. (Eds.): Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, Portland, Oregon, pp. 226-231.
[12]
Estivill-Castro, V. and Lee, I. (2000) 'AUTOCLUST: automatic clustering via boundary extraction for massive point-data sets', Fifth International Conference on Geocomputation, Springer, pp. 23-25.
[13]
Fisher, D., Xu, L., Carnes, J., Reich, Y., Fenves, S., Chen, J., Shiavi, R., Biswas, G. and Weinberg, J. (1993) 'Applying ai clustering to engineering tasks', IEEE Intelligent Systems, Vol. 8, No. 6, pp. 51-60.
[14]
Gamal, A., Hemachandra, L., Shperling, I. and Wei, V. (1987) 'Using simulated annealing to design good codes', IEEE Transactions on Information Theory, Vol. 33, No. 1, pp. 116-123.
[15]
Ganti, V., Gehrke, J. and Ramakrishnan, R. (1999) 'CACTUS - clustering categorical data using summaries', International Conference on Knowledge Discovery and Data Mining, San Diego, USA, pp. 73-83.
[16]
Gersho, A. and Gray, R.M. (1992) Vector Quantization and Signal Compression, Kluwer, Boston, MA.
[17]
Greene, D., Tsymbal, A., Bolshakova, N. and Cunningham, P. (2004) 'Ensemble clustering in medical diagnostics', in Long, R., Antani, S., Lee, D., Nutter, B. and Zhang, M. (Eds.): 17th IEEE Symposium on Computer-Based Medical Systems, CBMS 2004, Bethesda, Maryland, pp. 576-581.
[18]
Guan, L. and Kamel, M. (1992) 'Equal-average hyperplane partitioning method for vector quantization of image data', Pattern Recognition Letters, Vol. 13, No. 10, pp. 693-699.
[19]
Guha, S., Rastogi, R. and Shim, K. (1998) 'Cure: an efficient clustering algorithm for large databases', ACM SIGMOD International Conference on the Management of Data, Seattle, WA, USA, pp. 73-84.
[20]
Han, J., Kamber, M. and Tung, A.K.H. (2001) 'Spatial clustering methods in data mining: a survey', Geographic Data Mining and Knowledge Discovery, Taylor and Francis, London, Research Monographs in Geographic Information Sytems.
[21]
Huang, H.C., Pan, J.S., Lu, Z.M., Sun, S.H. and Hang, H.M. (2001) 'Vector quantization based on genetic simulated annealing', Signal Processing, Vol. 81, No. 7, pp. 1513-1523.
[22]
Jain, A.K., Murty, M. and Flynn, P. (1999) 'Data clustering: a review', ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323.
[23]
Jolion, J., Meer, P. and Bataouche, S. (1991) 'Robust clustering with applications in computer vision', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 13, No. 8, pp. 791-802.
[24]
Karypis, G., Han, E-H. and Kumar, V. (1999) 'CHAMELEON: a hierarchical clustering algorithm using dynamic modeling', Computer, Vol. 32, pp. 32-68.
[25]
Kaufman, L. and Rousseeuw, P. (1990) Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley and Sons, New York.
[26]
Kirkpatrick, S., Gelatt, J.C.D. and Vecchi, M.P. (1983) 'Optimization by simulated annealing', Science, Vol. 220, No. 4598, pp. 671-680.
[27]
Krishnapuram, R., Joshi, A. and Yi, L. (1999) 'A fuzzy relative of the k-medoids algorithm with application to web document and snippet clustering', IEEE International Fuzzy Systems Conference, Seoul, Korea, pp. 1281-1286.
[28]
Lecompte, D., Kaufman, L. and Rousseeuw, P. (1986) 'Hierarchical cluster analysis of emotional concerns and personality characteristics in a freshman population', Acta Psychiatrica Belgica, Vol. 86, pp. 324-333.
[29]
Lee, C.H. and Chen, L.H. (1994) 'Fast closest codeword search algorithm for vector quantization', IEE Proc. Vision Image and Signal Processing, Vol. 141, No. 3, pp. 143-148.
[30]
Lucasius, C.B., Dane, A.D. and Kateman, G. (1993) 'On k-medoid clustering of large data sets with the aid of a genetic algorithm: background, feasibility and comparison', Analytica Chimica Acta, pp. 647-669.
[31]
MacQueen, J. (1967) 'Some methods for classification and analysis of multivariate observations', 5th Berkeley Symposium on Mathematics, Statistics and Probability, Vol. 1, pp. 281-296.
[32]
Markl, V., Ramsak, F. and Bayer, R. (1999) 'Improving OLAP performance by multidimensional hierarchical clustering', Proc. IDEAS Conference, Montreal, Canada, pp. 165-177.
[33]
Mobasher, B., Cooley, R. and Srivastava, J. (1999) 'Creating adaptive web sites through usage-based clustering of urls', Knowledge and Data Engineering Workshop, Chicago, IL, USA, pp. 19-25.
[34]
Ng, R. (1996) 'Spatial data mining: Discovering knowledge of clusters from maps', ACM SIGMOD Workshop on Research Issue on Data Mining and Knowledge Discovery, Montreal, Canada.
[35]
Ng, R. and Han, J. (1994) 'Efficient and effective clustering methods for spatial data mining', in Bocca, J.B., Jarke, M. and Zaniolo, C. (Eds.): Twentieth International Conference on Very Large Data Bases, Morgan Kaufmann, Santiago, Chile, pp. 144-155.
[36]
Ng, R. and Han, J. (2002) 'CLARANS: a method for clustering objects for spatical data mining', IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 5, pp. 1003-1016.
[37]
Pan, J., McInnes, F. and Jack, M. (1996a) 'Bound for minkowski metric or quadratic metric applied to vq codeword search', IEE Proceedings on Vision, Image and Signal Processing, Vol. 143, No. 1, pp. 67-71.
[38]
Pan, J.S., Lu, Z.M. and Sun, S.H. (2000) 'A fast codeword search algorithm for image coding based on mean-variance pyramids of codewords', IEE Electronics Letters, Vol. 36, No. 3, pp. 210-211.
[39]
Pan, J.S., McInnes, F.R. and Jack, M.A. (1996b) 'Fast clustering algorithm for vector quantization', Pattern Recognition, Vol. 29, No. 3, pp. 511-518.
[40]
Pan, J.S., Wang, J., Fang, H.L. and Chen, C. (1998) 'A modifed tabu search approach for texture segmentation using 2 - d non-separable wavelet frames', 10th International Conference on Tools with Artificial Intelligence, pp. 474-481.
[41]
Sander, J., Ester, M., Kriegel, H-P. and Xu, X. (1998) 'Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications', Data Mining and Knowledge Discovery, Vol. 2, No. 2, pp. 169-194.
[42]
Sato, M., Sato, Y. and Jain, L. (1997) Fuzzy Clustering Models and Applications, Physica-Verlag, Germany.
[43]
Soleymani, M.R. and Morgera, S.D. (1987) 'A heigh-speed algorithm for vector quantization', IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1946-1948.
[44]
Vecchi, M.P. and Kirkpatrick, S. (1983) 'Global wiring by simulated annealing', IEEE Transactions on Computer-Aided Design, Vol. 2, No. 4, pp. 215-222.
[45]
Vidal, E. (1986) 'An algorithm for finding nearest neighbours in (approximately) constant average time', Pattern Recognition Letters, Vol. 4, pp. 145-157.
[46]
von Luxburg, U. and Ben-David, S. (2005) 'Towards a statistical theory of clustering', PASCAL Workshop on Statistics and Optimization of Clustering, London, UK.
[47]
Wang, X. and Hamilton, H. (2003) 'Dbrs: a density-based spatial clustering method with random sampling', 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2003), Vol. 2637 of LNCS, Springer, Seoul, Korea, pp. 563-575.
[48]
Xu, R. (2005) 'Survey of clustering algorithms', IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 645-678.
[49]
Zamir, O., Etzion, O., Mandani, O. and Karp, R. (1997) 'Fast and intuitive clustering of web documents', in Heckerman, D., Mannila, H., Pregibon, D. and Uthurusamy, R. (Eds.): Third International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, California, Newport Beach, CA, USA, pp. 287-290.
[50]
Zhang, T., Ramakrishnan, R. and Livny, M. (1996) 'BIRCH: an efficient clustering method for very large databases', ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, pp. 103-114.

Cited By

View all
  • (2014)Open issues for partitioning clustering methodsWiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery10.1002/widm.11274:3(161-177)Online publication date: 1-May-2014

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image International Journal of Business Intelligence and Data Mining
International Journal of Business Intelligence and Data Mining  Volume 3, Issue 2
September 2008
111 pages
ISSN:1743-8195
EISSN:1743-8187
Issue’s Table of Contents

Publisher

Inderscience Publishers

Geneva 15, Switzerland

Publication History

Published: 01 September 2008

Author Tags

  1. CLARA
  2. CLARANS
  3. CLASA
  4. PDS
  5. PMI
  6. TIE
  7. clustering large applications
  8. partial distance search
  9. previous medoid index
  10. randomised search
  11. search strategies
  12. simulated annealing
  13. triangular inequality elimination

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)0
  • Downloads (Last 6 weeks)0
Reflects downloads up to 06 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2014)Open issues for partitioning clustering methodsWiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery10.1002/widm.11274:3(161-177)Online publication date: 1-May-2014

View Options

View options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media