Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Author:

Diab Abuaiadah

ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), Volume 15, Issue 3

Article No.: 17, Pages 1 - 13

https://doi.org/10.1145/2812809

Published: 28 January 2016

Abstract

In this article, I investigate the performance of the bisect K-means clustering algorithm compared to the standard K-means algorithm in the analysis of Arabic documents. The experiments included five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, the bisect K-means clearly outperformed the standard K-means in all settings, by varying margins. For the bisect K-means, the best purity reached 0.927 with the Pearson correlation coefficient function, while for the standard K-means, the best purity reached 0.884 with the Jaccard coefficient function. Removing stop words significantly improved the results of the bisect K-means but produced only minor improvements in the results of the standard K-means. Stemming provided an additional minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence function and the root-based stemmer, where purity deteriorated by more than 10%. These experiments were conducted using a dataset with nine categories, each of which contains 300 documents.
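
As a rough illustration of the two ideas the abstract relies on, the following minimal Python sketch shows a bisecting K-means loop and the purity measure. It is not the article's implementation. The function names (`cosine_distance`, `two_means`, `bisecting_kmeans`, `purity`) are hypothetical, cosine distance stands in for any of the five functions listed above, and both the rule for choosing which cluster to split (the largest one) and the use of a single 2-means trial per split are assumptions. Purity is computed in its usual form: the sum of each cluster's majority-class count divided by the total number of documents.

```python
import math
import random
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0
    return 1.0 - dot / (na * nb)


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def two_means(vectors, rng, iters=20):
    """Split vectors into two groups with plain 2-means.

    Returns two lists of positions into `vectors`; one may be empty.
    """
    centers = rng.sample(vectors, 2)
    groups = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i, v in enumerate(vectors):
            d0 = cosine_distance(v, centers[0])
            d1 = cosine_distance(v, centers[1])
            groups[0 if d0 <= d1 else 1].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split; the caller applies a fallback
        new_centers = [centroid([vectors[i] for i in g]) for g in groups]
        if new_centers == centers:
            break  # assignments have stabilised
        centers = new_centers
    return groups


def bisecting_kmeans(vectors, k, seed=0):
    """Repeatedly bisect the largest cluster until k clusters exist.

    Assumption: the largest cluster is split next; other selection
    rules (for example, lowest internal similarity) are possible.
    """
    rng = random.Random(seed)
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # largest cluster
        if len(target) < 2:
            clusters.append(target)
            break  # nothing left to split
        left, right = two_means([vectors[i] for i in target], rng)
        if not left or not right:
            # Crude fallback (assumption): halve the cluster by position.
            members = left or right
            half = len(members) // 2
            left, right = members[:half], members[half:]
        clusters.append([target[i] for i in left])
        clusters.append([target[i] for i in right])
    return clusters


def purity(clusters, labels):
    """Sum of majority-class counts over clusters, divided by len(labels)."""
    majority_total = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1]
        for c in clusters
        if c
    )
    return majority_total / len(labels)


if __name__ == "__main__":
    # Toy term-weight vectors and category labels (made up for illustration).
    docs = [[1, 0, 0], [0.9, 0.1, 0], [0, 1, 0],
            [0.1, 0.9, 0], [0, 0, 1], [0, 0.1, 0.9]]
    labels = ["a", "a", "b", "b", "c", "c"]
    found = bisecting_kmeans(docs, k=3, seed=0)
    print(found, purity(found, labels))
```

A purity of 1.0 means every cluster contains documents from a single category, so the reported values of 0.927 and 0.884 can be read as the fraction of documents that fall in their cluster's majority category.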

Index Terms

Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents
1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
      2. Information extraction
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Machine learning theory
      1. Unsupervised learning and clustering

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing Volume 15, Issue 3

March 2016

220 pages

ISSN:2375-4699

EISSN:2375-4702

DOI:10.1145/2876004

Editor:
Richard Sproat
Google, Inc., USA

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 January 2016

Accepted: 01 August 2015

Revised: 01 May 2015

Received: 01 December 2014

Published in TALLIP Volume 15, Issue 3

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Total Citations: 20
Total Downloads: 314
Downloads (Last 12 months): 12
Downloads (Last 6 weeks): 3

Reflects downloads up to 20 Jan 2025

Cited By

K. Chen, X. Shi, Z. Zhang, S. Chen, J. Ma, T. Zheng, and L. Alfonso. 2024. Using unsupervised learning to classify inlet water for more stable design of water reuse in industrial parks. Water Science & Technology 89, 7, 1757--1770. Online publication date: 19-Mar-2024.
https://doi.org/10.2166/wst.2024.087

X. Yue, Z. Wang, R. Ishibashi, H. Kaneko, and L. Meng. 2024. An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions. International Journal on Document Analysis and Recognition 27, 4, 583--601. Online publication date: 1-Dec-2024.
https://dl.acm.org/doi/10.1007/s10032-024-00463-0

M. Markatou, O. Kennedy, M. Brachmann, R. Mukhopadhyay, A. Dharia, and A. Talal. 2023. Social determinants of health derived from people with opioid use disorder: Improving data collection, integration and use with cross-domain collaboration and reproducible, data-centric, notebook-style workflows. Frontiers in Medicine 10. Online publication date: 2-Mar-2023.
https://doi.org/10.3389/fmed.2023.1076794

A. Dhar, H. Mukherjee, K. Roy, K. Santosh, and N. Dash. 2023. Hybrid approach for text categorization. Journal of Information Science 49, 3, 762--777. Online publication date: 1-Jun-2023.
https://dl.acm.org/doi/10.1177/01655515211027770

T. Ma, X. Du, X. Wu, Z. Zhou, Y. Zheng, and C. Jin. 2023. Reading Scene Text with Aggregated Temporal Convolutional Encoder. ACM Transactions on Asian and Low-Resource Language Information Processing 22, 11, 1--16. Online publication date: 12-Oct-2023.
https://dl.acm.org/doi/10.1145/3625822

A. Rashidov, A. Akhatov, and F. Nazarov. 2023. The Same Size Distribution of Data Based on Unsupervised Clustering Algorithms. In Advances in Artificial Systems for Logistics Engineering III, 437--447. Online publication date: 16-Jul-2023.
https://doi.org/10.1007/978-3-031-36115-9_40

R. Dounia, K. Yasssine, and E. Noureddine. 2022. Exploring text representation impact on K-means based arabic text documents clustering. In 2022 International Conference on Intelligent Systems and Computer Vision (ISCV), 1--5. Online publication date: 18-May-2022.
https://doi.org/10.1109/ISCV54655.2022.9806067

K. Salman and H. Khafaji. 2022. Arabic Document Clustering: A Survey. In 2022 4th International Conference on Current Research in Engineering and Science Applications (ICCRESA), 59--64. Online publication date: 20-Dec-2022.
https://doi.org/10.1109/ICCRESA57091.2022.10352511

A. Parlina, K. Ramli, and H. Murfi. 2021. Exposing Emerging Trends in Smart Sustainable City Research Using Deep Autoencoders-Based Fuzzy C-Means. Sustainability 13, 5, 2876. Online publication date: 7-Mar-2021.
https://doi.org/10.3390/su13052876

M. Alruily. 2021. Classification of Arabic Tweets: A Review. Electronics 10, 10, 1143. Online publication date: 12-May-2021.
https://doi.org/10.3390/electronics10101143
