
Using Bisect K-Means Clustering Technique in the Analysis of Arabic Documents

Published: 28 January 2016

Abstract

In this article, I investigate the performance of the bisect K-means clustering algorithm compared with the standard K-means algorithm in the analysis of Arabic documents. The experiments covered five commonly used similarity and distance functions (Pearson correlation coefficient, cosine, Jaccard coefficient, Euclidean distance, and averaged Kullback-Leibler divergence) and three leading stemmers. Measured by purity, bisect K-means outperformed standard K-means in all settings, by varying margins. For bisect K-means, the best purity was 0.927, obtained with the Pearson correlation coefficient; for standard K-means, the best purity was 0.884, obtained with the Jaccard coefficient. Removing stop words significantly improved the results of bisect K-means but produced only minor improvements for standard K-means. Stemming provided a further minor improvement in all settings except the combination of the averaged Kullback-Leibler divergence and the root-based stemmer, where purity deteriorated by more than 10%. The experiments were conducted on a dataset of nine categories, each containing 300 documents.
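As a rough illustration of the two ideas the abstract relies on, the sketch below shows a purity score and a bisecting K-means loop in Python with NumPy. It is not the article's implementation: the function names, the rule of always splitting the largest cluster, the farthest-point seeding of each split, and the use of Euclidean distance only are assumptions made here to keep the example short and deterministic.

```python
from collections import Counter

import numpy as np


def purity(cluster_ids, class_labels):
    """Fraction of items that fall in the majority class of their cluster."""
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(class_labels)


def two_means(X, n_iter=20):
    """Split X in two with Euclidean 2-means; returns a boolean mask for one half.

    Seeding (an assumption of this sketch): the first point and the point
    farthest from it, which makes the result deterministic.
    """
    far = np.linalg.norm(X - X[0], axis=1).argmax()
    centers = np.stack([X[0], X[far]]).astype(float)
    mask = np.zeros(len(X), dtype=bool)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        mask = dists[:, 1] < dists[:, 0]
        if mask.all() or not mask.any():
            # Degenerate split (e.g. identical points): fall back to an
            # arbitrary alternating split so both halves are non-empty.
            mask = np.arange(len(X)) % 2 == 1
            break
        centers = np.stack([X[~mask].mean(axis=0), X[mask].mean(axis=0)])
    return mask


def bisecting_kmeans(X, k):
    """Start with one cluster and bisect until k clusters exist."""
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of rows in X")
    labels = np.zeros(len(X), dtype=int)
    for new_id in range(1, k):
        # Assumption: always split the currently largest cluster.
        target = np.bincount(labels).argmax()
        idx = np.flatnonzero(labels == target)
        mask = two_means(X[idx])
        labels[idx[mask]] = new_id
    return labels


# Toy data: three tight groups on a line, with known classes.
X = np.array([[0.0], [0.1], [10.0], [10.1], [30.0], [30.1]])
y = ["a", "a", "b", "b", "c", "c"]
labels = bisecting_kmeans(X, k=3)
print(purity(labels, y))  # 1.0
```

On this toy data the first bisection separates the far group and the second divides the remaining four points, so every cluster contains a single class. Swapping in another similarity or distance function would mean changing how `dists` is computed and how the centers are updated.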

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 15, Issue 3
March 2016
220 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/2876004
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 January 2016
Accepted: 01 August 2015
Revised: 01 May 2015
Received: 01 December 2014
Published in TALLIP Volume 15, Issue 3

Author Tags

  1. Arabic stemmers
  2. Information retrieval
  3. K-means
  4. bisect K-means
  5. similarity measures

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (last 12 months): 12
  • Downloads (last 6 weeks): 3
Reflects downloads up to 20 Jan 2025

Cited By

  • (2024) Using unsupervised learning to classify inlet water for more stable design of water reuse in industrial parks. Water Science & Technology 89:7 (1757-1770). DOI: 10.2166/wst.2024.087. Online publication date: 19-Mar-2024
  • (2024) An unsupervised automatic organization method for Professor Shirakawa’s hand-notated documents of oracle bone inscriptions. International Journal on Document Analysis and Recognition 27:4 (583-601). DOI: 10.1007/s10032-024-00463-0. Online publication date: 1-Dec-2024
  • (2023) Social determinants of health derived from people with opioid use disorder: Improving data collection, integration and use with cross-domain collaboration and reproducible, data-centric, notebook-style workflows. Frontiers in Medicine 10. DOI: 10.3389/fmed.2023.1076794. Online publication date: 2-Mar-2023
  • (2023) Hybrid approach for text categorization. Journal of Information Science 49:3 (762-777). DOI: 10.1177/01655515211027770. Online publication date: 1-Jun-2023
  • (2023) Reading Scene Text with Aggregated Temporal Convolutional Encoder. ACM Transactions on Asian and Low-Resource Language Information Processing 22:11 (1-16). DOI: 10.1145/3625822. Online publication date: 12-Oct-2023
  • (2023) The Same Size Distribution of Data Based on Unsupervised Clustering Algorithms. Advances in Artificial Systems for Logistics Engineering III (437-447). DOI: 10.1007/978-3-031-36115-9_40. Online publication date: 16-Jul-2023
  • (2022) Exploring text representation impact on K-means based arabic text documents clustering. 2022 International Conference on Intelligent Systems and Computer Vision (ISCV) (1-5). DOI: 10.1109/ISCV54655.2022.9806067. Online publication date: 18-May-2022
  • (2022) Arabic Document Clustering: A Survey. 2022 4th International Conference on Current Research in Engineering and Science Applications (ICCRESA) (59-64). DOI: 10.1109/ICCRESA57091.2022.10352511. Online publication date: 20-Dec-2022
  • (2021) Exposing Emerging Trends in Smart Sustainable City Research Using Deep Autoencoders-Based Fuzzy C-Means. Sustainability 13:5 (2876). DOI: 10.3390/su13052876. Online publication date: 7-Mar-2021
  • (2021) Classification of Arabic Tweets: A Review. Electronics 10:10 (1143). DOI: 10.3390/electronics10101143. Online publication date: 12-May-2021
