research-article

K-means, HAC and FCM Which Clustering Approach for Arabic Text?

Authors:

Lahbib Ajallouda,

Fatima Zahra Fagroud,

El Habib Benlahmar

SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications

Article No.: 29, Pages 1 - 8

https://doi.org/10.1145/3419604.3419779

Published: 08 November 2020

Abstract

Today, we are witnessing rapid growth in Web resources that allow Internet users to express and share their ideas, opinions, and judgments on a variety of issues. Several classification approaches have been proposed to classify textual data, but all of them require the target classes to be labeled in advance. In practice, such labels are not available, because we do not know beforehand what information these opinions will contain. To overcome this constraint, clustering approaches such as K-means, HAC, or FCM can be exploited. In this paper, we present and compare these approaches, and we show the importance of exploiting clustering algorithms to classify and analyze textual data in Arabic by applying them to a real case that has sparked considerable debate in Morocco: the case of teachers contracting with academies.
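To make the contrast between the approaches named in the abstract concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical toy corpus of English placeholder tokens standing in for preprocessed Arabic text, plain term-frequency vectors, and fixed initial centers. The helper names (`tf_vectors`, `kmeans`, `fcm`) are illustrative only. HAC is omitted for brevity. The point is that K-means gives each document one hard label, while FCM gives each document a degree of membership in every cluster, and neither needs labels in advance.

```python
# Hypothetical toy corpus: placeholder tokens, not data from the paper.
docs = [
    "teachers contract academy strike",
    "teachers contract protest academy",
    "football match goal team",
    "team goal football league",
]

def tf_vectors(texts):
    """Build raw term-frequency vectors over a sorted vocabulary."""
    vocab = sorted({w for t in texts for w in t.split()})
    return [[t.split().count(w) for w in vocab] for t in texts]

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def weighted_mean(vectors, weights):
    """Weighted average of vectors (weights must not all be zero)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]

def kmeans(X, centers, iters=10):
    """Hard clustering: each point gets exactly one cluster index."""
    labels = []
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda k: dist2(x, centers[k]))
                  for x in X]
        centers = [
            weighted_mean(X, [1.0 if l == k else 0.0 for l in labels])
            if k in labels else centers[k]  # keep the center of an empty cluster
            for k in range(len(centers))
        ]
    return labels

def fcm(X, centers, m=2.0, iters=10, eps=1e-9):
    """Fuzzy C-means: each point gets a membership degree per cluster."""
    U = []
    for _ in range(iters):
        U = []
        for x in X:
            # Memberships are proportional to squared distance ** (-1 / (m - 1)).
            inv = [(dist2(x, c) + eps) ** (-1.0 / (m - 1)) for c in centers]
            s = sum(inv)
            U.append([v / s for v in inv])
        # Centers are means weighted by membership ** m.
        centers = [weighted_mean(X, [u[k] ** m for u in U])
                   for k in range(len(centers))]
    return U

X = tf_vectors(docs)
init = [X[0], X[2]]  # assumed deterministic initialization for the sketch
print("K-means labels:", kmeans(X, init))
for doc, u in zip(docs, fcm(X, init)):
    print([round(v, 3) for v in u], doc)
```

The fuzzifier `m=2.0` and the choice of initial centers are assumptions made for illustration. A real experiment would typically use weighted features such as TF-IDF and a data-driven choice of the number of clusters.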

Cited By

Rassam L, Raoui M, Zellou A, El Yazidi M (2022). Analyzing Textual Documents Indexes by Applying Key-Phrases Extraction in Fuzzy Logic Domain Based on A Graphical Indexing Methodology. 2022 International Conference on Computational Modelling, Simulation and Optimization (ICCMSO), pp. 122-126. Online publication date: Dec-2022.
https://doi.org/10.1109/ICCMSO58359.2022.00035

Rassam L, Aldiebesghanem C, Zellou A, Ben Lahmar E (2022). Fuzzy Logic-based N-gram Graph Technique for Evaluating Textual Documents Indexes. 2022 4th International Conference on Computer Communication and the Internet (ICCCI), pp. 78-82. Online publication date: 1-Jul-2022.
https://doi.org/10.1109/ICCCI55554.2022.9850268

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning

Information & Contributors

Information

Published In

SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications

September 2020

333 pages

ISBN:9781450377331

DOI:10.1145/3419604

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SITA'20

SITA'20: Theories and Applications

September 23 - 24, 2020

Rabat, Morocco

Bibliometrics

Total Citations: 2
Total Downloads: 29
Downloads (Last 12 months): 3
Downloads (Last 6 weeks): 0

Reflects downloads up to 01 Jan 2025
