DOI: 10.1145/3419604.3419779
research-article

K-means, HAC and FCM Which Clustering Approach for Arabic Text?

Published: 08 November 2020

Abstract

Today, we are witnessing rapid growth in Web resources that allow Internet users to express and share their ideas, opinions, and judgments on a variety of issues. Several classification approaches have been proposed for textual data, but all of them require the target classes to be labeled in advance. In practice, such labels are not available, because we do not know beforehand what information these opinions will contain. To overcome this constraint, clustering approaches such as K-means, HAC, or FCM can be exploited. In this paper, we present and compare these approaches, and we show the importance of clustering algorithms for classifying and analyzing Arabic textual data by applying them to a real case that has generated considerable debate in Morocco: the case of teachers on contract with the academies.
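
As a rough illustration of how the three families named in the abstract differ, the sketch below runs minimal K-means (hard labels), HAC (bottom-up merging), and FCM (membership degrees) on a handful of toy 2-D vectors. It is not the authors' pipeline: the points are hypothetical stand-ins for document vectors, the function names are invented for this example, and average linkage and the fuzzifier m = 2 are assumptions, since the abstract does not specify them.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_mean(vectors, weights):
    """Weighted centroid of the vectors (weights need not sum to 1)."""
    total = sum(weights)
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(len(vectors[0]))]

def kmeans(points, centers, iters=20):
    """Hard partition: every point receives exactly one cluster label."""
    centers = [list(c) for c in centers]  # do not mutate the caller's list
    for _ in range(iters):
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = weighted_mean(members, [1.0] * len(members))
    return labels

def hac(points, n_clusters):
    """Agglomerative clustering with average linkage (an assumption here)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist(points[i], points[j]) for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]  # merge the two closest clusters
        del clusters[b]
    return clusters

def fcm(points, centers, m=2.0, iters=20, eps=1e-9):
    """Soft partition: every point receives one membership degree per cluster."""
    centers = [list(c) for c in centers]
    power = 2.0 / (m - 1.0)
    for _ in range(iters):
        memberships = []
        for p in points:
            d = [max(dist(p, c), eps) for c in centers]  # eps avoids division by zero
            memberships.append(
                [1.0 / sum((dk / dj) ** power for dj in d) for dk in d])
        centers = [weighted_mean(points, [row[k] ** m for row in memberships])
                   for k in range(len(centers))]
    return memberships

# Toy data: two tight groups plus one point lying between them (index 6).
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [1.0, 0.9], [0.9, 1.1], [1.1, 1.0],
          [0.5, 0.5]]
init = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical starting centers

labels = kmeans(points, init)
clusters = hac(points, 2)
memberships = fcm(points, init)

print("K-means labels:", labels)
print("HAC clusters:  ", clusters)
print("FCM memberships of the in-between point:",
      [round(u, 2) for u in memberships[6]])
```

The in-between point is the interesting one: K-means and HAC must commit it to a single cluster, whereas FCM reports how strongly it belongs to each, which can matter for opinion texts that touch on more than one theme.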


Cited By

  • (2022) Analyzing Textual Documents Indexes by Applying Key-Phrases Extraction in Fuzzy Logic Domain Based on A Graphical Indexing Methodology. 2022 International Conference on Computational Modelling, Simulation and Optimization (ICCMSO), 122-126. DOI: 10.1109/ICCMSO58359.2022.00035. Online publication date: Dec-2022
  • (2022) Fuzzy Logic-based N-gram Graph Technique for Evaluating Textual Documents Indexes. 2022 4th International Conference on Computer Communication and the Internet (ICCCI), 78-82. DOI: 10.1109/ICCCI55554.2022.9850268. Online publication date: 1-Jul-2022

    Published In

    SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications
    September 2020
    333 pages
    ISBN:9781450377331
    DOI:10.1145/3419604
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Arabic Text Mining
    2. Clustering
    3. Fuzzy C-Means
    4. HAC
    5. K-means

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SITA'20
    SITA'20: Theories and Applications
    September 23 - 24, 2020
    Rabat, Morocco

    Article Metrics

    • Downloads (Last 12 months): 3
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 01 Jan 2025
