
Latent Topic Model for Indexing Arabic Documents

Published: 01 January 2014

Abstract

In this paper, the authors present a latent topic model for indexing and representing Arabic text documents in a way that captures more of their semantics. Representing text in a language with highly inflectional morphology, such as Arabic, is not a trivial task and requires special treatment. The authors describe their approach to analyzing and preprocessing Arabic text, followed by the stemming process. Finally, latent Dirichlet allocation (LDA) is adapted to extract Arabic latent topics: the authors extract the significant topics across all texts, each topic is described by a particular distribution over descriptors, and each text is then represented as a vector over these topics. A classification experiment is conducted on an in-house corpus; latent topics are learned with LDA for different numbers of topics K (25, 50, 75, and 100), and the authors compare the results with classification in the full word space. The results show that classification in the reduced topic space outperforms both classification in the full word space and classification using LSI reduction, in terms of precision, recall, and F-measure.
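
To make the topic-space representation concrete, the following is a minimal sketch of how each document can be turned into a vector over K topics with LDA. It is not the authors' implementation. The abstract does not say which inference method was used, so collapsed Gibbs sampling is assumed here. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy token lists (English placeholders standing in for stemmed Arabic descriptors) are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (assumed inference method).

    docs: list of token lists (already preprocessed and stemmed).
    Returns one topic-proportion vector of length num_topics per document.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign every token to a random topic to initialise the counts.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic proportions: the reduced representation of each text.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return theta


# Hypothetical toy corpus; real input would be stemmed Arabic descriptors.
toy_docs = [
    ["bank", "loan", "bank", "interest"],
    ["match", "goal", "team", "goal"],
    ["loan", "interest", "bank"],
    ["team", "match", "goal"],
]
for vector in lda_gibbs(toy_docs, num_topics=2):
    print([round(p, 3) for p in vector])
```

Each returned vector has K entries that sum to 1, so a corpus with a vocabulary of many thousands of words is reduced to K features per document (K = 25, 50, 75, or 100 in the experiment described above). These vectors can then be passed to a classifier in place of full word-space vectors.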


Published In

International Journal of Information Retrieval Research, Volume 4, Issue 1
January 2014
85 pages
ISSN: 2155-6377
EISSN: 2155-6385

Publisher

IGI Global

United States

Author Tags

  1. Arabic Text Classification
  2. LDA
  3. LSI
  4. Latent Topic Model
  5. Preprocessing Data
  6. SVM
  7. Stemming
  8. Text Representation

Qualifiers

  • Article
