Urdu text classification

Published: 16 December 2009

Abstract

This paper compares two statistical techniques for text classification, Naïve Bayes and Support Vector Machines, in the context of the Urdu language. A large corpus is used to train and test the classifiers. Because the classifiers cannot interpret the raw dataset directly, language-specific preprocessing techniques are applied to it to generate a standardized, reduced-feature lexicon. Urdu is a morphologically rich language, which makes these tasks complex. Statistical characteristics of the corpus and lexicon are measured and indicate that the text preprocessing module performs satisfactorily. The empirical results show that Support Vector Machines outperform the Naïve Bayes classifier in terms of classification accuracy.
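
As a rough illustration of the kind of pipeline the abstract describes, the sketch below removes stop words and then trains a multinomial Naïve Bayes classifier with add-one smoothing. It is not the authors' implementation: the stop-word list, the four training sentences, the class labels, and the function names `tokenize`, `train_nb`, and `predict_nb` are all hypothetical, and the paper's actual normalization steps and its Support Vector Machine classifier are not reproduced here.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word list; the paper's real lexicon is not shown.
STOP_WORDS = {"کا", "کی", "کے", "ہے", "میں", "نے"}

# Hypothetical toy corpus of (text, label) pairs, for illustration only.
TRAIN = [
    ("ٹیم نے میچ جیت لیا", "sports"),
    ("کرکٹ کا میچ آج ہے", "sports"),
    ("حکومت نے بجٹ پیش کیا", "politics"),
    ("وزیر اعظم کا خطاب آج ہے", "politics"),
]


def tokenize(text):
    """Split on whitespace and drop stop words (a stand-in for preprocessing)."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def train_nb(docs):
    """Count documents per class and token occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest smoothed log-posterior."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        # Log prior from class frequencies.
        score = math.log(class_docs[label] / total_docs)
        class_total = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][tok] + 1) / (class_total + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAIN)
sports_guess = predict_nb(model, "کرکٹ ٹیم کا میچ")
politics_guess = predict_nb(model, "حکومت کا بجٹ")
```

Working in log space avoids numeric underflow when many token probabilities are multiplied, and the add-one smoothing keeps a token that is unseen in one class from zeroing out that class's score.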

Published In

FIT '09: Proceedings of the 7th International Conference on Frontiers of Information Technology
December 2009
446 pages
ISBN:9781605586427
DOI:10.1145/1838002

Sponsors

  • COMSATS Institute of Information Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Naïve Bayes
  2. Urdu
  3. corpus
  4. feature selection
  5. information retrieval
  6. lexicon
  7. normalization
  8. text classification
  9. text mining

Qualifiers

  • Research-article

Conference

FIT '09