research-article

An Unsupervised Approach for Sentiment Analysis on Social Media Short Text Classification in Roman Urdu

Published: 03 November 2021

Abstract

Over the last two decades, sentiment analysis, also known as opinion mining, has become one of the most explored research areas in Natural Language Processing (NLP) and data mining. Sentiment analysis focuses on the sentiments or opinions that consumers express on social media and other websites. Because so much opinion is now expressed on the Internet, sentiment analysis has attracted a large number of researchers around the globe. A large amount of research has been conducted for English, Chinese, and other widely used languages. However, Roman Urdu has been neglected despite being the third most used language for communication in the world, with millions of users around the globe. Although some techniques have been proposed for sentiment analysis in Roman Urdu, they are either limited to a specific domain or inadequately developed because language resources for Roman Urdu are unavailable. Therefore, in this article, we propose an unsupervised approach for sentiment analysis in Roman Urdu. First, the proposed model normalizes the text to overcome spelling variations of words. After normalization, we use Roman Urdu and English opinion lexicons to identify users’ opinions in the text. We also incorporate negation terms and stemming to assign a polarity to each extracted opinion. Our model then assigns a score to each sentence on the basis of the polarities of the extracted opinions and classifies the sentence as positive, negative, or neutral. To verify our approach, we conducted experiments on two publicly available Roman Urdu datasets and compared it with existing models. The results demonstrate that our approach outperforms existing models for sentiment analysis tasks in Roman Urdu. Furthermore, our approach does not suffer from domain dependency.
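The pipeline described in the abstract (normalize spelling variants, look up opinion words in a lexicon, handle negation, score the sentence, classify it) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `VARIANTS`, `LEXICON`, and `NEGATIONS` tables are tiny hypothetical stand-ins for the article's resources, the one-token negation window is a simplification, and stemming is omitted. It is not the authors' implementation.

```python
import re

# Hypothetical toy resources; the article's actual lexicons and rules are not reproduced here.
VARIANTS = {"achha": "acha", "achaa": "acha", "buraa": "bura", "nahin": "nahi", "nai": "nahi"}
LEXICON = {"acha": 1, "zabardast": 1, "good": 1, "bura": -1, "bekar": -1, "bad": -1}
NEGATIONS = {"nahi", "not"}


def normalize(sentence):
    """Lowercase, tokenize, and map known spelling variants to one canonical form."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [VARIANTS.get(tok, tok) for tok in tokens]


def score(tokens):
    """Sum opinion-word polarities, flipping a word that has a negation term next to it."""
    total = 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok)
        if polarity is None:
            continue  # not an opinion word
        # Assumed window: one token before and one token after the opinion word.
        neighbours = tokens[max(i - 1, 0):i] + tokens[i + 1:i + 2]
        if any(n in NEGATIONS for n in neighbours):
            polarity = -polarity
        total += polarity
    return total


def classify(sentence):
    """Map the sentence score to positive, negative, or neutral."""
    s = score(normalize(sentence))
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"
```

Calling `classify("yeh phone achha nahin hai")` first normalizes "achha" and "nahin" to "acha" and "nahi", then flips the positive polarity of "acha" because a negation term follows it. A sentence with no lexicon words scores zero and is labeled neutral.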

    Information

    Published In

    ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 21, Issue 2
    March 2022
    413 pages
    ISSN: 2375-4699
    EISSN: 2375-4702
    DOI: 10.1145/3494070

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 November 2021
    Accepted: 01 July 2021
    Revised: 01 May 2021
    Received: 01 January 2021
    Published in TALLIP Volume 21, Issue 2

    Author Tags

    1. Sentiment analysis
    2. Roman Urdu
    3. Opinion extraction
    4. Text normalization
    5. Roman Urdu text classification

    Qualifiers

    • Research-article
    • Refereed

    Article Metrics

    • Downloads (last 12 months): 125
    • Downloads (last 6 weeks): 8
    Reflects downloads up to 15 Jan 2025

    Cited By

    • (2025) Optimized Identification of Sentence-Level Multiclass Events on Urdu-Language-Text Using Machine Learning Techniques. IEEE Access 13 (1-25). DOI: 10.1109/ACCESS.2024.3522992. Online publication date: 2025
    • (2024) Enhanced Lexicon based Hybrid Method for Slang and Punctuation Scoring for Aspect Based Sentiment Analysis. 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT) (1418-1422). DOI: 10.1109/ICEEICT62016.2024.10534389. Online publication date: 2-May-2024
    • (2023) Count Me Too: Sentiment Analysis of Roman Sindhi Script. Sage Open 13:3. DOI: 10.1177/21582440231197452. Online publication date: 29-Sep-2023
    • (2022) A New Approach for Social Networks Based on Ontology of Multilingual Dynamic Groups. International Journal of Organizational and Collective Intelligence 12:1 (1-21). DOI: 10.4018/IJOCI.304888. Online publication date: 30-Jun-2022
    • (2022) Lexical Normalization of Roman Urdu. 2022 24th International Multitopic Conference (INMIC) (1-5). DOI: 10.1109/INMIC56986.2022.9972968. Online publication date: 21-Oct-2022
