A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts
Figure 1. The weighting of a new word occurring in two testing short texts.
Figure 2. The schematic diagram of the hybrid model.
Figure 3. The training process of the hybrid model.
Figure 4. The testing process of the hybrid model.
Abstract
1. Introduction
- A novel new word weighting method based on an artificial neural network (ANN) is developed. The weight of a word measures how likely the word is to be concentrated in one category. The weight of a new word in a short text is computed from the weights of its neighboring words and the probabilities yielded by the ANN.
- Once all words are weighted, a hybrid model that combines the ANN with a hidden Markov model (HMM) is proposed for accurate and fast short text filtering. The HMM predicts the likelihood of a short text being spam.
- The hybrid model targets the challenges specific to short text filtering, namely limited text length and feature sparsity.
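The first contribution can be illustrated with a minimal sketch. The exact combination rule is not given in this outline, so the code below is an assumption made only for illustration: it blends the mean weight of the known neighbors of a new word with a score derived from the ANN output. The function name `weight_new_word`, the parameters `window` and `alpha`, and the `ann_score` scale are all hypothetical.

```python
from statistics import mean


def weight_new_word(tokens, index, known_weights, ann_score, window=1, alpha=0.5):
    """Hypothetical rule: blend neighbor weights with an ANN-derived score.

    tokens        -- list of words in the short text
    index         -- position of the new (unseen) word in tokens
    known_weights -- dict mapping known words to their trained weights
    ann_score     -- score derived from the ANN probability, assumed to be
                     on the same scale as the word weights
    window, alpha -- illustrative parameters, not taken from the paper
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    # Collect the weights of neighboring words that are already known.
    neighbors = [
        known_weights[tokens[i]]
        for i in range(lo, hi)
        if i != index and tokens[i] in known_weights
    ]
    if not neighbors:
        # No known neighbor: fall back to the ANN-derived score alone.
        return ann_score
    return alpha * mean(neighbors) + (1 - alpha) * ann_score
```

With `alpha=0.5`, a new word between two known words of weights 0.8 and 0.6 and an ANN-derived score of 0.5 would receive a weight of 0.6 under this assumed rule.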
2. The Hybrid Model for Short Text Filtering
2.1. Short Text Pre-Processing
- Case folding converts all capital letters in the data set into lowercase characters.
- Tokenization divides the raw text into individual words.
- Stemming and lemmatization chop off the affixes of words and transform the words into their base forms.
- Stop-word removal eliminates common words that play only a positional role and carry little meaning of their own.
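The four steps above can be sketched as a small pipeline. This is a minimal illustration, not the authors' implementation: the stop-word list is a tiny hypothetical sample, the tokenizer only handles ASCII letters and digits, and the `stem` function is a naive suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "is", "are", "and", "or"}

# Suffixes removed by the naive stemmer below (assumption, not a real stemming algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply case folding, tokenization, stemming, and stop-word removal in order."""
    text = text.lower()                        # case folding
    tokens = re.findall(r"[a-z0-9]+", text)    # tokenization
    tokens = [stem(t) for t in tokens]         # naive stemming
    return [t for t in tokens if t not in STOP_WORDS]  # stop-word removal


print(preprocess("Free CALLS and texts to the winners"))
# ['free', 'call', 'text', 'winner']
```

Text in other languages, such as the Chinese messages in the Dahan data set, would need a different tokenizer; that is outside the scope of this sketch.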
2.2. Feature Extraction
2.3. ANN for New Word Weighting
2.3.1. Short Text Representation
2.3.2. Training of the ANN
2.3.3. New Word Weighting Based on ANN Probability
Algorithm 1. The New Word Weighting Algorithm
2.4. The HMM for Short Text Filtering
2.4.1. Short Text Representation for the HMM
2.4.2. HMM Formulation and Notation
2.4.3. Training of the HMM
2.5. The Proposed Hybrid Model
2.5.1. The Asynchronous Training of the Hybrid Model
2.5.2. Filtering with New Word Weighting of the Hybrid Model
3. Experiments and Results
3.1. Experiment on the UCI SMS Data Set
3.2. Experiment Results and Comparisons on the UCI SMS Data Set
3.2.1. Experiment Results of the HMM
3.2.2. Experiment Results of the ANN
3.2.3. Comparisons on the UCI SMS
3.3. Experiments and Results on Other Data Sets
- Dahan: An SMS spam data set in Chinese, containing 14,943 ham messages and 5762 spam messages. The messages were collected in collaboration with our partner company, which operates an enterprise short message service platform handling an average of 150 million short messages daily. The ham messages are mainly notifications from express delivery services, banks, and e-commerce platforms, while the spam messages occasionally originate from registered platform users for advertising purposes.
- MR: A benchmark short text data set of movie reviews [30]. The data set is balanced and contains 5331 positive reviews and 5331 negative reviews. The reviews are processed sentences taken from movie reviews published on the website rottentomatoes.com.
- CR: A benchmark short text data set of customer reviews [31]. The data set is imbalanced. It includes 2406 positive and 1367 negative reviews of digital products, such as DVD players, MP3 players, and cameras.
- SST-1: A benchmark data set for sentiment analysis, called the Stanford Sentiment Treebank [32]. The data set extends MR and is refined with additional labels. In the experiment, reviews labeled very positive, positive, negative, or very negative are retained, and reviews with neutral labels are excluded.
3.4. Training Time and Throughput Results
3.5. Discussions
3.5.1. Performance Analysis
3.5.2. Multi-Language Filtering Capabilities
3.6. Limitations
4. Conclusions and Future Work
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Al Sulaimani, S.; Starkey, A. Short Text Classification Using Contextual Analysis. IEEE Access 2021, 9, 149619–149629.
2. Bakr, Y.; Tolba, A.; Meshreki, H. Drivers of SMS advertising acceptance: A mixed-methods approach. J. Res. Interact. Mark. 2019, 13, 96–118.
3. Alsmadi, I.; Gan, K.H. Review of short-text classification. Int. J. Web Inf. Syst. 2019, 15, 155–182.
4. Gao, Z.; Li, Z.; Luo, J.; Li, X. Short text aspect-based sentiment analysis based on CNN+ BiGRU. Appl. Sci. 2022, 12, 2707.
5. Ghanem, R.; Erbay, H. Spam detection on social networks using deep contextualized word representation. Multimed. Tools Appl. 2023, 82, 3697–3712.
6. Abayomi-Alli, O.; Misra, S.; Abayomi-Alli, A.; Odusami, M. A review of soft techniques for SMS spam classification: Methods, approaches and applications. Eng. Appl. Artif. Intell. 2019, 86, 197–212.
7. Ruan, S.; Chen, B.; Song, K.; Li, H. Weighted naïve Bayes text classification algorithm based on improved distance correlation coefficient. Neural Comput. Appl. 2022, 34, 2729–2738.
8. Samant, S.S.; Murthy, N.L.B.; Malapati, A. Improving Term Weighting Schemes for Short Text Classification in Vector Space Model. IEEE Access 2019, 7, 166578–166592.
9. Dang, E.K.F.; Luk, R.W.P.; Allan, J. Context-dependent feature values in text categorization. Int. J. Softw. Eng. Knowl. Eng. 2020, 30, 1199–1219.
10. Oyelade, O.N.; Agushaka, J.O.; Ezugwu, A.E. Evolutionary binary feature selection using adaptive ebola optimization search algorithm for high-dimensional datasets. PLoS ONE 2023, 18, e0282812.
11. Bansal, B.; Srivastava, S. Hybrid attribute based sentiment classification of online reviews for consumer intelligence. Appl. Intell. 2019, 49, 137–149.
12. Bello, A.; Ng, S.C.; Leung, M.F. A BERT framework to sentiment analysis of tweets. Sensors 2023, 23, 506.
13. Machicao, J.; Corrêa, E.A., Jr.; Miranda, G.H.; Amancio, D.R.; Bruno, O.M. Authorship attribution based on life-like network automata. PLoS ONE 2018, 13, e0193703.
14. Ghourabi, A.; Alohaly, M. Enhancing Spam Message Classification and Detection Using Transformer-Based Embedding and Ensemble Learning. Sensors 2023, 23, 3861.
15. Liao, W.; Zeng, B.; Yin, X.; Wei, P. An improved aspect-category sentiment analysis model for text sentiment analysis based on RoBERTa. Appl. Intell. 2021, 51, 3522–3533.
16. Wang, H.; Tian, K.; Wu, Z.; Wang, L. A Short Text Classification Method Based on Convolutional Neural Network and Semantic Extension. Int. J. Comput. Intell. Syst. 2021, 14, 367–375.
17. Cai, T.; Zhang, X. Imbalanced Text Sentiment Classification Based on Multi-Channel BLTCN-BLSTM Self-Attention. Sensors 2023, 23, 2257.
18. Abid, M.A.; Ullah, S.; Siddique, M.A.; Mushtaq, M.F.; Aljedaani, W.; Rustam, F. Spam SMS filtering based on text features and supervised machine learning techniques. Multimed. Tools Appl. 2022, 81, 39853–39871.
19. Qian, Y.; Du, Y.; Deng, X.; Ma, B.; Ye, Q.; Yuan, H. Detecting new Chinese words from massive domain texts with word embedding. J. Inf. Sci. 2019, 45, 196–211.
20. Duan, J.; Tan, Z.; Zhang, M.; Wang, H. New word detection using BiLSTM+CRF model with features. IEICE Trans. Inf. Syst. 2020, E103D, 2228–2236.
21. Xia, T.; Chen, X. A weighted feature enhanced Hidden Markov Model for spam SMS filtering. Neurocomputing 2021, 444, 48–58.
22. Salton, G.; Wong, A.; Yang, C.S. A vector space model for automatic indexing. Commun. ACM 1975, 18, 613–620.
23. Jain, G.; Sharma, M.; Agarwal, B. Spam detection in social media using convolutional and long short term memory neural network. Ann. Math. Artif. Intell. 2019, 85, 21–44.
24. Mishra, S.; Soni, D. Smishing Detector: A security model to detect smishing through SMS content analysis and URL behavior analysis. Future Gener. Comput. Syst. 2020, 108, 803–815.
25. Ghourabi, A.; Mahmood, M.A.; Alzubi, Q.M. A hybrid CNN-LSTM model for SMS spam detection in Arabic and English messages. Future Internet 2020, 12, 156.
26. Nagwani, N.K.; Sharaff, A. SMS spam filtering and thread identification using bi-level text classification and clustering techniques. J. Inf. Sci. 2017, 43, 75–87.
27. Shaaban, M.A.; Hassan, Y.F.; Guirguis, S.K. Deep convolutional forest: A dynamic deep ensemble approach for spam detection in text. Complex Intell. Syst. 2022, 8, 4897–4909.
28. Roy, P.K.; Singh, J.P.; Banerjee, S. Deep learning to filter SMS Spam. Future Gener. Comput. Syst. 2020, 102, 524–533.
29. Xia, T.; Chen, X. A discrete hidden Markov model for SMS spam detection. Appl. Sci. 2020, 10, 5011.
30. Pang, B.; Lee, L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. arXiv 2005, arXiv:cs/0506075.
31. Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; pp. 168–177.
32. Socher, R.; Perelygin, A.; Wu, J.; Chuang, J.; Manning, C.D.; Ng, A.Y.; Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 1631–1642.
33. Liu, Z.; Kan, H.; Zhang, T.; Li, Y. DUKMSVM: A framework of deep uniform kernel mapping support vector machine for short text classification. Appl. Sci. 2020, 10, 2348.
34. Wang, R.; Li, Z.; Cao, J.; Chen, T.; Wang, L. Convolutional recurrent neural networks for text classification. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–6.
35. Cheng, Y.; Yao, L.; Xiang, G.; Zhang, G.; Tang, T.; Zhong, L. Text Sentiment Orientation Analysis Based on Multi-Channel CNN and Bidirectional GRU with Attention Mechanism. IEEE Access 2020, 8, 134964–134975.
36. Zhang, Z.; Robinson, D.; Tepper, J. Detecting Hate Speech on Twitter Using a Convolution-GRU Based Deep Neural Network. In Proceedings of the 15th Semantic Web International Conference, Heraklion, Greece, 3–7 June 2018; pp. 745–760.
37. Wang, Y.; Huang, M.; Zhao, L.; Zhu, X. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA, 1–5 November 2016; pp. 606–615.
38. Zhou, P.; Qi, Z.; Zheng, S.; Xu, J.; Bao, H.; Xu, B. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv 2016, arXiv:1611.06639.
Spam | −0.07713 | 0.35576 |
Ham | 0.19436 | 0.28185 |
Actual Class | Actual Count | Predicted Spam | Predicted Ham | Spam (%) | Ham (%) | AUC |
---|---|---|---|---|---|---|
Spam | 249 | 227 | 22 | 91.2 | 8.8 | 0.960 |
Ham | 1609 | 16 | 1593 | 1.0 | 99.0 | |
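The per-class metrics can be recomputed from the confusion matrix above with the standard definitions. The sketch below assumes that rows are actual classes and columns are predicted classes, as the percentages (227/249 ≈ 91.2%) suggest; the function name `class_metrics` is hypothetical.

```python
def class_metrics(tp, fn, fp, tn):
    """Standard binary metrics for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts read from the confusion matrix (rows assumed to be actual classes).
spam = class_metrics(tp=227, fn=22, fp=16, tn=1593)
ham = class_metrics(tp=1593, fn=16, fp=22, tn=227)

for name, metrics in (("spam", spam), ("ham", ham)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Under this reading, the accuracy (about 0.980) and the F1 scores (about 0.923 for spam and 0.988 for ham) agree with the values reported for the hybrid model. Which of 0.912 and 0.934 counts as precision and which as recall depends on the row and column convention, so the assumption stated above matters when comparing against the reported figures.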
Model | Class | Acc | Prec | Rec | F1 |
---|---|---|---|---|---|
NB [24] | Overall | 0.916 | 0.93 | 0.92 | 0.92 |
SVM [23] | Overall | 0.931 | 0.929 | 0.931 | 0.930 |
DT [25] | Overall | 0.965 | 0.854 | 0.782 | 0.816 |
LDA [26] | Overall | 0.904 | 0.96 | 0.976 | 0.92 |
LSTM [27] | Spam | 0.977 | 0.948 | 0.901 | 0.926 |
LSTM [27] | Ham | | 0.982 | 0.990 | 0.986 |
3CNN [28] | Spam | 0.979 | 0.988 | 0.858 | 0.922 |
3CNN [28] | Ham | | 0.982 | 0.996 | 0.988 |
Our previous work 1 [29] | Spam | 0.959 | 0.892 | 0.816 | 0.852 |
Our previous work 1 [29] | Ham | | 0.969 | 0.983 | 0.976 |
Our previous work 2 [21] | Spam | 0.969 | 0.936 | 0.850 | 0.891 |
Our previous work 2 [21] | Ham | | 0.975 | 0.990 | 0.982 |
Our hybrid model | Spam | 0.980 | 0.912 | 0.934 | 0.923 |
Our hybrid model | Ham | | 0.990 | 0.986 | 0.988 |
Data Set | Class | Acc | Prec | Rec | F1 |
---|---|---|---|---|---|
Dahan | Spam | 0.989 | 0.983 | 0.977 | 0.980 |
Dahan | Ham | | 0.991 | 0.993 | 0.992 |
MR | Spam | 0.816 | 0.812 | 0.819 | 0.815 |
MR | Ham | | 0.821 | 0.814 | 0.817 |
CR | Spam | 0.852 | 0.831 | 0.770 | 0.800 |
CR | Ham | | 0.859 | 0.900 | 0.879 |
SST-1 | Spam | 0.459 | 0.456 | 0.459 | 0.457 |
SST-1 | Ham | | 0.462 | 0.459 | 0.460 |
Models | MR | CR | SST-1 |
---|---|---|---|
SVM [33] | 0.772 | 0.795 | - |
NB [33] | 0.737 | 0.836 | - |
RCNN [34] | 0.816 | 0.845 | - |
Bi-GRU [35] | 0.815 | 0.844 | - |
Convolutional-GRU [36] | 0.819 | 0.850 | - |
LSTM-att [37] | 0.823 | 0.817 | 0.454 |
Bi-LSTM-att [38] | 0.797 | 0.814 | 0.461 |
The proposed hybrid model | 0.816 | 0.852 | 0.459 |
Models | Training Time (s) | Training Speed (Messages/s) | Filtering Time (s) | Throughput (Messages/s) |
---|---|---|---|---|
LSTM | 47.822 | 76 | 6.854 | 286 |
1CNN | 5.022 | 720 | 0.136 | 14,345 |
3CNN | 11.690 | 309 | 0.247 | 7933 |
The proposed hybrid model | 0.217 | 169,240 | 0.017 | 109,577 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Xia, T.; Chen, X.; Wang, J.; Qiu, F. A Hybrid Model with New Word Weighting for Fast Filtering Spam Short Texts. Sensors 2023, 23, 8975. https://doi.org/10.3390/s23218975