Spam SMS filtering based on text features and supervised machine learning techniques

Muhammad Adeel Abid¹,
Saleem Ullah¹,
Muhammad Abubakar Siddique¹,
Muhammad Faheem Mushtaq²,
Wajdi Aljedaani³ &
…
Furqan Rustam ORCID: orcid.org/0000-0001-8403-1047⁴

Abstract

Advances in technology have affected every field of life, including medicine, music, office work, travel, and communication. Telephone lines were once the main communication medium; wireless technology has since largely replaced wired telephony and offers much broader features. Advertising agencies and spammers commonly use SMS to push promotional material to ordinary users. For this reason, more than 60% of the SMS received daily are reported to be spam. These messages annoy users and are sometimes used to defraud them, while generating large profits for spammers and advertising companies. This study proposes an approach for classifying SMS as spam or ham using supervised machine learning techniques. Features are extracted from the text with Term Frequency-Inverse Document Frequency (TF-IDF) and bag-of-words. The SMS dataset is imbalanced, so over-sampling and under-sampling techniques are applied to address this. A support vector classifier, gradient boosting machine, random forest, Gaussian Naive Bayes, and logistic regression are applied to the spam and ham SMS dataset, and their performance is evaluated using accuracy, precision, recall, and F1 score. The experimental results show that random forest classifies spam and ham SMS most accurately, with 99% accuracy. The proposed model, trained with TF-IDF features and the oversampling technique, identifies whether an SMS is ham or spam. The approach was also evaluated on a spam email dataset, where it likewise achieved 99% accuracy.
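The feature extraction, class balancing, and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which TF-IDF variant or which over-sampling method was used, so the weighting formula (term frequency times ln(N/df) + 1), the whitespace tokenizer, plain random over-sampling, and the function names `tfidf`, `random_oversample`, and `precision_recall_f1` are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per message.

    Assumed weighting: (count / message length) * (ln(N / df) + 1).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of messages that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * (math.log(n_docs / df[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Precision, recall, and F1 score for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy messages (made up for this sketch, not taken from the dataset).
messages = ["free prize now", "see you at lunch", "lunch at noon", "call me later"]
labels = ["spam", "ham", "ham", "ham"]
features = tfidf(messages)
balanced_features, balanced_labels = random_oversample(features, labels)
```

In practice, over-sampling should be applied to the training split only, so that duplicated messages do not leak into the test set. Library implementations may weight terms differently; scikit-learn's `TfidfVectorizer`, for example, smooths the inverse document frequency and L2-normalizes each vector by default, so its values will not match this sketch exactly.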

Data Availability

The used dataset is publicly available on Kaggle. https://www.kaggle.com/uciml/sms-spam-collection-dataset/

Acknowledgements

The authors would like to thank the Department of Software Engineering, School of Systems and Technology, University of Management & Technology, for providing a research-oriented environment.

Author information

Authors and Affiliations

Khwaja Fareed University of Engineering and Information Technology, Rahim Yar Khan, Pakistan
Muhammad Adeel Abid, Saleem Ullah & Muhammad Abubakar Siddique
The Islamia University of Bahawalpur, Bahawalpur, Pakistan
Muhammad Faheem Mushtaq
University of North Texas, Denton, TX, 76203, USA
Wajdi Aljedaani
Department of Software Engineering, School of Systems and Technology, University of Management and Technology, 54770, Lahore, Pakistan
Furqan Rustam

Corresponding author

Correspondence to Furqan Rustam.

Ethics declarations

Conflict of Interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Abid, M.A., Ullah, S., Siddique, M.A. et al. Spam SMS filtering based on text features and supervised machine learning techniques. Multimed Tools Appl 81, 39853–39871 (2022). https://doi.org/10.1007/s11042-022-12991-0

Received: 20 May 2021
Revised: 18 January 2022
Accepted: 27 March 2022
Published: 04 May 2022
Issue Date: November 2022
DOI: https://doi.org/10.1007/s11042-022-12991-0
