research-article

Content-based analysis to detect Arabic web spam

Published: 01 June 2012

Abstract

Search engines are important outlets for information query and retrieval. They must cope with the continual growth of information available on the web and give users convenient access to it. Within this volume of information, spam in web pages is a further challenge that is becoming steadily more difficult to eliminate. For several reasons, web spammers try to intrude into search results and inject artificially biased results in favour of their websites or pages. Spam pages are added to the internet daily, making it difficult for search engines to keep up with the fast-growing and dynamic nature of the web, especially since spammers tend to add more keywords to their websites to deceive search engines and raise the rank of their pages. In this research, we investigated four classification algorithms (naïve Bayes, decision tree, SVM and K-NN) for detecting Arabic web spam pages based on content. The three groups of datasets used, with 1%, 15% and 50% spam content, were collected using a crawler customized for this study. Spam pages were classified manually. Tests and comparisons revealed that the decision tree was the best classifier for this purpose.
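
To make the idea of content-based classification concrete, here is a minimal sketch in Python of one of the four algorithm families named in the abstract, K-NN. Everything specific in it is an assumption made for illustration: the two features (share of the most frequent word and ratio of distinct words), the tiny invented English training pages, and the helper names `content_features` and `knn_predict`. It does not reproduce the study's feature set, its Arabic datasets, or its results.

```python
import math
from collections import Counter


def content_features(text):
    """Return two illustrative content features for a page's text:
    (share of the most frequent word, ratio of distinct words to all words).
    Keyword-stuffed pages tend to score high on the first and low on the second.
    These features are assumptions for this sketch, not the study's features."""
    words = text.split()
    if not words:
        return (0.0, 0.0)
    counts = Counter(words)
    top_count = counts.most_common(1)[0][1]
    return (top_count / len(words), len(counts) / len(words))


def knn_predict(train, query, k=3):
    """Label a feature vector by majority vote among its k nearest
    training vectors, using Euclidean distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented, manually labelled toy pages (hypothetical data).
PAGES = [
    ("cheap phones cheap phones cheap phones buy cheap phones", "spam"),
    ("win win win prize win prize win now", "spam"),
    ("free free download free download free", "spam"),
    ("the museum opens daily and offers guided tours", "non-spam"),
    ("our university publishes research on language processing", "non-spam"),
    ("weather today is sunny with light wind in the north", "non-spam"),
]
TRAIN = [(content_features(text), label) for text, label in PAGES]

print(knn_predict(TRAIN, content_features("hotel deals hotel deals hotel deals book hotel")))
print(knn_predict(TRAIN, content_features("the library has a new reading room")))
```

The same feature vectors could be passed to any of the other three classifier families; K-NN is used here only because it fits in a few lines without external libraries.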

    Information

    Published In

    Journal of Information Science  Volume 38, Issue 3
    June 2012
    54 pages

    Publisher

    Sage Publications, Inc.

    United States

    Author Tags

    1. Arabic content features
    2. Arabic web spam
    3. Arabic web spam detection
    4. content features
    5. web spam
    6. web spam detection

    Cited By

    • (2021) Using Bayesian networks with hidden variables for identifying trustworthy users in social networks. Journal of Information Science 46:5 (600-615). DOI: 10.1177/0165551519857590. Online publication date: 25-Feb-2021
    • (2021) Detecting Arabic Spam Reviews in Social Networks Based on Classification Algorithms. ACM Transactions on Asian and Low-Resource Language Information Processing 21:1 (1-13). DOI: 10.1145/3476115. Online publication date: 1-Nov-2021
    • (2018) Efficient Feature Representation Based on the Effect of Words Frequency for Arabic Documents Classification. Proceedings of the 2nd International Conference on Telecommunications and Communication Engineering (397-401). DOI: 10.1145/3291842.3291900. Online publication date: 28-Nov-2018
    • (2017) The impact of indexing approaches on Arabic text classification. Journal of Information Science 43:2 (159-173). DOI: 10.1177/0165551515625030. Online publication date: 1-Apr-2017
    • (2017) A unified score propagation model for web spam demotion algorithm. Information Retrieval 20:6 (547-574). DOI: 10.1007/s10791-017-9307-9. Online publication date: 1-Dec-2017
