research-article

Looking into the past to better classify web spam

Authors:

Brian D. Davison,

Xiaoguang Qi

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web

Pages 1 - 8

https://doi.org/10.1145/1531914.1531916

Published: 21 April 2009

Abstract

Web spamming techniques aim to achieve undeserved rankings in search results. Research has been widely conducted on identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.
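
The combination step described in the abstract can be illustrated with a minimal sketch. It assumes two base classifiers (one built on current page content, one on temporal features) that each emit a spam score per host, and it uses a small logistic-regression combiner trained by gradient descent. The combiner choice, the function names, and the toy scores are all assumptions made for illustration; they are not the paper's classifiers and not WEBSPAM-UK2007 data.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_combiner(scores, labels, lr=1.0, epochs=5000):
    """Fit a logistic-regression combiner by batch gradient descent.

    scores: one row per host, one column per base classifier output.
    labels: 1 for spam, 0 for non-spam.
    Returns (weights, bias).
    """
    n, d = len(scores), len(scores[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def classify(w, b, x):
    """Label a host as spam (1) when the combined score is non-negative."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0.0 else 0


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the spam class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical scores: [current-content classifier, temporal classifier].
toy_scores = [
    [0.9, 0.8], [0.4, 0.9], [0.8, 0.3], [0.7, 0.7],      # spam hosts
    [0.2, 0.1], [0.3, 0.2], [0.1, 0.4], [0.45, 0.15],    # non-spam hosts
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = train_combiner(toy_scores, toy_labels)
predictions = [classify(weights, bias, x) for x in toy_scores]
print("F-measure on the toy training set:", f_measure(toy_labels, predictions))
```

In this toy data, the second spam host scores low on current content but high on temporal evidence, which is the kind of case where a combined classifier could recover a page that a content-only baseline would miss.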

Index Terms

Looking into the past to better classify web spam
1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Document analysis
2. Information systems
  1. Information retrieval
  2. World Wide Web
    1. Web applications
    2. Web services

Published In

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web

April 2009

67 pages

ISBN: 9781605584386

DOI: 10.1145/1531914

Editors: Dennis Fetterly (Microsoft Research), Zoltán Gyöngyi (Google Research)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

AIRWeb '09: 5th International Workshop on Adversarial Information Retrieval on the Web

April 21, 2009

Madrid, Spain

Cited By

Kulkarni A, Solani D, Sanghavi P, Kunchapu A, Vijayalakshmi M, Nair S (2022). Discernment of Unsolicited Internet Spamdexing Using Graph Theory. Inventive Systems and Control, pages 49-64. Online publication date: 2-Aug-2022.
https://doi.org/10.1007/978-981-19-1012-8_4
Shahzad A, Nawi N, Rehman M, Khan A (2021). An Improved Framework for Content- and Link-Based Web-Spam Detection. Complexity, 2021. Online publication date: 1-Jan-2021.
https://dl.acm.org/doi/10.1155/2021/6625739
Niu X, Liu G, Yang Q (2020). Trustworthy Website Detection Based on Social Hyperlink Network Analysis. IEEE Transactions on Network Science and Engineering, 7:1, pages 54-65. Online publication date: 1-Jan-2020.
https://doi.org/10.1109/TNSE.2018.2866066
Luckner M, Kozik R (2019). Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase. Security and Communication Networks, 2019. Online publication date: 20-Feb-2019.
https://dl.acm.org/doi/10.1155/2019/6587020
Gruss D, Schwarz M, Wübbeling M, Guggi S, Malderle T, More S, Lipp M, Kim J, Ahn G, Kim S, Kim Y, Lopez J, Kim T (2018). Use-After-FreeMail. Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 297-311. Online publication date: 29-May-2018.
https://dl.acm.org/doi/10.1145/3196494.3196514
Makkar A, Obaidat M, Kumar N (2018). FS2RNN: Feature Selection Scheme for Web Spam Detection Using Recurrent Neural Networks. 2018 IEEE Global Communications Conference (GLOBECOM), pages 1-6. Online publication date: 9-Dec-2018.
https://dl.acm.org/doi/10.1109/GLOCOM.2018.8647294
Roul R, Asthana S, Shah M, Parikh D (2016). Detecting spam web pages using content and link-based techniques. Sadhana, 41:2, pages 193-202. Online publication date: 10-Mar-2016.
https://doi.org/10.1007/s12046-015-0460-9
Wang D, Pu C (2015). BEAN. Proceedings of the 2015 IEEE International Conference on Information Reuse and Integration, pages 403-410. Online publication date: 13-Aug-2015.
https://dl.acm.org/doi/10.1109/IRI.2015.69
Goh K, Singh A (2015). Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification. Procedia Computer Science, 70, pages 434-441. Online publication date: 2015.
https://doi.org/10.1016/j.procs.2015.10.069
Fdez-Glez J, Ruano-Ordas D, Méndez J, Fdez-Riverola F, Laza R, Pavón R (2015). A dynamic model for integrating simple web spam classification techniques. Expert Systems with Applications: An International Journal, 42:21, pages 7969-7978. Online publication date: 30-Nov-2015.
https://dl.acm.org/doi/10.1016/j.eswa.2015.06.043
