DOI: 10.1145/1531914.1531916
research-article

Looking into the past to better classify web spam

Published: 21 April 2009

Abstract

Web spamming techniques aim to achieve undeserved rankings in search results. Research has been widely conducted on identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.
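
As a rough illustration of the setup described in the abstract, the sketch below computes the F-measure and combines the outputs of two base classifiers with a small logistic-regression meta-classifier (a stacking-style combiner). The scores, labels, the 0.5 threshold, and the names `f_measure`, `counts`, `train_combiner`, and `predict` are hypothetical and chosen only for illustration. The abstract does not say which learner the authors use for the combination, and none of the numbers below come from the WEBSPAM-UK2007 experiments.

```python
import math


def f_measure(tp, fp, fn):
    """F-measure (F1): harmonic mean of precision and recall for the spam class."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def counts(preds, labels):
    """Return (tp, fp, fn), treating label 1 as spam."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp, fp, fn


def sigmoid(z):
    # Clamp z so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_combiner(rows, labels, epochs=2000, lr=0.5):
    """Fit logistic-regression weights over base-classifier scores
    using plain stochastic gradient descent in a fixed order."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x))) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    """Label a page as spam (1) when the combined score is positive."""
    return 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


# Hypothetical toy data: each row is
# (score from a current-content classifier, score from a temporal-feature classifier).
rows = [(0.9, 0.8), (0.4, 0.9), (0.6, 0.7), (0.5, 0.2), (0.3, 0.1), (0.6, 0.3)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = non-spam

# Baseline: threshold the current-content score alone (assumed threshold 0.5).
baseline_preds = [1 if x[0] > 0.5 else 0 for x in rows]
baseline_f = f_measure(*counts(baseline_preds, labels))

# Combined: meta-classifier over both base-classifier scores.
w, b = train_combiner(rows, labels)
combined_preds = [predict(w, b, x) for x in rows]
combined_f = f_measure(*counts(combined_preds, labels))

print("baseline F-measure:", baseline_f)
print("combined F-measure:", combined_f)
```

The combiner here is evaluated on the same toy examples it was trained on, so the comparison only shows the mechanics of combining classifiers and scoring them; a real evaluation would use held-out data.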

Published In

AIRWeb '09: Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
April 2009
67 pages
ISBN:9781605584386
DOI:10.1145/1531914
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. archival web
  2. search engine spam
  3. spam classification
  4. temporal features

Qualifiers

  • Research-article

Conference

AIRWeb '09

Cited By

  • (2022) Discernment of Unsolicited Internet Spamdexing Using Graph Theory. Inventive Systems and Control, pages 49-64. DOI: 10.1007/978-981-19-1012-8_4. Online publication date: 2-Aug-2022
  • (2021) An Improved Framework for Content- and Link-Based Web-Spam Detection. Complexity, 2021. DOI: 10.1155/2021/6625739. Online publication date: 1-Jan-2021
  • (2020) Trustworthy Website Detection Based on Social Hyperlink Network Analysis. IEEE Transactions on Network Science and Engineering, 7:1, pages 54-65. DOI: 10.1109/TNSE.2018.2866066. Online publication date: 1-Jan-2020
  • (2019) Practical Web Spam Lifelong Machine Learning System with Automatic Adjustment to Current Lifecycle Phase. Security and Communication Networks, 2019. DOI: 10.1155/2019/6587020. Online publication date: 20-Feb-2019
  • (2018) Use-After-FreeMail. Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 297-311. DOI: 10.1145/3196494.3196514. Online publication date: 29-May-2018
  • (2018) FS2RNN: Feature Selection Scheme for Web Spam Detection Using Recurrent Neural Networks. 2018 IEEE Global Communications Conference (GLOBECOM), pages 1-6. DOI: 10.1109/GLOCOM.2018.8647294. Online publication date: 9-Dec-2018
  • (2016) Detecting spam web pages using content and link-based techniques. Sadhana, 41:2, pages 193-202. DOI: 10.1007/s12046-015-0460-9. Online publication date: 10-Mar-2016
  • (2015) BEAN. Proceedings of the 2015 IEEE International Conference on Information Reuse and Integration, pages 403-410. DOI: 10.1109/IRI.2015.69. Online publication date: 13-Aug-2015
  • (2015) Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification. Procedia Computer Science, 70, pages 434-441. DOI: 10.1016/j.procs.2015.10.069. Online publication date: 2015
  • (2015) A dynamic model for integrating simple web spam classification techniques. Expert Systems with Applications: An International Journal, 42:21, pages 7969-7978. DOI: 10.1016/j.eswa.2015.06.043. Online publication date: 30-Nov-2015
