Improving web spam classifiers using link structure

Authors:

Torsten Suel

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web

Pages 17 - 20

https://doi.org/10.1145/1244408.1244412

Published: 08 May 2007

Abstract

Web spam has been recognized as one of the top challenges in the search engine industry [14]. A lot of recent work has addressed the problem of detecting or demoting web spam, including both content spam [16, 12] and link spam [22, 13]. However, any time an anti-spam technique is developed, spammers will design new spamming techniques to confuse search engine ranking methods and spam detection mechanisms. Machine learning-based classification methods can quickly adapt to newly developed spam techniques. We describe a two-stage approach to improve the performance of common classifiers. We first implement a classifier to catch a large portion of spam in our data. Then we design several heuristics to decide if a node should be relabeled based on the preclassified result and knowledge about the neighborhood. Our experimental results show visible improvements with respect to precision and recall.
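
The abstract does not spell out the relabeling heuristics, so the following is only an illustrative sketch of the two-stage idea, not the authors' method. It assumes stage one has already produced a spam/non-spam label per node, and it applies one simple hypothetical rule for stage two: flip a label when at least `flip_threshold` of a node's labeled neighbors carry the opposite label. The function names, the threshold value, and the toy graph are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def relabel_by_neighborhood(
    labels: Dict[str, bool],
    neighbors: Dict[str, List[str]],
    flip_threshold: float = 0.75,  # assumed value, not taken from the paper
) -> Dict[str, bool]:
    """Hypothetical stage two: flip a stage-one label (True = spam) when
    a large share of the node's labeled neighbors disagree with it."""
    relabeled = dict(labels)
    for node, is_spam in labels.items():
        labeled = [n for n in neighbors.get(node, []) if n in labels]
        if not labeled:
            continue  # no neighborhood evidence, keep the stage-one label
        # Always read the original stage-one labels, so the result does
        # not depend on the order in which nodes are visited.
        spam_share = sum(labels[n] for n in labeled) / len(labeled)
        if not is_spam and spam_share >= flip_threshold:
            relabeled[node] = True
        elif is_spam and (1.0 - spam_share) >= flip_threshold:
            relabeled[node] = False
    return relabeled


def precision_recall(
    predicted: Dict[str, bool], truth: Dict[str, bool]
) -> Tuple[float, float]:
    """Precision and recall for the spam class."""
    tp = sum(1 for n, t in truth.items() if t and predicted.get(n, False))
    fp = sum(1 for n, t in truth.items() if not t and predicted.get(n, False))
    fn = sum(1 for n, t in truth.items() if t and not predicted.get(n, False))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): stage one misses spam node "d", which links only to spam.
stage_one = {"a": True, "b": True, "c": True, "d": False, "e": False}
neighbors = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b"],
    "d": ["a", "b", "c"],
    "e": ["d"],
}
truth = {"a": True, "b": True, "c": True, "d": True, "e": False}

stage_two = relabel_by_neighborhood(stage_one, neighbors)
print(precision_recall(stage_one, truth))  # (1.0, 0.75)
print(precision_recall(stage_two, truth))  # (1.0, 1.0)
```

On this toy graph the relabeling step recovers the one missed spam node, so recall rises from 0.75 to 1.0 while precision stays at 1.0. A real system would need to tune the threshold and decide which link directions count as the neighborhood.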

References

[1]

E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The connectivity sonar: Detecting site functionality by structural patterns. In Proc. 14th ACM Conf. on Hypertext and Hypermedia, 2003.

[2]

L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web Spam. In Workshop on Advers. Inf. Retrieval on the Web, Aug. 2006.

[3]

A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank - fully automatic link spam detection. In Workshop on Advers. Inf. Retrieval on the Web, 2005.

[4]

A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In Workshop on Advers. Inf. Retrieval on the Web, 2006.

[5]

C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, Yahoo! Research Barcelona, Nov. 2006.

[6]

S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998.

[7]

B. Davison. Recognizing nepotistic links on the web. In Workshop on Artificial Intelligence for Web Search, 2000.

[8]

B. Davison. Topical locality in the web. In Proc. 23rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.

[9]

I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proc. European Conf. on Machine Learning, 2005.

[10]

D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proc. 1st Latin American Web Congress, 2003.

[11]

D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In Proc. 7th Int. Workshop on the Web and Databases, pages 1–6, 2004.

[12]

Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In Workshop on Advers. Inf. Retrieval on the Web, 2005.

[13]

Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. 30th VLDB, 2004.

[14]

M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11–22, 2002.

[15]

J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604–632, 1999.

[16]

A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proc. 15th WWW, pages 83–92, 2006.

[17]

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.

[18]

V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Int. Conf. on Data Engineering, 2002.

[19]

M. Sobek. PR0 - Google's PageRank 0 penalty, 2002.

[20]

I. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

[21]

B. Wu and B. Davison. Identifying link farm spam pages. In Proc. 14th WWW, May 2005.

[22]

B. Wu and B. Davison. Detecting semantic cloaking on the web. In Proc. 15th WWW, pages 819–828, 2006.

[23]

B. Wu, V. Goel, and B. Davison. Propagating trust and distrust to demote Web spam. In Workshop on Models of Trust and the Web, 2006.

[24]

H. Zhang, A. Goel, R. Govindan, K. Mason, and B. Van Roy. Making eigenvector-based reputation systems robust to collusion. In Proc. 3rd Workshop on Web Graphs, 2004.

Cited By

Shahzad A, Nawi N, Rehman M, Khan A (2021). An Improved Framework for Content- and Link-Based Web-Spam Detection. Complexity, 2021. Online publication date: 1-Jan-2021.
https://dl.acm.org/doi/10.1155/2021/6625739
Van Goethem T, Miramirkhani N, Joosen W, Nikiforakis N, Galbraith S, Russello G, Susilo W, Gollmann D, Kirda E, Liang Z (2019). Purchased Fame. Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pages 366-378. Online publication date: 2-Jul-2019.
https://dl.acm.org/doi/10.1145/3321705.3329830
Makkar A, Obaidat M, Kumar N (2018). FS2RNN: Feature Selection Scheme for Web Spam Detection Using Recurrent Neural Networks. 2018 IEEE Global Communications Conference (GLOBECOM), pages 1-6. Online publication date: 9-Dec-2018.
https://dl.acm.org/doi/10.1109/GLOCOM.2018.8647294

Index Terms

Improving web spam classifiers using link structure
1. Information systems
  1. Information retrieval

Information & Contributors

Information

Published In

AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web

May 2007

98 pages

ISBN:9781595937322

DOI:10.1145/1244408

Conference Chairs: Carlos Castillo (Yahoo! Research), Kumar Chellapilla (Microsoft Live Labs), Brian D. Davison (Lehigh University)

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

AIRWeb'07: Third International Workshop on Adversarial Information Retrieval on the Web

May 8, 2007

Banff, Alberta, Canada

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 31
Total Downloads: 603
Downloads (Last 12 months): 1
Downloads (Last 6 weeks): 0

Reflects downloads up to 26 Dec 2024

Citations

Alsaleh M, Alarifi A (2016). Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers. PLOS ONE, 11:11 (e0164383). Online publication date: 17-Nov-2016.
https://doi.org/10.1371/journal.pone.0164383
Eugene O, Fengli Z, Adu-Boahen O, Yellakuor B (2015). Spam detection through link authorization from neighboring nodes. 2015 Forth International Conference on e-Technologies and Networks for Development (ICeND), pages 1-6. Online publication date: Sep-2015.
https://doi.org/10.1109/ICeND.2015.7328538
Goh K, Singh A (2015). Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification. Procedia Computer Science, 70, pages 434-441. Online publication date: 2015.
https://doi.org/10.1016/j.procs.2015.10.069
Kapusta J, Munk M, Drlík M (2015). Identification of Underestimated and Overestimated Web Pages Using PageRank and Web Usage Mining Methods. Transactions on Computational Collective Intelligence XVIII, pages 127-146. Online publication date: 31-Jul-2015.
https://doi.org/10.1007/978-3-662-48145-5_7
Shin Y, Myers S, Gupta M, Radivojac P (2015). A link graph-based approach to identify forum spam. Security and Communication Networks, 8:2, pages 176-188. Online publication date: 25-Jan-2015.
https://dl.acm.org/doi/10.1002/sec.970
Bhat S, Abulaish M, Mirza A (2014). Spammer Classification Using Ensemble Methods over Structural Social Network Features. Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 02, pages 454-458. Online publication date: 11-Aug-2014.
https://dl.acm.org/doi/10.1109/WI-IAT.2014.133
Zhang J, Zhao Y, Dimitroff A (2014). A study on health care consumers’ diabetes term usage across identified categories. Aslib Journal of Information Management, 66:4, pages 443-463. Online publication date: 15-Jul-2014.
https://doi.org/10.1108/AJIM-01-2014-0008
