DOI: 10.1145/1244408.1244412
Article

Improving web spam classifiers using link structure

Published: 08 May 2007

Abstract

Web spam has been recognized as one of the top challenges in the search engine industry [14]. A lot of recent work has addressed the problem of detecting or demoting web spam, including both content spam [16, 12] and link spam [22, 13]. However, any time an anti-spam technique is developed, spammers will design new spamming techniques to confuse search engine ranking methods and spam detection mechanisms. Machine learning-based classification methods can quickly adapt to newly developed spam techniques. We describe a two-stage approach to improve the performance of common classifiers. We first implement a classifier to catch a large portion of spam in our data. Then we design several heuristics to decide if a node should be relabeled based on the preclassified result and knowledge about the neighborhood. Our experimental results show visible improvements with respect to precision and recall.
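
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not spell out the heuristics, so the neighbor-fraction rule, the `threshold` parameter, and the name `relabel_by_neighbors` are assumptions made for this example, not the authors' method. The first-stage classifier is represented only by its output labels.

```python
from typing import Dict, List


def relabel_by_neighbors(
    labels: Dict[str, bool],
    neighbors: Dict[str, List[str]],
    threshold: float = 0.7,
) -> Dict[str, bool]:
    """Hypothetical second stage: revise first-stage labels using neighbors.

    labels    -- first-stage output, True meaning "classified as spam"
    neighbors -- adjacency lists of the link graph (assumed undirected here)
    threshold -- fraction of spam neighbors needed to flip a node to spam;
                 a node flips to non-spam when the fraction is below
                 1 - threshold (an assumed rule, chosen for illustration)
    """
    relabeled = {}
    for node, is_spam in labels.items():
        # Only neighbors that received a first-stage label can vote.
        known = [n for n in neighbors.get(node, []) if n in labels]
        if not known:
            relabeled[node] = is_spam  # no evidence, keep the original label
            continue
        spam_fraction = sum(labels[n] for n in known) / len(known)
        if spam_fraction > threshold:
            relabeled[node] = True
        elif spam_fraction < 1 - threshold:
            relabeled[node] = False
        else:
            relabeled[node] = is_spam  # ambiguous neighborhood, keep label
    return relabeled


# Toy graph: "a" was missed by the first stage but links only to spam nodes.
first_stage = {"a": False, "b": True, "c": True, "d": True, "e": False}
graph = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": []}
second_stage = relabel_by_neighbors(first_stage, graph)
```

In this toy graph, node `a` is relabeled as spam because all of its labeled neighbors are spam, while the isolated node `e` keeps its first-stage label. Votes are taken from the first-stage labels only, so the result does not depend on the order in which nodes are visited.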

References

[1]
E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The connectivity sonar: Detecting site functionality by structural patterns. In Proc. 14th ACM Conf. on Hypertext and Hypermedia, 2003.
[2]
L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web Spam. In Workshop on Advers. Inf. Retrieval on the Web, Aug. 2006.
[3]
A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In Workshop on Advers. Inf. Retrieval on the Web, 2005.
[4]
A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In Workshop on Advers. Inf. Retrieval on the Web, 2006.
[5]
C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. Technical report, Yahoo! Research Barcelona, Nov. 2006.
[6]
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In Proc. ACM SIGMOD Int. Conf. on Management of Data, 1998.
[7]
B. Davison. Recognizing nepotistic links on the web. In Workshop on Artificial Intelligence for Web Search, 2000.
[8]
B. Davison. Topical locality in the web. In Proc. 23rd Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2000.
[9]
I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. In Proc. European Conf. on Machine Learning, 2005.
[10]
D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of near-duplicate web pages. In Proc. 1st Latin American Web Congress, 2003.
[11]
D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In Proc. 7th Int. Workshop on the Web and Databases, pages 1--6, 2004.
[12]
Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In Workshop on Advers. Inf. Retrieval on the Web, 2005.
[13]
Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In Proc. 30th VLDB, 2004.
[14]
M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. SIGIR Forum, 36(2):11--22, 2002.
[15]
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604--632, 1999.
[16]
A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proc. 15th WWW, pages 83--92, 2006.
[17]
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.
[18]
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In Int. Conf. on Data Engineering, 2002.
[19]
M. Sobek. PR0 - Google's PageRank 0 penalty, 2002.
[20]
I. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.
[21]
B. Wu and B. Davison. Identifying link farm spam pages. In Proc. 14th WWW, May 2005.
[22]
B. Wu and B. Davison. Detecting semantic cloaking on the web. In Proc. 15th WWW, pages 819--828, 2006.
[23]
B. Wu, V. Goel, and B. Davison. Propagating trust and distrust to demote Web spam. In Workshop on Models of Trust and the Web, 2006.
[24]
H. Zhang, A. Goel, R. Govindan, K. Mason, and B. Van Roy. Making eigenvector-based reputation systems robust to collusion. In Proc. 3rd Workshop on Web Graphs, 2004.

    Published In

    AIRWeb '07: Proceedings of the 3rd international workshop on Adversarial information retrieval on the web
    May 2007
    98 pages
    ISBN: 9781595937322
    DOI: 10.1145/1244408

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. classification
    2. link analysis
    3. machine learning
    4. search engines
    5. web mining
    6. web spam detection

    Qualifiers

    • Article

    Conference

    AIRWeb'07

    Cited By

    • (2021) An Improved Framework for Content- and Link-Based Web-Spam Detection. Complexity, 2021. DOI: 10.1155/2021/6625739. Online publication date: 1-Jan-2021
    • (2019) Purchased Fame. Proceedings of the 2019 ACM Asia Conference on Computer and Communications Security, pages 366-378. DOI: 10.1145/3321705.3329830. Online publication date: 2-Jul-2019
    • (2018) FS2RNN: Feature Selection Scheme for Web Spam Detection Using Recurrent Neural Networks. 2018 IEEE Global Communications Conference (GLOBECOM), pages 1-6. DOI: 10.1109/GLOCOM.2018.8647294. Online publication date: 9-Dec-2018
    • (2016) Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers. PLOS ONE, 11:11 (e0164383). DOI: 10.1371/journal.pone.0164383. Online publication date: 17-Nov-2016
    • (2015) Spam detection through link authorization from neighboring nodes. 2015 Forth International Conference on e-Technologies and Networks for Development (ICeND), pages 1-6. DOI: 10.1109/ICeND.2015.7328538. Online publication date: Sep-2015
    • (2015) Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification. Procedia Computer Science, 70, pages 434-441. DOI: 10.1016/j.procs.2015.10.069. Online publication date: 2015
    • (2015) Identification of Underestimated and Overestimated Web Pages Using PageRank and Web Usage Mining Methods. Transactions on Computational Collective Intelligence XVIII, pages 127-146. DOI: 10.1007/978-3-662-48145-5_7. Online publication date: 31-Jul-2015
    • (2015) A link graph-based approach to identify forum spam. Security and Communication Networks, 8:2, pages 176-188. DOI: 10.1002/sec.970. Online publication date: 25-Jan-2015
    • (2014) Spammer Classification Using Ensemble Methods over Structural Social Network Features. Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 02, pages 454-458. DOI: 10.1109/WI-IAT.2014.133. Online publication date: 11-Aug-2014
    • (2014) A study on health care consumers’ diabetes term usage across identified categories. Aslib Journal of Information Management, 66:4, pages 443-463. DOI: 10.1108/AJIM-01-2014-0008. Online publication date: 15-Jul-2014
