[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Link analysis for Web spam detection

Published: 03 March 2008 Publication History

Abstract

We propose link-based techniques for automatic detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. The use of Web spam is widespread and difficult to solve, mostly due to the large size of the Web which means that, in practice, many algorithms are infeasible.
We perform a statistical analysis of a large collection of Web pages. In particular, we compute statistics of the links in the vicinity of every Web page applying rank propagation and probabilistic counting over the entire Web graph in a scalable way. These statistical features are used to build Web spam classifiers which only consider the link structure of the Web, regardless of page contents. We then present a study of the performance of each of the classifiers alone, as well as their combined performance, by testing them over a large collection of Web link spam. After tenfold cross-validation, our best classifiers have a performance comparable to that of state-of-the-art spam classifiers that use content attributes, but are orthogonal to content-based methods.

References

[1]
Alon, N., Matias, Y., and Szegedy, M. 1999. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci. 58, 1, 137--147.
[2]
Angelova, R. and Weikum, G. 2006. Graph-based text classification: learn from your neighbors. Proceedings of the International ACM SIGIR Conference, 485--492.
[3]
Baeza-Yates, R., Boldi, P., and Castillo, C. 2006. Generalizing pagerank: Damping functions for link-based ranking algorithms. In Proceedings of ACM SIGIR. ACM Press. WA, 308--315.
[4]
Baeza-Yates, R., Castillo, C., and López, V. 2005. Pagerank increase under different collusion topologies. In 1st International Workshop on Adversarial Information Retrieval on the Web.
[5]
Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley.
[6]
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. 2006a. Link-based characterization and detection of Web Spam. In 2nd International Workshop on Adversarial Information Retrieval on the Web (AIRWeb). Seattle, WA.
[7]
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and Baeza-Yates, R. 2006b. Using rank propagation and probabilistic counting for link-based spam detection. In Proceedings of the Workshop on Web Mining and Web Usage Analysis (WebKDD). ACM Press.
[8]
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M. 2005. Spamrank: Fully automatic link spam detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web. Chiba, Japan.
[9]
Boldi, P., Codenotti, B., Santini, M., and Vigna, S. 2004. Ubicrawler: A scalable fully distributed web crawler. Softw. Pract. Exp. 34, 8, 711--726.
[10]
Breiman, L. 1996. Bagging predictors. Machine Learn. 24, 2, 123--140.
[11]
Broder, A. and Mitzenmacher, M. 2003. Network applications of Bloom filters: A survey. Internet Math. 1, 4, 485--509.
[12]
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006a. A reference collection for web spam. SIGIR Forum 40, 2, 11--24.
[13]
Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., and Vigna, S. 2006b. A reference collection for web spam. SIGIR Forum 40, 2, 11--24.
[14]
Castillo, C., Donato, D., Gionis, A., Murdock, V., and Silvestri, F. 2007. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference (SIGIR). ACM Press, 423--430.
[15]
Cohen, E. 1997. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci. 55, 3, 441--453.
[16]
Costa, L., Rodrigues, F. A., Travieso, G., and Villas. 2005. Characterization of complex networks: A survey of measurements. URL: http://arxiv.org/abs/cond-mat/0505185.
[17]
da Costa-Carvalho, A. L., Chirita, P.-A., de Moura, E. S., Calado, P., and Nejdl, W. 2006. Site level noise removal for search engines. In Proceedings of the 15th International Conference on World Wide Web (WWW'06), ACM Press, 73--82.
[18]
Davison, B. D. 2000a. Recognizing nepotistic links on the Web. In Artificial Intelligence for Web Search. AAAI Press, TX, 23--28.
[19]
Davison, B. D. 2000b. Topical locality in the web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 272--279.
[20]
Demetrescu, C., Finocchi, I., and Ribichini, A. 2006. Trading off space for passes in graph streaming problems. In Proceedings of the 7th Annual ACM-SIAM Symposium on Discrete Algorithms.
[21]
Drost, I. and Scheffer, T. 2005. Thwarting the nigritude ultramarine: learning to identify link spam. In Proceedings of the 16th European Conference on Machine Learning (ECML). Lecture Notes in Artificial Intelligence, vol. 3720. Springer, 233--243.
[22]
Durand, M. and Flajolet, P. 2003. Loglog counting of large cardinalities (extended abstract). In Proceedings of 11th Annual European Symposium on Algorithms. Lecture Notes in Computer Science, vol. 2832. Springer, 605--617.
[23]
Eiron, N., Curley, K. S., and Tomlin, J. A. 2004. Ranking the web frontier. In Proceedings of the 13th International Conference on World Wide Web. ACM Press, 309--318.
[24]
Feigenbaum, J., Kannan, S., Gregor, M. A., Suri, S., and Zhang, J. 2004. On graph problems in a semi-streaming model. In 31st International Colloquium on Automata, Languages and Programming.
[25]
Fetterly, D., Manasse, M., and Najork, M. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th Workshop on the Web and Databases (WebDB). Paris, France. 1--6.
[26]
Flajolet, P. and Martin, N. G. 1985. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31, 2, 182--209.
[27]
Gibson, D., Kumar, R., and Tomkins, A. 2005. Discovering large dense subgraphs in massive graphs. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). 721--732.
[28]
Gomes, L. H., Almeida, R. B., Bettencourt, L. M. A., Almeida, V., and Almeida, J. M. 2005. Comparative graph theoretical characterization of networks of spam and legitimate email. URL: http://www.ceas.cc/papers-2005/131.pdf.
[29]
Gori, M. and Witten, I. 2005. The bubble of web visibility. Comm. ACM 48, 3, 115--117.
[30]
Gulli, A. and Signorini, A. 2005. The indexable Web is more than 11.5 billion pages. In Poster Proceedings of the 14th International Conference on World Wide Web. ACM Press, 902--903.
[31]
Gupta, S., Anderson, R. M., and May, R. M. 1989. Networks of sexual contacts: implications for the pattern of spread of hiv. AIDS 3, 12, 807--817.
[32]
Gyöngyi, Z., Berkhin, P., Garcia-Molina, H., and Pedersen, J. 2006. Link spam detection based on mass estimation. In Proceedings of the 32nd International Conference on Very Large Data Bases. ACM, 439--450.
[33]
Gyöngyi, Z. and Garcia-Molina, H. 2005. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web.
[34]
Gyöngyi, Z., Garcia-Molina, H., and Pedersen, J. 2004. Combating Web spam with TrustRank. In Proceedings of the 30th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, 576--587.
[35]
Haveliwala, T. 1999. Efficient computation of pagerank. Tech. rep., Stanford University.
[36]
Henzinger, M. R., Raghavan, P., and Rajagopalan, S. 1999. Computing on data streams. In Dimacs Series in Discrete Mathematics and Theoretical Computer Science, 107--118.
[37]
Leskovec, J., Kleinberg, J., and Faloutsos, C. 2005. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceeding of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (KDD'05). ACM Press, 177--187.
[38]
Lipton, R. J. and Naughton, J. F. 1989. Estimating the size of generalized transitive closures. In Proceedings of the 15th International Conference on Very Large Data Bases (VLDB'89). Morgan Kaufmann Publishers Inc., 165--171.
[39]
Lu, Q. and Getoor, L. 2003. Link-based classification. In Proceedings of the International Conference on Machine Learning.
[40]
Mitzenmacher, M. and Upfal, E. 2005. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press.
[41]
Morris, R. 1978. Counting large numbers of events in small registers. Comm. ACM 21, 10, 840--842.
[42]
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. 2006. Detecting spam web pages through content analysis. In Proceedings of the World Wide Web Conference. Edinburgh, Scotland. 83--92.
[43]
Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The PageRank citation ranking: bringing order to the Web. Tech. rep., Stanford Digital Library Technologies Project.
[44]
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. 2002. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press. 81--90.
[45]
Perkins, A. 2001. The classification of search engine spam. http://www.silverdisc.co.uk/articles/spam-classification/.
[46]
Qi, X. and Davison, B. D. 2006. Knowing a web page by the company it keeps. In Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM). Arlington, VA. 228--237.
[47]
Shen, G., Gao, B., Liu, T.-Y., Feng, G., Song, S., and Li, H. 2006. Detecting link spam using temporal information. In Proceedings of the International Conference on Data Mining (ICDM). Hong Kong.
[48]
Vitter, J. S. 2001. External memory algorithms and data structures. ACM Comput. Surv. 33, 2, 209--271.
[49]
Witten, I. H. and Frank, E. 1999. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
[50]
Wu, B. and Davison, B. D. 2005. Identifying link farm spam pages. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (WWW'05), ACM Press, 820--829.
[51]
Zhang, H., Goel, A., Govindan, R., Mason, K., and Van Roy, B. 2004. Making eigenvector-based reputation systems robust to collusion. In Proceedings of the 3rd Workshop on Web Graphs (WAW). Lecture Notes in Computer Science, vol. 3243. Springer, 92--104.
[52]
Zhang, T., Popescul, A., and Dom, B. 2006. Linear prediction models with graph regularization for web-page categorization. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06). ACM Press, 821--826.

Cited By

View all
  • (2023)A Fuzzy-Based Approach to Enhance Cyber Defence Security for Next-Generation IoTIEEE Internet of Things Journal10.1109/JIOT.2021.305332610:3(2079-2086)Online publication date: 1-Feb-2023
  • (2023)CESDAM: Centered subgraph data matrix for large graph representationPrinciples of Big Graph: In-depth Insight10.1016/bs.adcom.2021.09.005(1-38)Online publication date: 2023
  • (2023)A flexible PageRank-based graph embedding framework closely related to spectral eigenvector embeddingsJournal of Applied and Computational Topology10.1007/s41468-023-00129-68:5(1417-1444)Online publication date: 21-Jul-2023
  • Show More Cited By

Index Terms

  1. Link analysis for Web spam detection

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on the Web
    ACM Transactions on the Web  Volume 2, Issue 1
    February 2008
    280 pages
    ISSN:1559-1131
    EISSN:1559-114X
    DOI:10.1145/1326561
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 03 March 2008
    Accepted: 01 October 2007
    Revised: 01 September 2007
    Received: 01 March 2007
    Published in TWEB Volume 2, Issue 1

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Link analysis
    2. adversarial information retrieval

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)22
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 02 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)A Fuzzy-Based Approach to Enhance Cyber Defence Security for Next-Generation IoTIEEE Internet of Things Journal10.1109/JIOT.2021.305332610:3(2079-2086)Online publication date: 1-Feb-2023
    • (2023)CESDAM: Centered subgraph data matrix for large graph representationPrinciples of Big Graph: In-depth Insight10.1016/bs.adcom.2021.09.005(1-38)Online publication date: 2023
    • (2023)A flexible PageRank-based graph embedding framework closely related to spectral eigenvector embeddingsJournal of Applied and Computational Topology10.1007/s41468-023-00129-68:5(1417-1444)Online publication date: 21-Jul-2023
    • (2022)An Adaptive Social Spammer Detection Model With Semi-Supervised Broad LearningIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.304785734:10(4622-4635)Online publication date: 1-Oct-2022
    • (2021)Locating highly connected clusters in large networks with HyperLogLog countersJournal of Complex Networks10.1093/comnet/cnab0239:2Online publication date: 29-Jul-2021
    • (2021)Sampling methods and estimation of triangle count distributions in large networksNetwork Science10.1017/nws.2021.2(1-23)Online publication date: 26-Feb-2021
    • (2021)Reliability Prediction for Health-Related Content: A Replicability StudyAdvances in Information Retrieval10.1007/978-3-030-72240-1_4(47-61)Online publication date: 28-Mar-2021
    • (2020)Scalable Anti-TrustRank with Qualified Site-level Seeds for Link-based Web Spam DetectionCompanion Proceedings of the Web Conference 202010.1145/3366424.3385773(593-602)Online publication date: 20-Apr-2020
    • (2020)Taxonomy of Link Based web Spammers using Mining Optimized PageRank Algorithm for e-Governance2020 International Conference on Intelligent Engineering and Management (ICIEM)10.1109/ICIEM48762.2020.9160317(155-159)Online publication date: Jun-2020
    • (2020)An efficient deep learning-based scheme for web spam detection in IoT environmentFuture Generation Computer Systems10.1016/j.future.2020.03.004Online publication date: Mar-2020
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media