[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1060745.1060839acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
Article

Scaling link-based similarity search

Published: 10 May 2005 Publication History

Abstract

To exploit the similarity information hidden in the hyperlink structure of the web, this paper introduces algorithms scalable to graphs with billions of vertices on a distributed architecture. The similarity of multi-step neighborhoods of vertices are numerically evaluated by similarity functions including SimRank [20], a recursive refinement of cocitation; PSimRank, a novel variant with better theoretical characteristics; and the Jaccard coefficient, extended to multi-step neighborhoods. Our methods are presented in a general framework of Monte Carlo similarity search algorithms that precompute an index database of random fingerprints, and at query time, similarities are estimated from the fingerprints. The performance and quality of the methods were tested on the Stanford Webbase [19] graph of 80M pages by comparing our scores to similarities extracted from the ODP directory [26]. Our experimental results suggest that the hyperlink structure of vertices within four to five steps provide more adequate information for similarity search than single-step neighborhoods.

References

[1]
R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[2]
Z. Bar-Yossef, A. Berg, S. Chien, J. Fakcharoenphol, and D. Weitz. Approximating aggregate queries about web pages via random walks. Proc. of VLDB, 2000.
[3]
Z. Bar-Yossef, A. Z. Broder, R. Kumar, and A. Tomkins. Sic transit gloria telae: towards an understanding of the web's decay. Proc. of WWW13, 2004.
[4]
A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Proc. of WWW9, 2000.
[5]
A. Z. Broder. On the resemblance and containment of documents. Proc. of Compression and Complexity of Sequences (SEQUENCES'97), pages 21--29. IEEE Computer Society, 1997.
[6]
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000.
[7]
A. Z. Broder, S. C. Glassman, M. S. Manasse, G. Zweig. Syntactic clustering of the Web. Proc. of WWW6, 1997.
[8]
S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. Proc. of SIGMOD, 1998.
[9]
Y.-Y. Chen, Q. Gan, and T. Suel. I/O-efficient techniques for computing PageRank. Proc. of CIKM, 2002.
[10]
E. Cohen. Size-estimation framework with applications to transitive closure and reachability. J. Comput. Syst. Sci., 55(3):441--453, 1997.
[11]
J. Dean and M. R. Henzinger. Finding related pages in the World Wide Web. Computer Networks (Amsterdam, Netherlands: 1999), 31(11--16):1467--1479, 1999.
[12]
D. Fogaras and B. Rácz. A scalable randomized method to compute link-based similarity rank on the web graph. Proc. of Clustering Information over the Web (ClustWeb), in conjunction with EDBT, 2004.
[13]
D. Fogaras and B. Rácz. Scaling link-based similarity search. Technical report, MTA SZTAKI, 2004. http://www.ilab.sztaki.hu/websearch/Publications/.
[14]
D. Fogaras and B. Rácz. Towards scaling fully personalized PageRank. Proc. of Third Workshop on Algorithms and Models for the Web-Graph (WAW) in conjunction with FOCS, 2004.
[15]
T. H. Haveliwala. Efficient computation of PageRank. Technical Report 1999-31, Stanford University, 1999.
[16]
T. H. Haveliwala, A. Gionis, D. Klein, and P. Indyk. Evaluating strategies for similarity search on the web. Proc. of WWW11, 2002.
[17]
T. H. Haveliwala, S. Kamvar, and G. Jeh. An analytical comparison of approaches to personalizing PageRank. Technical report, Stanford University, 2003.
[18]
M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork. On near-uniform url sampling. Proc. of WWW9, 2000.
[19]
J. Hirai, S. Raghavan, H. Garcia-Molina, and A. Paepcke. Webbase: a repository of web pages. Proc. of WWW9, 2000.
[20]
G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. Proc. of SIGKDD, 2002.
[21]
J. Kleinberg. Authoritative sources in a hyperlinked environment. J. of the ACM, 46(5):604--632, 1999.
[22]
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. Proc. of CIKM, pages 556--559. ACM Press, 2003.
[23]
W. Lu, J. Janssen, E. Milios, and N. Japkowicz. Node similarity in networked information spaces. Proc. of 2001 conference of the Centre for Advanced Studies on Collaborative research, page~11. IBM Press, 2001.
[24]
U. Meyer, P. Sanders, and J. Sibeyn. Algorithms for Memory Hierarchies, Advanced Lectures. LNCS, Springer, 2003.
[25]
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[26]
Open Directory Project (ODP). http://www.dmoz.org.
[27]
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[28]
C. R. Palmer, P. B. Gibbons, and C. Faloutsos. ANF: a fast and scalable tool for data mining in massive graphs. Proc. of SIGKDD, 2002.
[29]
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. Proc. of AAAI Fall Symposium on Using Uncertainty Within Computation, pages 121--128, 2001.

Cited By

View all
  • (2023)Efficient and Accurate SimRank-Based Similarity Joins: Experiments, Analysis, and ImprovementProceedings of the VLDB Endowment10.14778/3636218.363621917:4(617-629)Online publication date: 1-Dec-2023
  • (2023)ClipSim: A GPU-friendly Parallel Framework for Single-Source SimRank with Accuracy GuaranteeProceedings of the ACM on Management of Data10.1145/35887071:1(1-26)Online publication date: 30-May-2023
  • (2023)All-Pairs SimRank Updates on Dynamic Graphs2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)10.1109/ISPA-BDCloud-SocialCom-SustainCom59178.2023.00050(131-138)Online publication date: 21-Dec-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
WWW '05: Proceedings of the 14th international conference on World Wide Web
May 2005
781 pages
ISBN:1595930469
DOI:10.1145/1060745
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 May 2005

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. fingerprint
  2. link-analysis
  3. scalability
  4. similarity search

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)33
  • Downloads (Last 6 weeks)3
Reflects downloads up to 19 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)Efficient and Accurate SimRank-Based Similarity Joins: Experiments, Analysis, and ImprovementProceedings of the VLDB Endowment10.14778/3636218.363621917:4(617-629)Online publication date: 1-Dec-2023
  • (2023)ClipSim: A GPU-friendly Parallel Framework for Single-Source SimRank with Accuracy GuaranteeProceedings of the ACM on Management of Data10.1145/35887071:1(1-26)Online publication date: 30-May-2023
  • (2023)All-Pairs SimRank Updates on Dynamic Graphs2023 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom)10.1109/ISPA-BDCloud-SocialCom-SustainCom59178.2023.00050(131-138)Online publication date: 21-Dec-2023
  • (2023)Hierarchical All-Pairs SimRank CalculationDatabase Systems for Advanced Applications10.1007/978-3-031-30675-4_17(252-268)Online publication date: 15-Apr-2023
  • (2022)Scaling High-Quality Pairwise Link-Based Similarity Retrieval on Billion-Edge GraphsACM Transactions on Information Systems10.1145/349520940:4(1-45)Online publication date: 11-Jan-2022
  • (2022)Unsupervised Entity Resolution With Blocking and Graph AlgorithmsIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.299106334:3(1501-1515)Online publication date: 1-Mar-2022
  • (2022)Efficient index-free SimRank similarity search in large graphs by discounting path lengthsExpert Systems with Applications10.1016/j.eswa.2022.117746206(117746)Online publication date: Nov-2022
  • (2021)DiskProceedings of the VLDB Endowment10.14778/3430915.343092514:3(351-363)Online publication date: 9-Dec-2021
  • (2021)AdaSimProceedings of the 30th ACM International Conference on Information & Knowledge Management10.1145/3459637.3482316(1528-1537)Online publication date: 26-Oct-2021
  • (2021)Unified and Incremental SimRank: Index-free Approximation with Scheduled PrincipleIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2021.3111734(1-1)Online publication date: 2021
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media