
research-article

Panther: Fast Top-k Similarity Search on Large Networks

Authors:

Juanzi Li

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Pages 1445 - 1454

https://doi.org/10.1145/2783258.2783267

Published: 10 August 2015

Abstract

Estimating similarity between vertices is a fundamental issue in network analysis across various domains, such as social networks and biological networks. Methods based on common neighbors and structural contexts have received much attention. However, both categories of methods are difficult to scale up to large networks (with billions of nodes). In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices. The algorithm is based on the novel idea of a random path. Specifically, given a network, we perform R random walks, each starting from a randomly picked vertex and walking T steps. Theoretically, the algorithm guarantees that the sample size R = O(2ε⁻² log₂ T) depends on the error bound ε, the confidence level (1 − δ), and the path length T of each random walk.
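The sampling procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the estimator, so the sketch assumes that the similarity of two vertices is the fraction of sampled paths that contain both of them. The function names (`sample_random_paths`, `estimate_similarity`, `top_k`), the adjacency-list input format, and the handling of dead-end vertices are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict
from itertools import combinations


def sample_random_paths(adj, num_paths, path_len, seed=None):
    """Sample num_paths (R) random walks of up to path_len (T) steps.

    adj maps each vertex to a list of its neighbors (assumed input format).
    Each walk starts at a uniformly chosen vertex.
    """
    rng = random.Random(seed)
    vertices = sorted(adj)
    paths = []
    for _ in range(num_paths):
        current = rng.choice(vertices)
        path = [current]
        for _ in range(path_len):
            neighbors = adj[current]
            if not neighbors:  # assumption: a walk stops early at a dead end
                break
            current = rng.choice(neighbors)
            path.append(current)
        paths.append(path)
    return paths


def estimate_similarity(paths):
    """Assumed estimator: fraction of sampled paths containing both vertices."""
    counts = defaultdict(int)
    for path in paths:
        # Count each unordered pair at most once per path.
        for u, v in combinations(sorted(set(path)), 2):
            counts[(u, v)] += 1
    total = len(paths)
    return {pair: count / total for pair, count in counts.items()}


def top_k(similarity, vertex, k):
    """Return up to k (neighbor, score) pairs most similar to vertex."""
    scores = []
    for (u, v), score in similarity.items():
        if u == vertex:
            scores.append((v, score))
        elif v == vertex:
            scores.append((u, score))
    # Highest score first; ties broken by vertex id for determinism.
    scores.sort(key=lambda item: (-item[1], item[0]))
    return scores[:k]


if __name__ == "__main__":
    # Two triangles joined by the edge (2, 3).
    graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
    sampled = sample_random_paths(graph, num_paths=1000, path_len=5, seed=42)
    print(top_k(estimate_similarity(sampled), vertex=0, k=3))
```

Enumerating all pairs per path, as done here, costs O(T²) per walk and is only reasonable for small T; a large-scale implementation would more likely keep an index from each vertex to the paths that contain it and score candidates per query.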

We perform an extensive empirical study on a Tencent microblogging network of 1,000,000,000 edges. We show that our algorithm can return top-k similar vertices for any vertex in a network 300× faster than the state-of-the-art methods. We also use two applications, identity resolution and structural hole spanner finding, to evaluate the accuracy of the estimated similarities. Our results demonstrate that the proposed algorithm achieves clearly better performance than several alternative methods.

Supplementary Material

MP4 File (p1445.mp4), 313.56 MB



Index Terms

Panther: Fast Top-k Similarity Search on Large Networks
1. Applied computing
  1. Law, social and behavioral sciences
2. Information systems
  1. Information systems applications
    1. Data mining


Published In

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 2015

2378 pages

ISBN:9781450336642

DOI:10.1145/2783258

General Chairs: Longbing Cao (University of Technology, Sydney), Chengqi Zhang (University of Technology, Sydney)

Program Chairs: Thorsten Joachims (Cornell University), Geoff Webb (Monash University), Dragos D. Margineantu (Boeing Research), Graham Williams (Australian Taxation Office)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Natural Science Foundation of China
Tsinghua University Initiative Scientific Research Program
National High-tech R&D Program
National Basic Research Program of China

Conference

KDD '15: The 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 10 - 13, 2015

Sydney, NSW, Australia

Acceptance Rates

KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%

Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%


Bibliometrics

Total Citations: 40
Total Downloads: 1,134
Downloads (Last 12 months): 62
Downloads (Last 6 weeks): 7

Reflects downloads up to 19 Dec 2024

