I/O-efficient algorithms for top-k nearest keyword search in massive graphs

Qiankun Zhu¹,
Hong Cheng¹ &
Xin Huang²

485 Accesses
Explore all metrics

Abstract

Networks emerging nowadays usually have labels or textual content on the nodes. We model such commonly seen network as an undirected graph G, in which each node is attached with zero or more keywords, and each edge is assigned with a length. On such networks, a novel and useful query is called top-k nearest keyword (\(\mathsf {k\text {-}NK}\)) search. Given a query node q in G and a keyword \(\lambda \), a \(\mathsf {k\text {-}NK}\) query searches k nodes which contain \(\lambda \) and are nearest to q. The \(\mathsf {k\text {-}NK}\) problem has been studied recently in the literature. But most existing solutions assume that the graph as well as the constructed index can fit entirely in memory. As a result, they cannot be applied directly to very large-scale networks which are commonly found in practice, but cannot fit in memory. In this work, we design an I/O-efficient solution, which uses a compact disk index to answer a \(\mathsf {k\text {-}NK}\) query with constant I/Os. The key to an accurate \(\mathsf {k\text {-}NK}\) result is a precise shortest distance estimation in a graph. In our solution, we follow our previous work Qiao et al. (PVLDB 6:901–912, 2013) which uses the shortest path tree as an approximate representation of a graph and uses the tree distance between two nodes as an accurate estimation of the shortest distance between them on a graph. With such representation, the original \(\mathsf {k\text {-}NK}\) query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We exploit a compact tree-based index and study how to lay out the index to disk. We design a novel technique which decomposes the index tree into paths and subtrees and stores them in disk. Our theoretical analysis shows that the disk-based index is small in size and supports constant query I/Os. Extensive experimental study on massive trees and graphs with billions of edges and keywords verifies our theoretical findings and demonstrates the superiority of our method over the state-of-the-art methods in the literature.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Notes

References

Kolahdouzan, M.R., Shahabi, C.: Voronoi-based k nearest neighbor search for spatial network databases. In: VLDB, pp. 840–851 (2004)
Samet, H., Sankaranarayanan, J., Alborzi, H.: Scalable network distance browsing in spatial databases. In: SIGMOD, pp. 43–54 (2008)
Sankaranarayanan, J., Samet, H.: Query processing using distance oracles for spatial networks. IEEE Trans. Knowl. Data Eng. 22(8), 1158–1175 (2010)
Article Google Scholar
Bahmani, B., Goel, A.: Partitioned multi-indexing: bringing order to social search. In: WWW, pp. 399–408 (2012)
Qiao, M., Qin, L., Cheng, H., Yu, J.X., Tian, W.: Top-k nearest keyword search on large graphs. PVLDB 6(10), 901–912 (2013)
Google Scholar
Jiang, M., Fu, A., Wong, R.: Exact top-k nearest keyword search in large networks. In: SIGMOD, pp. 393–404 (2015)
Sibeyn, J.F., Abello, J., Meyer, U.: Heuristics for semi-external depth first search on directed graphs. In: SPAA, pp. 282–292 (2002)
Zhang, Z., Qin, L., Yu, J.X.: Contract & expand: I/O efficient sccs computing. In: ICDE, pp. 208–219 (2014)
Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. OSDI 12, 31–46 (2012)
Google Scholar
Zhu, X., Han, W., Chen, W.: GridGraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. In: USENIX Annual Technical Conference, pp. 375–386 (2015)
Roy, A., Bindschaedler, L., Malicevic, J., Zwaenepoel, W.: Chaos: Scale-out graph processing from secondary storage. In: SOSP, pp. 410–424 (2015)
Maheshwari, A., Zeh, N.: A survey of techniques for designing I/O-efficient algorithms. In: Algorithms for Memory Hierarchies, pp. 36–61. Springer, Berlin (2003)
Hutchinson, D.A., Maheshwari, A., Zeh, N.: An external memory data structure for shortest path queries. In: COCOON, pp. 51–60 (1999)
Hu, X., Tao, Y., Chung, C.: Massive graph triangulation. In: SIGMOD, pp. 325–336 (2013)
Chu, S., Cheng, J.: Triangle listing in massive networks and its applications. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011, pp. 672–680 (2011)
Tao, Y., Papadopoulos, S., Sheng, C., Stefanidis, K.: Nearest keyword search in xml documents. In: SIGMOD, pp. 589–600 (2011)
Bender, M.A., Farach-colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) In Latin American Theoretical Informatics, pp. 88–94. Springer, Berlin (2000)
Michalis, P., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance estimation in large networks. In: CIKM, pp. 867–876 (2009)
Jon, K., Slivkins, A., Wexler, T.: Triangulation and embedding using small sets of beacons. In: IEEE Symposium on Foundations of Computer Science, pp. 444–453 (2004)
Vieira, M.V., Fonseca, B.M., Damazio, R., Golgher, P.B., Reis, D.D.C., Ribeiro-Neto, B.: Efficient search ranking in social networks. In: CIKM, pp. 563–572 (2007)
Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1(1), 269–271 (1959)
Article MathSciNet MATH Google Scholar
Sarma, A.D., Gollapudi, S., Najork, M., Panigrahy, R.: A sketch-based distance oracle for web-scale graphs. In: WSDM, pp. 401–410 (2010)
Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15(1), 72–101 (1904)
Article Google Scholar
Hristidis, V., Papakonstantinou, Y.: Discover: keyword search in relational databases. In: VLDB, pp. 670–681 (2002)
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using banks. In: ICDE, pp. 431–440 (2002)
Golenberg, K., Kimelfeld, B., Sagiv, Y.: Keyword proximity search in complex data graphs. In: SIGMOD, pp. 927–940 (2008)
Li, G., Ooi, B.C., Feng, J., Wang, J., Zhou, L.: Ease: efficient and adaptive keyword search on unstructured, semi-structured and structured data. In: SIGMOD, pp. 903–914 (2008)
Qin, L., Yu, J.X., Chang, L., Tao, Y.: Querying communities in relational databases. In: ICDE, pp. 724–735 (2009)
Kargar, M., An, A.: Keyword search in graphs: finding r-cliques. PVLDB 4(10), 681–692 (2011)
Google Scholar
Yu, J.X., Qin, L., Chang, L.: Keyword search in relational databases: a survey. IEEE Data Eng. Bull. 33(1), 67–78 (2010)
Google Scholar
Yao, B., Tang, M., Li, F.: Multi-approximate-keyword routing in gis data. In: GIS, pp. 201–210 (2011)
Cao, X., Chen, L., Cong, G., Xiao, X.: Keyword-aware optimal route search. PVLDB 5(11), 1136–1147 (2012)
Google Scholar
Tao, Y.: A dynamic I/O-efficient structure for one-dimensional top-k range reporting. In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’14, Snowbird, UT, USA, 22–27 June 2014, pp. 256–265 (2014)

Download references

Acknowledgements

The work was supported by The Chinese University of Hong Kong Direct Grant No. 4055048.

Author information

Authors and Affiliations

Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Sha Tin, Hong Kong
Qiankun Zhu & Hong Cheng
School of Data Science, Fudan University, Shanghai, China
Xin Huang

Authors

Qiankun Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Hong Cheng
View author publications
You can also search for this author in PubMed Google Scholar
Xin Huang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xin Huang.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zhu, Q., Cheng, H. & Huang, X. I/O-efficient algorithms for top-k nearest keyword search in massive graphs. The VLDB Journal 26, 563–583 (2017). https://doi.org/10.1007/s00778-017-0464-7

Download citation

Received: 20 July 2016
Revised: 30 March 2017
Accepted: 11 May 2017
Published: 29 May 2017
Issue Date: August 2017
DOI: https://doi.org/10.1007/s00778-017-0464-7

I/O-efficient algorithms for top-k nearest keyword search in massive graphs

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

COLD. Revisiting Hub Labels on the Database for Large-Scale Graphs

Hub Labels on the database for large-scale graphs with the COLD framework

Finding smallest k-Compact tree set for keyword queries on graphs using mapreduce

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

I/O-efficient algorithms for top-k nearest keyword search in massive graphs

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

COLD. Revisiting Hub Labels on the Database for Large-Scale Graphs

Hub Labels on the database for large-scale graphs with the COLD framework

Finding smallest k-Compact tree set for keyword queries on graphs using mapreduce

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation