Abstract
Networks emerging nowadays usually have labels or textual content on the nodes. We model such commonly seen network as an undirected graph G, in which each node is attached with zero or more keywords, and each edge is assigned with a length. On such networks, a novel and useful query is called top-k nearest keyword (\(\mathsf {k\text {-}NK}\)) search. Given a query node q in G and a keyword \(\lambda \), a \(\mathsf {k\text {-}NK}\) query searches k nodes which contain \(\lambda \) and are nearest to q. The \(\mathsf {k\text {-}NK}\) problem has been studied recently in the literature. But most existing solutions assume that the graph as well as the constructed index can fit entirely in memory. As a result, they cannot be applied directly to very large-scale networks which are commonly found in practice, but cannot fit in memory. In this work, we design an I/O-efficient solution, which uses a compact disk index to answer a \(\mathsf {k\text {-}NK}\) query with constant I/Os. The key to an accurate \(\mathsf {k\text {-}NK}\) result is a precise shortest distance estimation in a graph. In our solution, we follow our previous work Qiao et al. (PVLDB 6:901–912, 2013) which uses the shortest path tree as an approximate representation of a graph and uses the tree distance between two nodes as an accurate estimation of the shortest distance between them on a graph. With such representation, the original \(\mathsf {k\text {-}NK}\) query on a graph can be reduced to answering the query on a set of trees and then assembling the results obtained from the trees. We exploit a compact tree-based index and study how to lay out the index to disk. We design a novel technique which decomposes the index tree into paths and subtrees and stores them in disk. Our theoretical analysis shows that the disk-based index is small in size and supports constant query I/Os. Extensive experimental study on massive trees and graphs with billions of edges and keywords verifies our theoretical findings and demonstrates the superiority of our method over the state-of-the-art methods in the literature.
Similar content being viewed by others
References
Kolahdouzan, M.R., Shahabi, C.: Voronoi-based k nearest neighbor search for spatial network databases. In: VLDB, pp. 840–851 (2004)
Samet, H., Sankaranarayanan, J., Alborzi, H.: Scalable network distance browsing in spatial databases. In: SIGMOD, pp. 43–54 (2008)
Sankaranarayanan, J., Samet, H.: Query processing using distance oracles for spatial networks. IEEE Trans. Knowl. Data Eng. 22(8), 1158–1175 (2010)
Bahmani, B., Goel, A.: Partitioned multi-indexing: bringing order to social search. In: WWW, pp. 399–408 (2012)
Qiao, M., Qin, L., Cheng, H., Yu, J.X., Tian, W.: Top-k nearest keyword search on large graphs. PVLDB 6(10), 901–912 (2013)
Jiang, M., Fu, A., Wong, R.: Exact top-k nearest keyword search in large networks. In: SIGMOD, pp. 393–404 (2015)
Sibeyn, J.F., Abello, J., Meyer, U.: Heuristics for semi-external depth first search on directed graphs. In: SPAA, pp. 282–292 (2002)
Zhang, Z., Qin, L., Yu, J.X.: Contract & expand: I/O efficient sccs computing. In: ICDE, pp. 208–219 (2014)
Kyrola, A., Blelloch, G., Guestrin, C.: GraphChi: large-scale graph computation on just a PC. OSDI 12, 31–46 (2012)
Zhu, X., Han, W., Chen, W.: GridGraph: large-scale graph processing on a single machine using 2-level hierarchical partitioning. In: USENIX Annual Technical Conference, pp. 375–386 (2015)
Roy, A., Bindschaedler, L., Malicevic, J., Zwaenepoel, W.: Chaos: Scale-out graph processing from secondary storage. In: SOSP, pp. 410–424 (2015)
Maheshwari, A., Zeh, N.: A survey of techniques for designing I/O-efficient algorithms. In: Algorithms for Memory Hierarchies, pp. 36–61. Springer, Berlin (2003)
Hutchinson, D.A., Maheshwari, A., Zeh, N.: An external memory data structure for shortest path queries. In: COCOON, pp. 51–60 (1999)
Hu, X., Tao, Y., Chung, C.: Massive graph triangulation. In: SIGMOD, pp. 325–336 (2013)
Chu, S., Cheng, J.: Triangle listing in massive networks and its applications. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 21–24 August 2011, pp. 672–680 (2011)
Tao, Y., Papadopoulos, S., Sheng, C., Stefanidis, K.: Nearest keyword search in xml documents. In: SIGMOD, pp. 589–600 (2011)
Bender, M.A., Farach-colton, M.: The LCA problem revisited. In: Gonnet, G.H., Viola, A. (eds.) In Latin American Theoretical Informatics, pp. 88–94. Springer, Berlin (2000)
Michalis, P., Bonchi, F., Castillo, C., Gionis, A.: Fast shortest path distance estimation in large networks. In: CIKM, pp. 867–876 (2009)
Jon, K., Slivkins, A., Wexler, T.: Triangulation and embedding using small sets of beacons. In: IEEE Symposium on Foundations of Computer Science, pp. 444–453 (2004)
Vieira, M.V., Fonseca, B.M., Damazio, R., Golgher, P.B., Reis, D.D.C., Ribeiro-Neto, B.: Efficient search ranking in social networks. In: CIKM, pp. 563–572 (2007)
Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1(1), 269–271 (1959)
Sarma, A.D., Gollapudi, S., Najork, M., Panigrahy, R.: A sketch-based distance oracle for web-scale graphs. In: WSDM, pp. 401–410 (2010)
Spearman, C.: The proof and measurement of association between two things. Am. J. Psychol. 15(1), 72–101 (1904)
Hristidis, V., Papakonstantinou, Y.: Discover: keyword search in relational databases. In: VLDB, pp. 670–681 (2002)
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using banks. In: ICDE, pp. 431–440 (2002)
Golenberg, K., Kimelfeld, B., Sagiv, Y.: Keyword proximity search in complex data graphs. In: SIGMOD, pp. 927–940 (2008)
Li, G., Ooi, B.C., Feng, J., Wang, J., Zhou, L.: Ease: efficient and adaptive keyword search on unstructured, semi-structured and structured data. In: SIGMOD, pp. 903–914 (2008)
Qin, L., Yu, J.X., Chang, L., Tao, Y.: Querying communities in relational databases. In: ICDE, pp. 724–735 (2009)
Kargar, M., An, A.: Keyword search in graphs: finding r-cliques. PVLDB 4(10), 681–692 (2011)
Yu, J.X., Qin, L., Chang, L.: Keyword search in relational databases: a survey. IEEE Data Eng. Bull. 33(1), 67–78 (2010)
Yao, B., Tang, M., Li, F.: Multi-approximate-keyword routing in gis data. In: GIS, pp. 201–210 (2011)
Cao, X., Chen, L., Cong, G., Xiao, X.: Keyword-aware optimal route search. PVLDB 5(11), 1136–1147 (2012)
Tao, Y.: A dynamic I/O-efficient structure for one-dimensional top-k range reporting. In: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS’14, Snowbird, UT, USA, 22–27 June 2014, pp. 256–265 (2014)
Acknowledgements
The work was supported by The Chinese University of Hong Kong Direct Grant No. 4055048.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Zhu, Q., Cheng, H. & Huang, X. I/O-efficient algorithms for top-k nearest keyword search in massive graphs. The VLDB Journal 26, 563–583 (2017). https://doi.org/10.1007/s00778-017-0464-7
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-017-0464-7