[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3448016.3457316acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Structural Generalizability: The Case of Similarity Search

Published: 18 June 2021 Publication History

Abstract

Supervised and Unsupervised ML algorithms are widely used over graphs. They use the structural properties of the data to deliver effective results. It is known that the same information can be represented under various graph structures. Thus, these algorithms may be effective on some structural variations of the data and ineffective on others. One would like to have an algorithm that is effective and generalizes to all structural variations of a data graph. We define the concept of structural generalizability for algorithms over graphs. We focus on the problem of similarity search, which is a popular task and the building block of many ML algorithms on graphs, and propose a structurally generalizable similarity search algorithm. As this algorithm may require users to specify features in a rather complex language, we modify this algorithm so that it requires only simple guidance from the user. Our extensive empirical study show that our algorithms are structurally generalizable while being efficient and more effective than current algorithms.

Supplementary Material

MP4 File (3448016.3457316.mp4)
Graph similarity search algorithms usually leverage the structural properties of a database. Hence, these algorithms are effective only on some structural variations of the data and are ineffective on other forms, which makes them hard to use. Ideally, one would like to design a data analytics algorithm that is structurally robust, i.e., it returns essentially the same accurate results over all possible structural variations of a dataset. We propose a novel approach to create a structurally robust similarity search algorithm over graph databases. We leverage the classic insight in the database literature that schematic variations are caused by having constraints in the database. We then present RelSim algorithm which is provably structurally robust under these variations. Our empirical studies show that our proposed algorithms are structurally robust while being efficient and as effective as or more effective than the state- of-the-art similarity search algorithms.

References

[1]
Serge Abiteboul, Richard Hull, and Victor Vianu. 1995. Foundations of Databases: The Logical Level. Addison-Wesley.
[2]
Ioannis Antonellis, Hector Garcia-Molina, and Chi-Chao Chang. 2008. Simrank+: query rewriting through link analysis of the click graph. Proc. VLDB Endow., Vol. 1, 1 (2008), 408--421. https://doi.org/10.14778/1453856.1453903
[3]
Marcelo Arenas. 2006. Normalization Theory for XML. SIGMOD Rec., Vol. 35, 4 (Dec. 2006), 57--64. https://doi.org/10.1145/1228268.1228284
[4]
M. Arenas and L. Libkin. 2004. A Normal Form for XML Documents. TODS, Vol. 29, 1 (2004).
[5]
Marcelo Arenas, Jorge Pé rez, Juan L. Reutter, and Cristian Riveros. 2009b. Inverting Schema Mappings: Bridging the Gap between Theory and Practice. PVLDB, Vol. 2, 1 (2009), 1018--1029.
[6]
Marcelo Arenas, Jorge Pérez, and Cristian Riveros. 2009a. The Recovery of a Schema Mapping: Bringing Exchanged Data Back. ACM Trans. Database Syst., Vol. 34, 4, Article 22 (Dec. 2009), 48 pages. https://doi.org/10.1145/1620585.1620589
[7]
Magda Balazinska, Surajit Chaudhuri, Anastasia Ailamaki, Juliana Freire, Sailesh Krishnamurthy, and Michael Stonebraker. 2020. The Next 5 Years: What Opportunities Should the Database Community Seize to Maximize its Impact?. In SIGMOD (Portland, OR, USA). ACM, 411--414. https://doi.org/10.1145/3318464.3394837
[8]
Pablo Barcelo, Jorge Perez, and Juan Reutter. 2013. Schema Mappings and Data Exchange for Graph Databases. In ICDT.
[9]
Catriel Beeri and Moshe Y. Vardi. 1984. A Proof Procedure for Data Dependencies. J. ACM, Vol. 31, 4 (Sept. 1984), 24. https://doi.org/10.1145/1634.1636
[10]
I. Boneva, A. Bonifati, and R. Ciucanu. 2015. Graph data exchange with target constraints. In EDBT/ICDT Workshop GraphQ (Chicago, Illinois, USA) (PODS '17). 171--176.
[11]
Y. Chodpathumwan, A. Aleyasen, A. Termehchy, and Y. Sun. 2016. Towards Representation Independent Similarity Search Over Graph Databases. In CIKM. https://doi.org/10.1145/2983323.2983673
[12]
Y. Chodpathumwan, A. Termehchy, S. Ramsey, A. Shresta, A. Glen, and Z. Liu. 2021. Structural Generalizability: The Case of Similarity Search. arxiv: 1508.03763 [cs.DB]
[13]
I. Cruz, A. Mendelzon, and P. Wood. 1987. A graphical query language supporting recursion. In SIGMOD. 323--330.
[14]
Ronald Fagin. 2007. Inverting Schema Mappings. ACM Trans. Database Syst., Vol. 32, 2 (2007).
[15]
Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, and Wang-Chiew Tan. 2005. Composing Schema Mappings: Second-order Dependencies to the Rescue. ACM Trans. Database Syst., Vol. 30, 4 (Dec. 2005), 994--1055. https://doi.org/10.1145/1114244.1114249
[16]
Ronald Fagina, Phokion Kolaitis, Renee Miller, and Lucian Popa. 2005. Data exchange: semantics and query answering. TCS, Vol. 336, 1 (2005).
[17]
Wenfei Fan and Philip Bohannon. 2008. Information Preserving XML Schema Embedding. TODS, Vol. 33, 1 (2008).
[18]
Wenfei Fan, Yinghui Wu, and Jingbo Xu. 2016. Functional Dependencies for Graphs. In SIGMOD (San Francisco, California, USA) (SIGMOD '16). ACM, New York, NY, USA, 1843--1857.
[19]
R. C. Fernandez, D. Deng, E. Mansour, A. A. Qahtan, W. Tao, Z. Abedjan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang. 2017. A Demo of the Data Civilizer System. In SIGMOD. ACM, 1639--1642. https://doi.org/10.1145/3035918.3058740
[20]
Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. 2003. A Large-Scale Study of the Evolution of Web. In WWW.
[21]
Nadime Francis and Leonid Libkin. 2017. Schema Mappings for Data Graphs. In PODS (Chicago, Illinois, USA) (PODS '17). ACM, New York, NY, USA, 389--401. https://doi.org/10.1145/3034786.3056113
[22]
Vijay Gadepally, Timothy G. Mattson, Michael Stonebraker, Fusheng Wang, Gang Luo, and George Teodoro (Eds.). 2019. Heterogeneous Data Management, Polystores, and Analytics for Healthcare - VLDB 2018 Workshops, Poly and DMAH, Rio de Janeiro, Brazil, August 31, 2018, Revised Selected Papers. Lecture Notes in Computer Science, Vol. 11470. Springer. https://doi.org/10.1007/978--3-030--14177--6
[23]
Gourab Ghoshal and Albert Barbasi. 2011. Ranking Stability and Super-stable Nodes in Complex Networks. Nature Communications, Vol. 2, 394 (2011).
[24]
C. Gutierrez, C. Hurtado, A. Mendelzon, and Jorge Perez. 2010. Foundations of Semantic Web Databases. JCSS, Vol. 77, 3 (2010).
[25]
Guoming He, Haijun Feng, Cuiping Li, and Hong Chen. 2010. Parallel SimRank Computation on Large Graphs with Iterative Aggregation. In KDD.
[26]
André Hernich, Leonid Libkin, and Nicole Schweikardt. 2011. Closed World Data Exchange. ACM Trans. Database Syst., Vol. 36, 2, Article 14 (June 2011), 40 pages. https://doi.org/10.1145/1966385.1966392
[27]
John Hopcroft and Robert Tarjan. 1973. Algorithm 447: Efficient Algorithms for Graph Manipulation. Commun. ACM, Vol. 16, 6 (June 1973), 372--378. https://doi.org/10.1145/362248.362272
[28]
R. Hull. 1984. Relative Information Capacity of Simple Relational Database Schemata. PODS.
[29]
Glen Jeh and Jennifer Widom. 2002. SimRank: a measure of structural-context similarity. In Proceedings of the 8th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining, July 23--26, 2002, Edmonton, Alberta, Canada. ACM, 538--543. https://doi.org/10.1145/775047.775126
[30]
G. Jeh and J. Widom. 2003. Scaling Personalized Web Search. In WWW.
[31]
David Jensen and Jennifer Neville. 2002. Linkage and Autocorrelation Cause Feature Selection Bias in Relational Learning. In Proceedings of the 19th International Conference on Machine Learning (ICML '02). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 259--266.
[32]
Leo Katz. 1953. A new status index derived from sociometric analysis. Psychometrika, Vol. 18, 1 (1953), 39--43.
[33]
Barbara Staudt Lerner. 2000. A Model for Compound Type Changes Encountered in Schema Evolution. TODS, Vol. 25, 1 (2000).
[34]
Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze. 2008. An Introduction to Information Retrieval. Cambridge University Press.
[35]
Sergey Melnik, Atul Adya, and Philip A. Bernstein. 2007. Compiling mappings to bridge applications and databases. In SIGMOD.
[36]
Tom M. Mitchell. 1997. Machine Learning .McGraw-Hill, New York.
[37]
Jose Picado, Arash Termehchy, Alan Fern, and Parisa Ataei. 2017. Schema Independent Relational Learning. In SIGMOD (Chicago, Illinois, USA) (SIGMOD '17). ACM, New York, NY, USA, 929--944. https://doi.org/10.1145/3035918.3035923
[38]
Philippe Rigollet. 2007. Generalization Error Bounds in Semisupervised Classification under the Cluster Assumption. Journal of Machine Learning Research, Vol. 8 (2007).
[39]
Ohad Shamir and Naftali Tishby. 1982. Cluster Stability for Finite Samples. Annals of Probability, Vol. 10, 4 (1982).
[40]
Chuan Shi, Xiangnan Kong, Yue Huang, S Yu Philip, and Bin Wu. 2014. HeteSim: A General Framework for Relevance Measure in Heterogeneous Networks. TKDE 10 (2014).
[41]
Michael Stonebraker, Raul Castro Fernandez, Dong Deng, and Michael L. Brodie. 2017. What to do about database decay. Commun. ACM, Vol. 60, 1 (2017), 11. https://doi.org/10.1145/3014349
[42]
Michael Stonebraker and Ihab F. Ilyas. 2018. Data Integration: The Current Status and the Way Forward. IEEE Data Eng. Bull., Vol. 41, 2 (2018), 3--9. http://sites.computer.org/debull/A18june/p3.pdf
[43]
Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. 2011. PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks. Proc. VLDB Endow., Vol. 4, 11 (2011), 992--1003. http://www.vldb.org/pvldb/vol4/p992-sun.pdf
[44]
Zequn Sun, Qingheng Zhang, Wei Hu, Chengming Wang, Muhao Chen, Farahnaz Akrami, and Chengkai Li. 2020. A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs. Proc. VLDB Endow., Vol. 13, 11 (2020), 2326--2340. http://www.vldb.org/pvldb/vol13/p2326-sun.pdf
[45]
Jian Tang, Cheng Li, and Qiaozhu Mei. 2017. Learning Representations of Large-scale Networks. In KDD.
[46]
Arash Termehchy, Marianne Winslett, and Yodsawalai Chodpathumwan. 2011. How Schema Independent Are Schema Free Query Interfaces?. In ICDE.
[47]
Hanghang Tong and Christos Faloutsos. 2006. Center-Piece Subgraphs: Problem Definition and Fast Solutions. In KDD.
[48]
H. Tong, C. Faloutsos, and J. Pan. 2006. Fast Random Walk with Restart and its Applications. In ICDM.
[49]
Hanghang Tong, Huiming Qu, and Hani Jamjoom. 2008. Measuring Proximity on Graphs with Side Information. In ICDM.
[50]
Ba Quan Truong, Sourav S. Bhowmick, and Curtis Dyreson. 2012. SINBAD: Towards Structure-Independent Querying of Common Neighbors in XML Databases .Springer Berlin Heidelberg, Berlin, Heidelberg, 156--171. https://doi.org/10.1007/978--3--642--29038--1_13
[51]
M. Y. Vardi. 1982. On decomposition of relational databases. In FOCS. 176--185. https://doi.org/10.1109/SFCS.1982.75
[52]
Yannis Velegrakis, Renee J. Miller, and Lucian Popa. 2003. Mapping Adaptation under Evolving Schemas. In VLDB.
[53]
S. Vishwanathan, N. Schraudolph, R. Kondor, and K. Borgwardt. 2010. Graph Kernels. JMLR (2010).
[54]
Peter T. Wood. 2012. Query languages for graph databases. SIGMOD Rec., Vol. 41, 1 (April 2012). https://doi.org/10.1145/2206869.2206879
[55]
C. Yu and H. V. Jagadish. 2006. Efficient Discovery of XML Data Redundancy. In VLDB.
[56]
Weiren Yu and Xuemin Lin. 2012. IRWR: Incremental Random Walk with Restart. In SIGIR.
[57]
Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, and Juanzi Li. 2015. Panther: Fast Top-k Similarity Search on Large Networks. In Proceedings of the 21th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining (Sydney, NSW, Australia) (KDD '15). ACM, New York, NY, USA, 1445--1454. https://doi.org/10.1145/2783258.2783267
[58]
Peixiang Zhao, Jiawei Han, and Yizhou Sun. 2009. P-Rank: A Comprehensive Structural Similarity Measure over Information Networks. In CIKM.

Cited By

View all
  • (2022)RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicineBMC Bioinformatics10.1186/s12859-022-04932-323:1Online publication date: 29-Sep-2022

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 June 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. similarity search
  2. structural generalizability
  3. structural variations

Qualifiers

  • Research-article

Funding Sources

Conference

SIGMOD/PODS '21
Sponsor:

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)32
  • Downloads (Last 6 weeks)1
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2022)RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicineBMC Bioinformatics10.1186/s12859-022-04932-323:1Online publication date: 29-Sep-2022

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media