Abstract
Enterprises create domain-specific knowledge bases (KBs) by curating and integrating their business data from multiple sources. To support a variety of query types over domain-specific KBs, we propose Hermes, an ontology-based system that stores KB data in multiple backends and queries it with different query languages. In this paper, we address two important challenges in realizing such a system: data placement and schema optimization. First, we identify the best data store for each query type and determine the subset of the KB that needs to be stored in that data store, while minimizing data replication. Second, we optimize how the data is organized for best query performance. To choose the best data stores, we partition the data described by the domain ontology into multiple overlapping subsets based on the operations performed in a given query workload, and place these subsets in appropriate data stores according to their capabilities. Then, we optimize the schema on each data store to enable efficient querying. In particular, we focus on property graph schema optimization, which has been largely ignored in the literature. We propose two algorithms to generate an optimized schema from the domain ontology. We demonstrate the effectiveness of our data placement and schema optimization algorithms with two real-world KBs from the medical and financial domains. The results show that the proposed data placement algorithm generates near-optimal data placement plans with minimal data replication overhead, and that the schema optimization algorithms produce high-quality schemas, achieving up to two orders of magnitude speed-up compared to alternative schema designs.
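The placement idea summarized in the abstract (overlapping subsets of ontology data, assigned to stores by capability, with little replication) can be illustrated with a small sketch. This is not the Hermes algorithm: it is a toy greedy heuristic written for illustration only. The function names, the store names, the operation labels, the concept names, and the sizes are all hypothetical assumptions, and the cost model (sum of concept sizes) is deliberately simplistic.

```python
from typing import Dict, List, Set, Tuple


def place_data(
    workload: List[Tuple[str, Set[str]]],
    capabilities: Dict[str, Set[str]],
    sizes: Dict[str, int],
) -> Dict[str, Set[str]]:
    """Toy greedy heuristic (an assumption, not the paper's method).

    Each workload entry is (operation, concepts it touches). The entry is
    assigned to the capable store where it adds the least new data.
    """
    placement: Dict[str, Set[str]] = {store: set() for store in capabilities}
    # Handle the largest entries first so smaller ones can reuse their data.
    ordered = sorted(workload, key=lambda e: -sum(sizes[c] for c in e[1]))
    for operation, concepts in ordered:
        capable = [s for s in sorted(capabilities) if operation in capabilities[s]]
        if not capable:
            raise ValueError(f"no store supports operation {operation!r}")
        # Cost of a store = total size of the concepts it does not hold yet.
        best = min(
            capable,
            key=lambda s: sum(sizes[c] for c in concepts - placement[s]),
        )
        placement[best] |= concepts
    return placement


def replication_overhead(
    placement: Dict[str, Set[str]], sizes: Dict[str, int]
) -> int:
    """Stored size across all stores minus the size of the distinct data."""
    stored = sum(sizes[c] for held in placement.values() for c in held)
    distinct = sum(sizes[c] for c in set().union(*placement.values()))
    return stored - distinct


# Hypothetical stores, operations, concepts, and sizes.
capabilities = {
    "relational": {"join", "aggregate"},
    "graph": {"traversal", "join"},
    "search": {"full_text"},
}
sizes = {"Drug": 10, "Indication": 5, "Company": 20, "Filing": 50}
workload = [
    ("aggregate", {"Company", "Filing"}),
    ("traversal", {"Drug", "Indication"}),
    ("join", {"Drug", "Indication"}),
    ("full_text", {"Filing"}),
]

plan = place_data(workload, capabilities, sizes)
for store, held in sorted(plan.items()):
    print(store, sorted(held))
print("replication overhead:", replication_overhead(plan, sizes))
```

In this toy workload the join over Drug and Indication lands in the graph store, because that store already holds both concepts for the traversal, so only Filing ends up replicated (in the relational and search stores). A real system would need a richer cost model and a workload-derived notion of capability than this sketch assumes.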
Notes
The terms ObjectProperty and Relationship are used interchangeably in this paper.
Even if inheritance and union are not ObjectProperties, we simplify the notation for presentation purposes.
We make a distinction between stored data that is initially placed in the data stores and intermediate data that is generated during query execution.
Access frequencies of concepts, relationships, and data properties in an ontology.
The neighborhood concepts do not include the member concepts of \(c_i\).
Db2 is a registered trademark of IBM Corporation.
Chuan Lei, Vasilis Efthymiou, Fatma Özcan, Rana Alotaibi: Work done while at IBM Research.
Cite this article
Lei, C., Quamar, A., Efthymiou, V. et al. HERMES: data placement and schema optimization for enterprise knowledge bases. The VLDB Journal 32, 549–574 (2023). https://doi.org/10.1007/s00778-022-00756-y
DOI: https://doi.org/10.1007/s00778-022-00756-y