VC-SLAM—A Handcrafted Data Corpus for the Construction of Semantic Models
Figure 1. Example semantic model created for the corpus developed in this paper. This model describes water hydrants in Edmonton (ID 39). Each node consists of a concept and, if directly mapped to a data attribute, the attribute name accompanied by its data type.
Figure 2. The process of ontology and semantic model creation.
Figure 3. Overview of the different components of VC-SLAM and their interaction for semantic modeling.
Figure 4. The boxplot visualizes for the individual data sets how many data points they consist of, and how many attributes each raw data set contains.
Figure 5. The line chart visualizes the number of shared concepts between models. On the x-axis we defined a threshold, and on the y-axis, the square root of the sum of all models having at least as many shared concepts defined by x.
Figure 6. Visualization of the resulting ontology demonstrating the high degree of connectivity and the large number of clusters.
Figure 7. Resulting accuracy of the DSL approach on VC-SLAM.
Abstract
1. Introduction
2. Landscape of Semantic Corpora
2.1. Semantic Labeling and Modeling
2.2. Utilized Corpora and Resulting Obstacles
2.3. Objectives for Semantic Mapping Corpora
3. The VC-SLAM Corpus
3.1. Description of the Corpus
- JSON-based samples with attribute names and values;
- the raw data set including all its values;
- a short textual description of the data set content;
- a textual data documentation of the data set, especially covering its meaning and its structure, as well as all other available metadata in an unstructured way;
- a set of semantic models describing this data source;
- For each concept , it also holds that .
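The components listed above can be pictured as one record per data source. The following is a minimal sketch under stated assumptions, not the format used in the VC-SLAM repository: the class names `CorpusEntry` and `SemanticModel`, all field names, and the sample values are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class SemanticModel:
    """Hypothetical container for one semantic model of a data source."""

    # Concepts used as nodes of the model.
    concepts: set[str] = field(default_factory=set)
    # Relations as (subject concept, relation name, object concept) triples.
    relations: set[tuple[str, str, str]] = field(default_factory=set)
    # Attribute name in the raw data -> concept it is mapped to.
    attribute_mappings: dict[str, str] = field(default_factory=dict)


@dataclass
class CorpusEntry:
    """Hypothetical record bundling the listed components of one data source."""

    samples: list[dict[str, Any]]   # JSON-based samples (attribute names and values)
    raw_data: list[dict[str, Any]]  # the raw data set including all its values
    description: str                # short textual description
    documentation: str              # unstructured textual documentation and metadata
    models: list[SemanticModel] = field(default_factory=list)

    def unmodeled_attributes(self, model: SemanticModel) -> list[str]:
        """Return sample attributes that the given model maps to no concept."""
        names = {name for sample in self.samples for name in sample}
        return sorted(names - set(model.attribute_mappings))


# Illustrative values only; they are not taken from the corpus.
entry = CorpusEntry(
    samples=[{"hydrant_id": 1, "latitude": 53.5, "colour": "red"}],
    raw_data=[{"hydrant_id": 1, "latitude": 53.5, "colour": "red"}],
    description="Water hydrants",
    documentation="Free-text documentation of the data set.",
)
model = SemanticModel(
    concepts={"Hydrant", "Identifier", "Latitude"},
    relations={
        ("Hydrant", "hasIdentifier", "Identifier"),
        ("Hydrant", "hasLatitude", "Latitude"),
    },
    attribute_mappings={"hydrant_id": "Identifier", "latitude": "Latitude"},
)
entry.models.append(model)
print(entry.unmodeled_attributes(model))  # ['colour']
```

Keeping the attribute-to-concept mapping separate from the concept set makes it straightforward to count attributes that a model leaves unmapped, which is the kind of figure reported per model in the corpus statistics.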
3.2. Data Set Identification and Acquisition
3.3. Modeling Setup
3.4. Modeling
4. Statistics and Discussion
4.1. Corpus Statistics
4.1.1. Raw Data
4.1.2. Metadata
4.1.3. Semantic Models
4.1.4. Final Ontology
4.2. Testing VC-SLAM on Existing Algorithms
4.2.1. VC-SLAM for Semantic Labeling
4.2.2. VC-SLAM for Semantic Modeling
4.3. Limitations
5. Conclusions and Outlook
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
1. https://github.com/tmdt-buw/vc-slam (accessed on 14 September 2021).
2. https://zenodo.org/record/5782764 (accessed on 14 September 2021).
3. https://www.cs.ox.ac.uk/isg/challenges/sem-tab/ (accessed on 14 September 2021).
4. https://viznet.media.mit.edu/ (accessed on 14 September 2021).
5. http://webdatacommons.org/webtables/goldstandardV2.html (accessed on 14 September 2021).
6. http://yann.lecun.com/exdb/mnist/ (accessed on 14 September 2021).
7. http://www.image-net.org (accessed on 14 September 2021).
8. https://ai.google.com/research/NaturalQuestions (accessed on 14 September 2021).
9. A list of the concrete ODPs that were scanned is available at our repository: https://github.com/tmdt-buw/vc-slam (accessed on 14 September 2021).
10. https://github.com/tmdt-buw/plasma (accessed on 14 September 2021).
11. https://github.com/giuseppefutia/semi (accessed on 14 September 2021).
References
1. Paulus, A.; Burgdorf, A.; Pomp, A.; Meisen, T. Recent Advances and Future Challenges of Semantic Modeling. In Proceedings of the 2021 IEEE 15th International Conference on Semantic Computing (ICSC), Laguna Hills, CA, USA, 27–29 January 2021; pp. 70–75.
2. Burgdorf, A.; Pomp, A.; Meisen, T. Towards NLP-supported Semantic Data Management. arXiv 2020, arXiv:2005.06916.
3. Polfliet, S.; Ichise, R. Automated Mapping Generation for Converting Databases into Linked Data. In Proceedings of the 2010 International Conference on Posters & Demonstrations Track, Shanghai, China, 9 November 2010; pp. 173–176.
4. Pinkel, C.; Binnig, C.; Kharlamov, E.; Haase, P. IncMap: Pay as You Go Matching of Relational Schemata to OWL Ontologies. In Proceedings of the 8th International Conference on Ontology Matching, Sydney, Australia, 21 October 2013; Volume 1111, pp. 37–48.
5. Pinkel, C.; Binnig, C.; Jiménez-Ruiz, E.; Kharlamov, E.; Nikolov, A.; Schwarte, A.; Heupel, C.; Kraska, T. IncMap: A Journey Towards Ontology-based Data Integration. In Proceedings of the Datenbanksysteme für Business, Technologie und Web (BTW 2017), Stuttgart, Germany, 6–10 March 2017.
6. Paulus, A.; Pomp, A.; Poth, L.; Lipp, J.; Meisen, T. Gathering and Combining Semantic Concepts from Multiple Knowledge Bases. In Proceedings of the ICEIS 2018, Funchal, Madeira, Portugal, 21–24 March 2018; pp. 69–80.
7. Papapanagiotou, P.; Katsiouli, P.; Tsetsos, V.; Anagnostopoulos, C.; Hadjiefthymiades, S. RONTO: Relational to Ontology Schema Matching. AIS Sigsemis Bull. 2006, 3, 32–36.
8. Goel, A.; Knoblock, C.; Lerman, K. Exploiting Structure within Data for Accurate Labeling Using Conditional Random Fields. In Proceedings of the 14th International Conference on Artificial Intelligence (ICAI), Las Vegas, NV, USA, 16–19 July 2012.
9. Ramnandan, S.K.; Mittal, A.; Knoblock, C.A.; Szekely, P. Assigning semantic labels to data sources. In Proceedings of the European Semantic Web Conference, Portoroz, Slovenia, 31 May–4 June 2015; pp. 403–417.
10. Pham, M.; Alse, S.; Knoblock, C.A.; Szekely, P. Semantic Labeling: A Domain-Independent Approach. In Proceedings of the Semantic Web—ISWC 2016, Kobe, Japan, 17–21 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 446–462.
11. Rümmele, N.; Tyshetskiy, Y.; Collins, A. Evaluating approaches for supervised semantic labeling. In Proceedings of the Workshop on Linked Data on the Web, Lyon, France, 23 April 2018.
12. Abdelmageed, N.; Schindler, S. JenTab: Matching Tabular Data to Knowledge Graphs. In Proceedings of the 19th International Semantic Web Conference (ISWC) 2020, Athens, Greece, 2–6 November 2020.
13. Knoblock, C.A.; Szekely, P.; Ambite, J.L.; Goel, A.; Gupta, S.; Lerman, K.; Muslea, M.; Taheriyan, M.; Mallick, P. Semi-automatically mapping structured sources into the semantic web. In Proceedings of the Extended Semantic Web Conference, Crete, Greece, 27–31 May 2012; pp. 375–390.
14. Taheriyan, M.; Knoblock, C.A.; Szekely, P.; Ambite, J.L. A Graph-Based Approach to Learn Semantic Descriptions of Data Sources. In Proceedings of the Semantic Web—ISWC 2013, Sydney, Australia, 21–25 October 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 607–623.
15. Taheriyan, M.; Knoblock, C.A.; Szekely, P.; Ambite, J.L. A Scalable Approach to Learn Semantic Models of Structured Sources. In Proceedings of the 2014 IEEE International Conference on Semantic Computing, Newport Beach, CA, USA, 16–18 June 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 183–190.
16. Taheriyan, M.; Knoblock, C.A.; Szekely, P.; Ambite, J.L. Learning the Semantics of Structured Data Sources. J. Web Semant. 2016, 37-38, 152–169.
17. Taheriyan, M.; Knoblock, C.A.; Szekely, P.; Ambite, J.L. Leveraging Linked Data to Discover Semantic Relations within Data Sources. In Proceedings of the ISWC 2016—15th International Semantic Web Conference, Kobe, Japan, 17–21 October 2016.
18. Uña, D.D.; Rümmele, N.; Gange, G.; Schachte, P.; Stuckey, P.J. Machine Learning and Constraint Programming for Relational-To-Ontology Schema Mapping. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 1277–1283.
19. Vu, B.; Knoblock, C.; Pujara, J. Learning Semantic Models of Data Sources Using Probabilistic Graphical Models. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 1944–1953.
20. Hulsebos, M.; Hu, K.; Bakker, M.; Zgraggen, E.; Satyanarayan, A.; Kraska, T.; Demiralp, Ç.; Hidalgo, C. Sherlock: A deep learning approach to semantic data type detection. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 1500–1508.
21. Jiménez-Ruiz, E.; Hassanzadeh, O.; Efthymiou, V.; Chen, J.; Srinivas, K. SemTab 2019: Resources to Benchmark Tabular Data to Knowledge Graph Matching Systems. In Proceedings of the European Semantic Web Conference, Athens, Greece, 2–6 November 2020; pp. 514–530.
22. Hu, K.; Gaikwad, S.; Hulsebos, M.; Bakker, M.A.; Zgraggen, E.; Hidalgo, C.; Kraska, T.; Li, G.; Satyanarayan, A.; Demiralp, Ç. Viznet: Towards a large-scale visualization learning and benchmarking repository. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Glasgow, UK, 4–9 May 2019; pp. 1–12.
23. LeCun, Y. The MNIST Database of Handwritten Digits; NEC Research Institute: Princeton, NJ, USA, 1998.
24. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the CVPR09, Miami Beach, FL, USA, 20–26 June 2009.
25. Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. Natural Questions: A Benchmark for Question Answering Research. Trans. Assoc. Comput. Linguist. 2019, 7, 453–466.
26. Paulus, A.; Burgdorf, A.; Puleikis, L.; Langer, T.; Pomp, A.; Meisen, T. PLASMA: Platform for Auxiliary Semantic Modeling Approaches. In Proceedings of the 23rd International Conference on Enterprise Information Systems, Online, 26–28 April 2021; pp. 403–412.
27. Futia, G.; Vetrò, A.; De Martin, J.C. SeMi: A SEmantic Modeling machIne to build Knowledge Graphs with graph neural networks. SoftwareX 2020, 12, 100516.
Data Corpora

Publication | Year | Weather Forecast | Flight Status | Geocoding | Phone Directory |
---|---|---|---|---|---|
Goel [8] | 2012 | 0.89 | 0.97 | 0.98 | |
Ramnandan [9] | 2015 | 0.96 | 0.64 | 0.83 | |
Goel * [9] | 2015 | 0.88 | 0.42 | 0.7 | |
Pham [10] | 2016 | 0.95–0.96 | |||
Ruemmele [11] | 2018 | 0.98 |
Publication | Linking Open Drug Data | City (17) | Museum (5) | Museum (29) | Museum (28) | Weapon Ads | Weather | City (10) | Soccer | Weather Forecast | Flight Status | Geocoding | Phone Directory | AAC | ACE | CIDOC-CRM | DBpedia | Dublin Core | EDM | ElementsGr2 | FOAF | FRBR | GeoNames | ORE | Schema.org | SKOS | WGS84 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[8] | x | x | x | ||||||||||||||||||||||||
[9] | x | x | x | x | x | x | |||||||||||||||||||||
[10] | x | x | x | x | x | ||||||||||||||||||||||
[11] | x | x | x | x | x | ||||||||||||||||||||||
[13] | x | x | |||||||||||||||||||||||||
[14] | x | x | x | x | x | x | x | ||||||||||||||||||||
[15] | x | x | x | x | x | x | x | ||||||||||||||||||||
[16] | x | x | x | x | x | x | x | x | x | x | x | ||||||||||||||||
[17] | x | x | x | x | x | ||||||||||||||||||||||
[18] | x | x | |||||||||||||||||||||||||
[19] | x | x | x | x | x | x | x |
| | Words | Words / Attributes | Attributes in Metadata | Attributes in Metadata / Attributes | Concepts in Metadata | Concepts in Metadata / Used Concepts |
---|---|---|---|---|---|---|
mean | 261.98 | 14.61 | 9.06 | 0.51 | 10.83 | 0.52 |
std | 180.68 | 8.53 | 6.56 | 0.29 | 4.20 | 0.13 |
min | 37.00 | 3.72 | 0.00 | 0.00 | 3.00 | 0.24 |
25% | 128.00 | 7.57 | 4.00 | 0.25 | 8.00 | 0.43 |
50% | 211.00 | 13.09 | 8.00 | 0.50 | 10.00 | 0.53 |
75% | 337.00 | 20.12 | 12.00 | 0.79 | 13.00 | 0.62 |
max | 948.00 | 44.91 | 30.00 | 1.00 | 21.00 | 0.82 |
| | Concepts | Unique Concepts | Context Concepts | Data Concepts | Unmodeled Attributes | Relations | Unique Relations |
---|---|---|---|---|---|---|---|
mean | 23.74 | 20.77 | 8.93 | 14.81 | 3.43 | 28.10 | 11.27 |
std | 8.40 | 5.98 | 3.78 | 6.58 | 2.79 | 11.34 | 3.36 |
min | 10.00 | 10.00 | 3.00 | 4.00 | 0.00 | 10.00 | 5.00 |
25% | 17.00 | 17.00 | 6.00 | 10.00 | 1.00 | 20.00 | 9.00 |
50% | 22.00 | 20.00 | 8.00 | 13.00 | 3.00 | 26.00 | 11.00 |
75% | 29.00 | 25.00 | 11.00 | 19.00 | 5.00 | 34.00 | 13.00 |
max | 55.00 | 36.00 | 23.00 | 32.00 | 11.00 | 68.00 | 20.00 |
| | Concepts | Relations |
---|---|---|
count | 483.00 | 117.00 |
mean | 5.04 | 24.26 |
std | 13.78 | 73.72 |
min | 1.00 | 1.00 |
25% | 1.00 | 1.00 |
50% | 1.00 | 4.00 |
75% | 3.00 | 10.00 |
max | 113.00 | 535.00 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Burgdorf, A.; Paulus, A.; Pomp, A.; Meisen, T. VC-SLAM—A Handcrafted Data Corpus for the Construction of Semantic Models. Data 2022, 7, 17. https://doi.org/10.3390/data7020017