Storing, Tracking, and Querying Provenance in Linked Data

Published: 01 August 2017

Abstract

The proliferation of heterogeneous Linked Data on the Web poses new challenges to database systems. In particular, the capacity to store, track, and query provenance data is becoming a pivotal feature of modern triplestores. We present methods extending a native RDF store to efficiently handle the storage, tracking, and querying of provenance in RDF data. We describe a reliable and understandable specification of the way results were derived from the data and how particular pieces of data were combined to answer a query. Subsequently, we present techniques to tailor queries with provenance data. We empirically evaluate the presented methods and show that the overhead of storing and tracking provenance is acceptable. Finally, we show that tailoring a query with provenance information can also significantly improve the performance of query execution.

Published In

IEEE Transactions on Knowledge and Data Engineering, Volume 29, Issue 8
Aug. 2017
202 pages

Publisher

IEEE Educational Activities Department

United States

Qualifiers

  • Research-article

Cited By

  • (2023) A Novel Graph Indexing Approach for Uncovering Potential COVID-19 Transmission Clusters. ACM Transactions on Knowledge Discovery from Data, 17:2, pp. 1-24. DOI: 10.1145/3538492. Online publication date: 20-Feb-2023
  • (2022) What a Publication Tells You—Benefits of Narrative Information Access in Digital Libraries. Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries, pp. 1-8. DOI: 10.1145/3529372.3530928. Online publication date: 20-Jun-2022
  • (2020) Modeling Narrative Structures in Logical Overlays on Top of Knowledge Repositories. Conceptual Modeling, pp. 250-260. DOI: 10.1007/978-3-030-62522-1_18. Online publication date: 3-Nov-2020
  • (2020) Context-Compatible Information Fusion for Scientific Knowledge Graphs. Digital Libraries for Open Knowledge, pp. 33-47. DOI: 10.1007/978-3-030-54956-5_3. Online publication date: 25-Aug-2020
  • (2019) Provenance compression scheme based on graph patterns for large RDF documents. The Journal of Supercomputing, 76:8, pp. 6376-6398. DOI: 10.1007/s11227-019-02926-2. Online publication date: 8-Jun-2019
  • (2019) Dataset search: a survey. The VLDB Journal — The International Journal on Very Large Data Bases, 29:1, pp. 251-272. DOI: 10.1007/s00778-019-00564-x. Online publication date: 24-Aug-2019
  • (2019) Metadata Discovery Using Data Sampling and Exploratory Data Analysis. Model and Data Engineering, pp. 106-120. DOI: 10.1007/978-3-030-32065-2_8. Online publication date: 28-Oct-2019
  • (2019) BTC-2019: The 2019 Billion Triple Challenge Dataset. The Semantic Web – ISWC 2019, pp. 163-180. DOI: 10.1007/978-3-030-30796-7_11. Online publication date: 26-Oct-2019
  • (2018) An Investigative Study on the Quality Aspects of Linked Open Data. Proceedings of the 2018 International Conference on Cloud Computing and Internet of Things, pp. 33-39. DOI: 10.1145/3291064.3291074. Online publication date: 29-Oct-2018
