Querying data provenance
Proceedings of the 2010 ACM SIGMOD International Conference on Management of …, 2010•dl.acm.org
Many advanced data management operations (eg, incremental maintenance, trust
assessment, debugging schema mappings, keyword search over databases, or query
answering in probabilistic databases), involve computations that look at how a tuple was
produced, eg, to determine its score or existence. This requires answers to queries such as,"
Is this data derivable from trusted tuples?";" What tuples are derived from this relation?"; or"
What score should this answer receive, given initial scores of the base tuples?". Such …
assessment, debugging schema mappings, keyword search over databases, or query
answering in probabilistic databases), involve computations that look at how a tuple was
produced, eg, to determine its score or existence. This requires answers to queries such as,"
Is this data derivable from trusted tuples?";" What tuples are derived from this relation?"; or"
What score should this answer receive, given initial scores of the base tuples?". Such …
Many advanced data management operations (e.g., incremental maintenance, trust assessment, debugging schema mappings, keyword search over databases, or query answering in probabilistic databases), involve computations that look at how a tuple was produced, e.g., to determine its score or existence. This requires answers to queries such as, "Is this data derivable from trusted tuples?"; "What tuples are derived from this relation?"; or "What score should this answer receive, given initial scores of the base tuples?". Such questions can be answered by consulting the provenance of query results.
In recent years there has been significant progress on formal models for provenance. However, the issues of provenance storage, maintenance, and querying have not yet been addressed in an application-independent way. In this paper, we adopt the most general formalism for tuple-based provenance, semiring provenance. We develop a query language for provenance, which can express all of the aforementioned types of queries, as well as many more; we propose storage, processing and indexing schemes for data provenance in support of these queries; and we experimentally validate the feasibility of provenance querying and the benefits of our indexing techniques across a variety of application classes and queries.
ACM Digital Library