As most current query processing architectures are already pipelined, it seems logical to apply them to data streams. However, two classes of query operators are impractical for processing long or unbounded data streams. Unbounded stateful operators maintain state with no upper bound on its size, and so eventually run out of memory. Blocking operators read the entire input before emitting a single output, and so might never produce a result. We believe that a priori semantic knowledge of a data stream can permit the use of such operators in some cases. We explore a kind of stream semantics called punctuated streams . Punctuations in a stream mark the end of substreams, allowing us to view a non-terminating stream as a mixture of terminating streams. We introduce three kinds of invariants to specify the proper behavior of query operators in the presence of punctuation. Pass invariants unblock blocking operators by defining when such an operator can pass results on. Keep invariants define what must be kept in local state to continue successful operation. Propagation invariants define when an operator can pass punctuation on. We then present a strategy for proving that implementations of these invariants are faithful to their finite table counterparts.
In practice, it is important to answer the following question: “How much additional overhead is required when using punctuations__ __” We use the scenario of a monitoring system for an online auction. Streams of bids, new items, and new users are sent to an online auction system. There are many interesting queries that can be posed over these auction streams. We define queries for this scenario, and execute them with different kinds and amounts of punctuations embedded in the input streams. We show that, for a reasonable ratio of punctuations to data items, the overhead is minimal. Additionally, we compare the behavior of a query using punctuations with the behavior of the same query using slack over data streams with disorder.
Clearly, not all punctuations are useful to a particular query, and it would be useful to make a determination of when they are. That is, we would like to answer the question “Can stream query Q benefit from a particular set of punctuations__ __” To that end, we first define punctuation schemes to specify the collection of punctuations that will be presented to a query on a particular data stream. We show how both punctuations and query operators induce groupings over the items in the domain of the input(s). We show that a query benefits from an input punctuation scheme (in terms of being able to produce a given output scheme), if each set in the groupings induced by the operators of the query is covered by a finite number of punctuations in the scheme—a kind of compactness .
We conclude with discussion on possible future directions of research related to punctuations and data streams. These directions focus on a variety of questions, ranging from issues in query optimization to other possible semantics that can be expressed using punctuations.
Cited By
- Gorawski M and Chrószcz A The Design of Stream Database Engine in Concurrent Environment Proceedings of the Confederated International Conferences, CoopIS, DOA, IS, and ODBASE 2009 on On the Move to Meaningful Internet Systems: Part II, (1033-1049)
- Gorawski M and Chrószcz A Query processing using negative and temporal tuples in stream query engines Proceedings of the 4th IFIP TC 2 Central and East European conference on Advances in Software Engineering Techniques, (70-83)
- Golab L, Johnson T, Seidel J and Shkapenyuk V Stream warehousing with DataDepot Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, (847-854)
- Li J, Tufte K, Shkapenyuk V, Papadimos V, Johnson T and Maier D (2008). Out-of-order processing, Proceedings of the VLDB Endowment, 1:1, (274-288), Online publication date: 1-Aug-2008.
Index Terms
- Punctuated data streams
Recommendations
Exploiting Punctuation Semantics in Continuous Data Streams
As most current query processing architectures are already pipelined, it seems logical to apply them to data streams. However, two classes of query operators are impractical for processing long or infinite data streams. Unbounded stateful operators ...
Safety guarantee of continuous join queries over punctuated data streams
VLDB '06: Proceedings of the 32nd international conference on Very large data basesContinuous join queries (CJQ) are needed for correlating data from multiple streams. One fundamental problem for processing such queries is that since the data streams are infinite, this would require the join operator to store infinite states and ...
Sketching distributed sliding-window data streams
While traditional data management systems focus on evaluating single, ad hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely ...