Maintaining nonparametric estimators over data streams
Abstract
Effective processing and analysis of data streams is of utmost importance for many emerging applications, such as network monitoring, traffic management, and financial tickers. Beyond the management of transient and potentially unbounded streams, their analysis with advanced data mining techniques has been identified as a research challenge. A well-established class of mining techniques is based on nonparametric statistics, in which nonparametric density estimation is one of the essential building blocks. In this paper, we examine the maintenance of nonparametric estimators over data streams. We present a tailored framework that incrementally maintains a nonparametric estimator over a data stream while consuming only a fixed amount of memory. Our framework is memory-adaptive and therefore supports a fundamental requirement for an operator within a data stream management system. As an example, we apply our framework to selectivity estimation of range queries, a popular use case for statistical estimators. We provide an analysis of the processing cost and report experimental comparisons on both synthetic and real-world data streams. The experiments demonstrate the accuracy of the estimators derived from our framework.
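To make the setting concrete, the following is a minimal sketch of a fixed-memory density estimator over a one-dimensional stream, used to estimate the selectivity of a range query. It is not the framework presented in the paper: the class name `ReservoirKDE`, the use of reservoir sampling to bound memory, the Gaussian kernel, and the Silverman-style bandwidth rule are all assumptions chosen for illustration.

```python
import math
import random
import statistics


def _phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


class ReservoirKDE:
    """Illustrative fixed-memory kernel density estimator (hypothetical name).

    Assumption: memory is bounded by keeping a uniform reservoir sample of
    the stream; the paper's own maintenance scheme is not reproduced here.
    """

    def __init__(self, capacity, seed=None):
        if capacity < 1:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.sample = []   # at most `capacity` retained stream values
        self.seen = 0      # number of stream elements consumed so far
        self._rng = random.Random(seed)

    def update(self, x):
        """Consume one stream element (reservoir sampling, Algorithm R)."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(x)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = x

    def shrink(self, capacity):
        """Adapt to a smaller memory budget by uniformly subsampling."""
        if not 1 <= capacity <= self.capacity:
            raise ValueError("capacity must be in [1, current capacity]")
        if len(self.sample) > capacity:
            self.sample = self._rng.sample(self.sample, capacity)
        self.capacity = capacity

    def bandwidth(self):
        """Silverman-style rule of thumb (an assumed choice of bandwidth)."""
        n = len(self.sample)
        sigma = statistics.pstdev(self.sample) if n > 0 else 0.0
        if sigma == 0.0:
            return 1e-9  # degenerate sample: fall back to a tiny bandwidth
        return 1.06 * sigma * n ** (-0.2)

    def selectivity(self, a, b):
        """Estimated fraction of stream elements falling in [a, b]."""
        if not self.sample or b <= a:
            return 0.0
        h = self.bandwidth()
        # Integrating a Gaussian kernel over [a, b] gives a CDF difference.
        total = sum(_phi((b - x) / h) - _phi((a - x) / h) for x in self.sample)
        return total / len(self.sample)


if __name__ == "__main__":
    source = random.Random(7)
    est = ReservoirKDE(capacity=500, seed=1)
    for _ in range(10_000):
        est.update(source.random())
    print(est.selectivity(0.25, 0.75))
```

In this sketch the memory footprint never exceeds `capacity` values regardless of stream length, and `shrink` shows one simple way an estimator could react when the system reduces its memory budget.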