Article

Aggregate computation over data streams

Authors:

Ying Zhang

APWeb'08: Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development

Pages 10 - 25

Published: 26 April 2008

Abstract

High-speed data streams are now a widely recognized phenomenon. Many applications, including relational query processing, data mining, and high-speed network management, need to compute statistics over such streams. In this paper, we survey three important kinds of aggregate computation over data streams: frequency moments, frequency counts, and order statistics.
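
To make the three aggregates concrete, the sketch below computes each one exactly over a small in-memory list. It is only an illustrative baseline under stated assumptions: the function names, the threshold convention, and the rank convention are hypothetical choices made here, not definitions taken from the paper. The streaming algorithms the survey covers approximate these quantities in limited space and are not shown.

```python
from collections import Counter
import math


def frequency_moment(stream, k):
    """Exact k-th frequency moment: sum of f_i ** k over distinct items."""
    counts = Counter(stream)
    if k == 0:
        return len(counts)  # F_0 is the number of distinct items
    return sum(f ** k for f in counts.values())


def frequent_items(stream, threshold):
    """Items whose exact frequency exceeds threshold * stream length.

    The strict '>' comparison is an assumption made for this sketch.
    """
    items = list(stream)
    counts = Counter(items)
    cutoff = threshold * len(items)
    return {item: f for item, f in counts.items() if f > cutoff}


def order_statistic(stream, phi):
    """Element of rank ceil(phi * n) in sorted order (1-based), 0 < phi <= 1.

    The rank convention is an assumption made for this sketch.
    """
    ordered = sorted(stream)
    rank = max(1, math.ceil(phi * len(ordered)))
    return ordered[rank - 1]


stream = [3, 1, 3, 2, 3, 1, 5, 3]
f0 = frequency_moment(stream, 0)    # number of distinct items
f2 = frequency_moment(stream, 2)    # sum of squared frequencies
heavy = frequent_items(stream, 0.25)
median = order_statistic(stream, 0.5)
```

All three functions store or sort the whole input, which is exactly what a streaming setting cannot afford; they are useful mainly as ground truth when checking an approximate summary.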

Cited By

Ezzati-Jivan, N., Dagenais, M. (2015). Cube data model for multilevel statistics computation of live execution traces. Concurrency and Computation: Practice & Experience 27(5), 1069-1091. DOI: 10.1002/cpe.3272. Online publication date: 10-Apr-2015.
https://dl.acm.org/doi/10.1002/cpe.3272

Index Terms

Aggregate computation over data streams
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
  2. Information retrieval
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

APWeb'08: Proceedings of the 10th Asia-Pacific web conference on Progress in WWW research and development

April 2008

698 pages

ISBN: 3540788484

Editors:
Yanchun Zhang (Victoria University, School of Computer Science and Mathematics, Melbourne, VIC, Australia)
Guandong Xu (Victoria University, School of Computer Science and Mathematics, Melbourne, VIC, Australia)
Ge Yu (Northeastern University, Department of Computer Science and Engineering, Shenyang, China)
Elisa Bertino (Purdue University, Department of Computer Science, West Lafayette, IN)

Sponsors

NSF of China: National Natural Science Foundation of China

Publisher

Springer-Verlag

Berlin, Heidelberg

Qualifiers

Article
