Abstract
Modern databases have to cope with multi-dimensional queries. To process such queries efficiently, query optimizers rely on multi-dimensional selectivity estimation techniques, which in turn typically rely on histograms. A core challenge of histogram construction is detecting regions whose density is higher than that of their surroundings. In this paper, we show that subspace clustering algorithms, which detect exactly such regions, can be used to build high-quality histograms in multi-dimensional spaces. The clusters are transformed into a memory-efficient histogram representation while preserving most of the information needed for selectivity estimation. We derive a formal criterion for transforming clusters into buckets that minimizes the introduced estimation error. Since finding optimal buckets is hard in practice, we propose a heuristic. Our experiments show that our approach is efficient in both runtime and memory usage. Overall, we demonstrate that subspace clustering enables multi-dimensional selectivity estimation with low estimation errors.
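To make the estimation setting concrete, the following minimal Python sketch shows how a multi-dimensional bucket histogram answers a conjunctive range query under the common assumption that tuples are uniformly distributed inside each bucket. The `Bucket` type, its axis-aligned bounding boxes, and the uniformity assumption are illustrative assumptions only; the sketch is not the paper's cluster-to-bucket transformation, its error-minimizing criterion, or its heuristic.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative types only; the paper's actual bucket layout and
# cluster-to-bucket transformation are not reproduced here.
Range = Tuple[float, float]  # inclusive [lo, hi] interval per dimension

@dataclass
class Bucket:
    box: List[Range]   # axis-aligned bounding box, one interval per dimension
    count: float       # number of tuples assigned to this bucket

def overlap_fraction(box: List[Range], query: List[Range]) -> float:
    """Fraction of the bucket's volume that intersects the query box."""
    frac = 1.0
    for (blo, bhi), (qlo, qhi) in zip(box, query):
        width = bhi - blo
        if width <= 0:                       # degenerate (point) dimension
            frac *= 1.0 if qlo <= blo <= qhi else 0.0
            continue
        inter = min(bhi, qhi) - max(blo, qlo)
        if inter <= 0:                       # no overlap in this dimension
            return 0.0
        frac *= inter / width
    return frac

def estimate_cardinality(buckets: List[Bucket], query: List[Range]) -> float:
    """Estimated number of tuples matching a conjunctive range query,
    assuming a uniform distribution of tuples inside each bucket."""
    return sum(b.count * overlap_fraction(b.box, query) for b in buckets)

# Usage: two 2-d buckets, query selects x in [0, 5] and y in [0, 5]
buckets = [Bucket([(0, 10), (0, 10)], count=1000),
           Bucket([(4, 6), (4, 6)], count=500)]
print(estimate_cardinality(buckets, [(0, 5), (0, 5)]))  # 250 + 125 = 375
```

On this toy input, each bucket contributes its tuple count scaled by the fraction of its volume covered by the query box; the quality of such estimates depends on how well the buckets capture the dense regions that the paper detects via subspace clustering.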
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
Cite this paper
Khachatryan, A., Müller, E., Böhm, K., Kopper, J. (2011). Efficient Selectivity Estimation by Histogram Construction Based on Subspace Clustering. In: Bayard Cushing, J., French, J., Bowers, S. (eds) Scientific and Statistical Database Management. SSDBM 2011. Lecture Notes in Computer Science, vol 6809. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22351-8_22
DOI: https://doi.org/10.1007/978-3-642-22351-8_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22350-1
Online ISBN: 978-3-642-22351-8