[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

A fast and effective method to find correlations among attributes in databases

Published: 01 June 2007 Publication History

Abstract

The problem of identifying meaningful patterns in a database lies at the very heart of data mining. A core objective of data mining processes is the recognition of inter-attribute correlations. Not only are correlations necessary for predictions and classifications – since rules would fail in the absence of pattern – but also the identification of groups of mutually correlated attributes expedites the selection of a representative subset of attributes, from which existing mappings allow others to be derived. In this paper, we describe a scalable, effective algorithm to identify groups of correlated attributes. This algorithm can handle non-linear correlations between attributes, and is not restricted to a specific family of mapping functions, such as the set of polynomials. We show the results of our evaluation of the algorithm applied to synthetic and real world datasets, and demonstrate that it is able to spot the correlated attributes. Moreover, the execution time of the proposed technique is linear on the number of elements and of correlations in the dataset.

References

[1]
Aha DW, Bankert RL (1995) A comparative evaluation of sequential feature selection algorithms. In: Proceedings of the 5th international workshop on artificial intelligence and statistics, Ft. Lauderdale, FL, USA, pp 1–7
[2]
Babu S, Garofalakis M, Rastogi R (2001) SPARTAN: a model-based semantic compression system for massive data tables. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data (SIGMOD’01), Santa Barbara, CA, USA, pp 283–294
[3]
Balan AGR, Traina AJM, Traina C Jr, Azevedo-Marques PM (2005) Fractal analysis of image textures for indexing and retrieval by content. In: Proceedings of the 18th IEEE symposium on computer-based medical systems (CBMS’05), Dublin, Ireland, pp 581–586
[4]
Barbara D and Chen P Using self-similarity to cluster large data sets Data Min Knowl Discov 2003 7 2 123-152
[5]
Belussi A, Faloutsos C (1995) Estimating the selectivity of spatial queries using the ‘Correlation Fractal Dimension’. In: Proceedings of the 21st international conference on very large data bases (VLDB’95), Zurich, Switzerland, pp 299–310
[6]
Belussi A and Faloutsos C Self-spatial join selectivity estimation using fractal concepts ACM Trans Inf Syst 1998 16 2 161-201
[7]
Böhm C A cost model for query processing in high dimensional data spaces ACM Trans Database Syst 2000 25 2 129-178
[8]
Böhm C, Kriegel H-P (2000) Dynamically optimizing high-dimensional index structures. In: Advances in Database Technology—EDBT 2000, 7th international conference on extending database technology. Lecture notes in computer science, vol 1777, Konstanz, Germany, pp 36–50
[9]
Blake C, Merz C (1998) UCI repository of machine learning databases. http://www.ics.uci. edu/∼mlearn/MLRePository.html
[10]
Blum AL and Langley P Selection of relevant features and examples in machine learning Artif Intell 1997 97 1–2 245-271
[11]
Calvo R, Partridge M, Jabri M (1998) A comparative study of principal component analysis techniques. In: Proceedings of the 9th Australian conference on neural networks, Brisbane, Australia
[12]
Chakrabarti D, Faloutsos C (2002) F4: large-scale automated forecasting using fractals. In: International conference on information and knowledge management (CIKM), vol 1, McLean, EUA, pp 2–9
[13]
Chakrabarti K, Keogh E, Mehrotra S, and Pazzani M Locally adaptive dimensionality reduction for indexing large time series databases ACM Trans Database Syst (TODS) 2002 27 2 188-228
[14]
Chang K and Ghosh J Principal curves for nonlinear feature extraction and classification Appl Artif Neural Netw Image Process III (SPIE) 1998 3307 120-129
[15]
Dash M, Liu H, Yao J (1997) Dimensionality reduction for unsupervised data. In: Proceedings of the 9th IEEE international conference on tools with artificial intelligence (ICTAI’97), Newport Beach, CA, USA, pp 532–539
[16]
DeMers D, Cottrell G (1993) Non-linear dimensionality reduction. In: Advances in neural information processing systems (NIPS conference), vol 5, Denver, CO, USA, pp 580–587
[17]
Faloutsos C, Seeger B, Traina AJM, Traina C (2000) Spatial join selectivity using power laws. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data (SIGMOD’00), Dallas, TX, USA, pp 177–188
[18]
Fayyad UM Mining databases: towards algorithms for knowledge discovery IEEE Data Eng Bull 1998 21 1 39-48
[19]
Hyvärinen A Survey on independent component analysis Neural Comput Surv 1999 2 94-128
[20]
Jebara TS, Jaakkola TS (2000) Feature selection and dualities in maximum entropy discrimination. In: Proceedings of the 16th conference on uncertainty in artificial intelligence (UAI’2000), San Francisco, CA, USA, pp 291–300
[21]
Jiang D, Tang C, and Zhang A Cluster analysis for gene expression data: a survey IEEE Trans Knowl Data Eng 2004 16 11 1370-1386
[22]
John GH, Kohavi R, Pfleger K (1994) Irrelevant features and the subset selection problem. In: Proceedings of the 11th international conference on machine learning (ICML’94), New Brunswick, NJ, USA, pp 121–129
[23]
Kamel I, Faloutsos C (1994) Hilbert R-tree: an improved R-tree using fractals. In: Proceedings of 20th international conference on very large data bases (VLDB’94), Santiago de Chile, Chile, pp 500–509
[24]
Kantardzic M, Sadeghian P, Shen C (2004) The time diversification monitoring of a stock portfolio: an approach based on the fractal dimension. In: Proceedings of the 2004 ACM symposium on applied computing (SAC’04), pp 637–641
[25]
Kanth KVR, Agrawal D, Singh AK (1998) Dimensionality reduction for similarity searching in dynamic databases. In: Proceedings of the 1998 ACM SIGMOD international conference on management of data (SIGMOD’98), Seattle, WA, USA, pp 166–176
[26]
Keogh EJ, Chakrabarti K, Mehrotra S, Pazzani MJ (2001) Locally adaptive dimensionality reduction for indexing large time series databases. In: Proceedings of the 2001 ACM SIGMOD international conference on management of data (SIGMOD’01), Santa Barbara, CA, USA, pp 151–162
[27]
Kira K, Rendell LA (1992) A practical approach to feature selection. In: Proceedings of the 9th international workshop on machine learning (ML’92), Aberdeen, Scotland, pp 249–256
[28]
Kononenko I (1994) Estimating attributes: analysis and extensions of RELIEF. In: Proceeding of European conference on machine learning (ECML-94). Lecture notes in computer science, vol 784, Catania, Italy, pp 171–182
[29]
Korn F, Jagadish HV, Faloutsos C (1997) Efficiently supporting ad hoc queries in large datasets of time sequences. In: Proceedings of the 1997 ACM SIGMOD international conference on management of data (SIGMOD’97), Tucson, AZ, USA, pp 289–300
[30]
Korn F, Pagel B-U, and Faloutsos C On the ‘dimensionality curse’ and the self-similarity blessing IEEE Trans Knowl Data Eng 2001 13 1 96-111
[31]
Kramer MA Nonlinear principal component analysis using autoassociative neural networks Am Inst Chem Eng J (AIChE) 1991 37 2 233-243
[32]
Langley P, Sage S (1997) Scaling to domains with many irrelavant features. In: Computational learning theory and natural learning system, vol IV, Cambrige, MA, USA
[33]
Liu H and Yu L Toward integrating feature selection algorithms for classification and clustering IEEE Trans Knowl Data Eng 2005 17 4 491-502
[34]
Liu Y, Lazar NA, Rothfus WE, Buzoianu M, Kanade T (2001) Classification-driven feature space reduction for semantic-based neuroimage retrieval. In: VISIM workshop: information retrieval and exploration in large medical image collections, Utrecht, The Netherlands
[35]
Mitra P, Murthy C, and Pal S Unsupervised feature selection using feature similarity IEEE Trans Pattern Anal Mach Intel 2002 24 4 301-312
[36]
Moon B, Jagadish HV, Faloutsos C, and Saltz JH Analysis of the clustering properties of the Hilbert space-filling curve IEEE Trans Knowl Data Eng 2001 13 1 124-141
[37]
Pagel B-U, Korn F, Faloutsos C (2000) Deflating the dimensionality curse using multiple fractal dimensions. In: Proceedings of the 16th international conference on data engineering (ICDE’00), San Diego, CA, USA, pp 589–598
[38]
Quinlan JR C4.5: programs for machine learning 1993 San Francisco, CA, USA Morgan Kaufmann
[39]
Reynolds HT The analysis of cross-classifications 1977 New York, NY, USA The Free Press
[40]
Robnik-Sikonja M (1997) CORE—a system that predicts continuous variables. In: Proceedings of ERK’97, Portoroz, Slovenia
[41]
Roweis ST and Saul LK Nonlinear dimensionality reduction by locally linear embedding Science 2000 290 2323-2326
[42]
Scherf M, Brauer W (1997) Feature selection by means of a feature weighting approach. Technical report FKI-221-97, Munich University of Technology, Munich, Germany
[43]
Schölkopf AS and Müller K-R Nonlinear component analysis as a Kernel Eigenvalue problem Neural Comput 1998 10 1299-1319
[44]
Schroeder M Fractals, chaos, power laws: minutes from an infinite paradise 1991 New York, NY, USA W. H. Freeman and Company
[45]
Singh M, Provan GM (1995) A comparison of induction algorithms for selective and non-selective bayesian classifiers. In: Proceedings of the 12th international conference on machine learning (ICML’95), Tahoe City, CA, USA, pp 497–505
[46]
Takahashi T, Tokunaga R (1999) Nonlinear dimensionality reduction by multi layer perceptron using superposed energy. In: Proceedings of 1999 international symposium on nonlinear theory and its applications, vol 2, Hawaii, USA, pp 863–866
[47]
Tenenbaum JB, de Silva V, and Langford JC A global geometric framework for nonlinear dimensionality reduction Science 2000 290 2319-2323
[48]
Traina A, Traina C, Papadimitriou S, Faloutsos C (2001) Tri-Plots: scalable tools for multidimensional data mining. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), San Francisco, CA, USA, pp 184–193
[49]
Traina AJM, Castan̆ón CAB, Traina C (2003) MultiWaveMed: a system for medical image retrieval by wavelets transformations. In: Proceedings of the 16th IEEE symposium on computer-based medical systems, New York, NY, USA, pp 150–155
[50]
Traina C, Traina A, Wu L, Faloutsos C (2000) Fast feature selection using fractal dimension. In: Anais do XV Simpósio Brasileiro de Banco de Dados (SBBD’00), Joäo Pessoa, Brasil, pp 158–171
[51]
Traina C, Traina AJM, Faloutsos C (1999) Distance exponent: a new concept for selectivity estimation in metric trees. Technical report CMU-CS-99-110, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, USA
[52]
Traina C, Traina AJM, Faloutsos C, and Seeger B Fast indexing and visualization of metric data sets using Slim-Trees IEEE Trans Knowl Data Eng 2002 14 2 244-260
[53]
Turk M and Pentland A Eigenfaces for recognition J Cogn Neurosci 1991 3 1 71-86
[54]
Vafaie H, DeJong K (1993) Robust feature selection algorithms. In: Proceedings of the 5th international conference on tools with artificial intelligence (ICTAI ’93), Boston, MA, USA, pp 356–363
[55]
Vailaya A, Figueiredo MAT, Jain AK, and Zhang H-J Image classification for content-based indexing IEEE Trans Image Process 2001 10 1 117-130
[56]
Wactlar HD, Kanade T, Smith MA, and Stevens SM Intelligent access to digital video: informedia project IEEE Comput 1996 29 5 46-52
[57]
Witten IH and Frank E Data mining: practical machine learning tools with Java implementations 2000 San Francisco, CA, USA Morgan Kaufmann

Cited By

View all

Index Terms

  1. A fast and effective method to find correlations among attributes in databases
          Index terms have been assigned to the content through auto-classification.

          Recommendations

          Comments

          Please enable JavaScript to view thecomments powered by Disqus.

          Information & Contributors

          Information

          Published In

          cover image Data Mining and Knowledge Discovery
          Data Mining and Knowledge Discovery  Volume 14, Issue 3
          Jun 2007
          126 pages

          Publisher

          Kluwer Academic Publishers

          United States

          Publication History

          Published: 01 June 2007
          Accepted: 19 July 2006
          Received: 07 July 2005

          Author Tags

          1. Attribute correlation
          2. Attribute selection
          3. Intrinsic dimension
          4. Fractals

          Qualifiers

          • Research-article

          Contributors

          Other Metrics

          Bibliometrics & Citations

          Bibliometrics

          Article Metrics

          • Downloads (Last 12 months)0
          • Downloads (Last 6 weeks)0
          Reflects downloads up to 28 Dec 2024

          Other Metrics

          Citations

          Cited By

          View all
          • (2017)The neural network model synthesis based on the fractal analysisOptical Memory and Neural Networks10.3103/S1060992X1704009926:4(257-273)Online publication date: 1-Oct-2017
          • (2017)Feature selection for regression problems based on the Morisita estimator of intrinsic dimensionPattern Recognition10.1016/j.patcog.2017.05.00870:C(126-138)Online publication date: 1-Oct-2017
          • (2016)Improving Multivariate Data Streams ClusteringProcedia Computer Science10.1016/j.procs.2016.05.32580:C(461-471)Online publication date: 1-Jun-2016
          • (2015)Weighted Similarity Schemes for High Scalability in User-Based Collaborative FilteringMobile Networks and Applications10.1007/s11036-014-0544-520:4(497-507)Online publication date: 1-Aug-2015
          • (2014)Open issues for partitioning clustering methodsWiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery10.1002/widm.11274:3(161-177)Online publication date: 1-May-2014
          • (2013)Analysis of large scale climate dataProceedings of the 22nd International Conference on World Wide Web10.1145/2487788.2487986(517-526)Online publication date: 13-May-2013
          • (2013)Unsupervised fuzzy-rough set-based dimensionality reductionInformation Sciences: an International Journal10.1016/j.ins.2012.12.001229(106-121)Online publication date: 1-Apr-2013
          • (2009)Feature selection with dynamic mutual informationPattern Recognition10.1016/j.patcog.2008.10.02842:7(1330-1339)Online publication date: 1-Jul-2009

          View Options

          View options

          Media

          Figures

          Other

          Tables

          Share

          Share

          Share this Publication link

          Share on social media