DOI: 10.1145/1871437.1871547
research-article

Pattern discovery for large mixed-mode database

Published: 26 October 2010 Publication History

Abstract

In business and industry today, large databases with mixed data types (continuous and categorical) are very common, and there is a great need to discover patterns in them for knowledge interpretation and understanding. For classification, this problem has previously been solved as a discrete-data problem by first discretizing the continuous data based on the class-attribute interdependence relationship. However, no proper solution exists so far when class information is unavailable. Hence, important pattern post-processing tasks such as pattern clustering and summarization cannot be applied to mixed-mode data. This paper presents a new method for solving the problem, based on two essential concepts. (1) Although class information is absent, for a correlated dataset the attribute with the strongest interdependence with the others in the group can be used to drive the discretization of the continuous data. (2) For a large database, correlated attribute groups must first be obtained by attribute clustering before (1) can be applied. Based on (1) and (2), pattern discovery methods are developed for mixed-mode data. Extensive experiments using synthetic and real-world data were conducted to validate the usefulness and effectiveness of the proposed method.
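
Concept (1) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the interdependence measure or the discretization procedure, so the sketch assumes plain mutual information as the measure, a provisional equal-width binning to make continuous attributes comparable, and a single cut point per continuous attribute. All function names and the toy data are hypothetical, and the attribute group is assumed to have been formed already (concept (2) is not shown).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def equal_width_bins(values, k):
    """Provisional binning (an assumption of this sketch, not taken from the paper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]

def pick_driver(columns):
    """Return the attribute whose summed mutual information with the others is largest."""
    def total(name):
        return sum(mutual_information(columns[name], col)
                   for other, col in columns.items() if other != name)
    return max(columns, key=total)

def best_cut(values, driver_values):
    """Single cut point on a continuous attribute that maximizes MI with the driver."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints,
               key=lambda t: mutual_information([v > t for v in values], driver_values))

# Toy correlated group: one continuous and two categorical attributes, no class label.
temp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 10.0]
grade = ["low"] * 4 + ["high"] * 4
region = ["n"] * 3 + ["s"] * 5

columns = {"temp": equal_width_bins(temp, 2), "grade": grade, "region": region}
driver = pick_driver(columns)
cut = best_cut(temp, columns[driver])
print(driver, cut)
```

On this toy data the driver is `grade`, and the driver-based cut for `temp` falls at 4.5, whereas the provisional equal-width binning places its boundary at 5.5 and therefore mixes the two `grade` values in one interval.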



Information & Contributors

Information

Published In

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN:9781450300995
DOI:10.1145/1871437

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. attribute clustering
  2. data mining
  3. mixed mode data
  4. mutual information
  5. pattern discovery
  6. unsupervised discretization

Qualifiers

  • Research-article

Conference

CIKM '10

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2018) Discovery and disentanglement of aligned residue associations from aligned pattern clusters to reveal subgroup characteristics. BMC Medical Genomics, 11:S5. DOI: 10.1186/s12920-018-0417-z. Online publication date: 20-Nov-2018
  • (2018) Clustering driving trip trajectory data based on pattern discovery techniques. 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), 453-457. DOI: 10.1109/ICBDA.2018.8367726. Online publication date: Mar-2018
  • (2018) Mining spatio-temporal patterns in multivariate spatial time series. 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA), 50-54. DOI: 10.1109/ICBDA.2018.8367650. Online publication date: Mar-2018
  • (2017) Privacy-preserving trajectory classification of driving trip data based on pattern discovery techniques. 2017 IEEE International Conference on Big Data (Big Data), 3816-3825. DOI: 10.1109/BigData.2017.8258383. Online publication date: Dec-2017
  • (2016) An Effective Pattern Pruning and Summarization Method Retaining High Quality Patterns With High Area Coverage in Relational Datasets. IEEE Access, 4, 7847-7858. DOI: 10.1109/ACCESS.2016.2624418. Online publication date: 2016
  • (2014) A Model-Based Multivariate Time Series Clustering Algorithm. Trends and Applications in Knowledge Discovery and Data Mining, 805-817. DOI: 10.1007/978-3-319-13186-3_72. Online publication date: 26-Nov-2014
  • (2010) Unsupervised discovery of fuzzy patterns in gene expression data. 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 269-273. DOI: 10.1109/BIBM.2010.5706575. Online publication date: Dec-2010
