[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/1557019.1557065acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Tell me something I don't know: randomization strategies for iterative data mining

Published: 28 June 2009 Publication History

Abstract

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure.
In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

Supplementary Material

JPG File (p379-hanhijarvi.jpg)
MP4 File (p379-hanhijarvi.mp4)

References

[1]
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289--300, 1995.
[2]
Julian Besag. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag\_tl.pdf, 2004.
[3]
Julian Besag and Peter Clifford. Generalized Monte Carlo significance tests. Biometrica, 76(4):633--642, 1989.
[4]
T. Calders and B. Goethals. Non-derivable itemset mining. Data Mining and Knowledge Discovery, 14(1):171--206, 2007.
[5]
G. W. Cobb and Y.-P. Chen. An application of markov chain monte carlo to community ecology. American Mathematical Monthly, (110):264--288, 2003.
[6]
Mikael Fortelius and Jussi Eronen. Now - neogene of the old world. http://www.helsinki.fi/science/now/, 2005. Database of fossil mammals.
[7]
Arianna Gallo, Tijl Bie, and Nello Cristianini. Mini: Mining informative non-redundant itemsets. In Proc. of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases, 2007.
[8]
Charles J. Geyer. Markov chain monte carlo maximum likelihood. In Computing Science and Statistics: Proc. of 23rd Symposium on the Interface Foundation, 1991.
[9]
Aristides Gionis, Heikki Mannila, Taneli Mielikainen, and Panayiotis Tsaparas. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), 2007.
[10]
Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, 2000.
[11]
S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 1979.
[12]
Szymon Jaroszewicz. Interactive hmm construction based on interesting sequences. In Local Patterns to Global Models (LeGo'08) Workshop at the the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2008.
[13]
David Jensen. Knowledge discovery through induction with randomization testing. In Proc. of the 1991 Knowledge Discovery in Databases Workshop, 1991.
[14]
N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21(1):1087--91, 1953.
[15]
Taneli Mielikainen and Heikki Mannila. The pattern ordering problem. Lecture Notes in Computer Science, 2838:327--338, 2003.
[16]
Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, and Heikki Mannila. Randomization of real-valued matrices for assessing the significance of data mining results. In Proc. of the 2008 SIAM International Conference on Data Mining, 2008.
[17]
Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.
[18]
Geoffrey I. Webb. Discovering significant patterns. Mach. Learn., 68(1):1--33, 2007.
[19]
Geoffrey I. Webb. Layered critical values: A powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2{3):307--323, 2008.
[20]
Peter H. Westfall and S. Stanley Young. Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley, 1993.

Cited By

View all

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
June 2009
1426 pages
ISBN:9781605584959
DOI:10.1145/1557019
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2009

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. inverse frequent set mining
  2. matrix randomization
  3. statistical significance

Qualifiers

  • Research-article

Conference

KDD09

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)12
  • Downloads (Last 6 weeks)1
Reflects downloads up to 15 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2022)What to expect from a set of itemsets?Information Sciences: an International Journal10.1016/j.ins.2021.12.115593:C(314-340)Online publication date: 1-May-2022
  • (2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
  • (2020)Supervised Human-Guided Data ExplorationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-43823-4_8(85-101)Online publication date: 28-Mar-2020
  • (2019)A Constrained Randomization Approach to Interactive Visual Data Exploration with Subjective FeedbackIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2019.2907082(1-1)Online publication date: 2019
  • (2019)Interactive visual data exploration with subjective feedback: an information-theoretic approachData Mining and Knowledge Discovery10.1007/s10618-019-00655-xOnline publication date: 3-Oct-2019
  • (2019)A tutorial on statistically sound pattern discoveryData Mining and Knowledge Discovery10.1007/s10618-018-0590-x33:2(325-377)Online publication date: 1-Mar-2019
  • (2019)On Coupling FCA and MDL in Pattern MiningFormal Concept Analysis10.1007/978-3-030-21462-3_23(332-340)Online publication date: 23-May-2019
  • (2019)Tiler: Software for Human-Guided Data ExplorationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-10997-4_49(672-676)Online publication date: 18-Jan-2019
  • (2018)Interactive Discovery of Coordinated Relationship Chains with Maximum Entropy ModelsACM Transactions on Knowledge Discovery from Data10.1145/304701712:1(1-34)Online publication date: 31-Jan-2018
  • (2017)Efficiently Discovering Locally Exceptional Yet Globally Representative Subgroups2017 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2017.29(197-206)Online publication date: Nov-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media