More Web Proxy on the site http://driver.im/

research-article

Tell me something I don't know: randomization strategies for iterative data mining

Authors:

Sami Hanhijärvi,

Kai Puolamäki,

Heikki MannilaAuthors Info & Claims

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 379 - 388

https://doi.org/10.1145/1557019.1557065

Published: 28 June 2009 Publication History

Abstract

There is a wide variety of data mining methods available, and it is generally useful in exploratory data analysis to use many different methods for the same dataset. This, however, leads to the problem of whether the results found by one method are a reflection of the phenomenon shown by the results of another method, or whether the results depict in some sense unrelated properties of the data. For example, using clustering can give indication of a clear cluster structure, and computing correlations between variables can show that there are many significant correlations in the data. However, it can be the case that the correlations are actually determined by the cluster structure.

In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account. The randomization methods can be used in iterative data mining. At each step in the data mining process, the randomization produces random samples from the set of data matrices satisfying the already discovered patterns or models. That is, given a data set and some statistics (e.g., cluster centers or co-occurrence counts) of the data, the randomization methods sample data sets having similar values of the given statistics as the original data set. We use Metropolis sampling based on local swaps to achieve this. We describe experiments on real data that demonstrate the usefulness of our approach. Our results indicate that in many cases, the results of, e.g., clustering actually imply the results of, say, frequent pattern discovery.

Supplementary Material

JPG File (p379-hanhijarvi.jpg)

Download
10.43 KB

MP4 File (p379-hanhijarvi.mp4)

Download
160.57 MB

References

[1]

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289--300, 1995.

[2]

Julian Besag. Markov chain Monte Carlo methods for statistical inference. http://www.ims.nus.edu.sg/Programs/mcmc/files/besag\_tl.pdf, 2004.

[3]

Julian Besag and Peter Clifford. Generalized Monte Carlo significance tests. Biometrica, 76(4):633--642, 1989.

[4]

T. Calders and B. Goethals. Non-derivable itemset mining. Data Mining and Knowledge Discovery, 14(1):171--206, 2007.

Digital Library

[5]

G. W. Cobb and Y.-P. Chen. An application of markov chain monte carlo to community ecology. American Mathematical Monthly, (110):264--288, 2003.

[6]

Mikael Fortelius and Jussi Eronen. Now - neogene of the old world. http://www.helsinki.fi/science/now/, 2005. Database of fossil mammals.

[7]

Arianna Gallo, Tijl Bie, and Nello Cristianini. Mini: Mining informative non-redundant itemsets. In Proc. of the 11th European conference on Principles and Practice of Knowledge Discovery in Databases, 2007.

[8]

Charles J. Geyer. Markov chain monte carlo maximum likelihood. In Computing Science and Statistics: Proc. of 23rd Symposium on the Interface Foundation, 1991.

[9]

Aristides Gionis, Heikki Mannila, Taneli Mielikainen, and Panayiotis Tsaparas. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3), 2007.

Digital Library

[10]

Phillip Good. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer-Verlag, 2000.

[11]

S. Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 1979.

[12]

Szymon Jaroszewicz. Interactive hmm construction based on interesting sequences. In Local Patterns to Global Models (LeGo'08) Workshop at the the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2008.

[13]

David Jensen. Knowledge discovery through induction with randomization testing. In Proc. of the 1991 Knowledge Discovery in Databases Workshop, 1991.

[14]

N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. Journal of Chemical Physics, 21(1):1087--91, 1953.

[15]

Taneli Mielikainen and Heikki Mannila. The pattern ordering problem. Lecture Notes in Computer Science, 2838:327--338, 2003.

[16]

Markus Ojala, Niko Vuokko, Aleksi Kallio, Niina Haiminen, and Heikki Mannila. Randomization of real-valued matrices for assessing the significance of data mining results. In Proc. of the 2008 SIAM International Conference on Data Mining, 2008.

[17]

Christos H. Papadimitriou. Computational Complexity. Addison-Wesley, 1994.

[18]

Geoffrey I. Webb. Discovering significant patterns. Mach. Learn., 68(1):1--33, 2007.

[19]

Geoffrey I. Webb. Layered critical values: A powerful direct-adjustment approach to discovering significant patterns. Machine Learning, 71(2{3):307--323, 2008.

Digital Library

[20]

Peter H. Westfall and S. Stanley Young. Resampling-based multiple testing: examples and methods for p-value adjustment. Wiley, 1993.

Cited By

Delacroix TLenca PLallich S(2022)What to expect from a set of itemsets?Information Sciences: an International Journal10.1016/j.ins.2021.12.115593:C(314-340)Online publication date: 1-May-2022
https://dl.acm.org/doi/10.1016/j.ins.2021.12.115
Galbrun E(2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
https://doi.org/10.1007/s10618-022-00846-z
Oikarinen EPuolamäki KKhoshrou SPechenizkiy M(2020)Supervised Human-Guided Data ExplorationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-43823-4_8(85-101)Online publication date: 28-Mar-2020
https://doi.org/10.1007/978-3-030-43823-4_8
Show More Cited By

Index Terms

Tell me something I don't know: randomization strategies for iterative data mining

Recommendations

Tell me what i need to know: succinctly summarizing data with itemsets
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining

Data analysis is an inherently iterative process. That is, what we know about the data greatly determines our expectations, and hence, what result we would find the most interesting. With this in mind, we introduce a well-founded approach for succinctly ...
A further study on inverse frequent set mining
ADMA'05: Proceedings of the First international conference on Advanced Data Mining and Applications

Frequent itemset mining is a common task in data mining from which association rules are derived. As the frequent itemsets can be considered as a kind of summary of the original databases, recently the inverse frequent set mining problem has received ...
An Efficient Rigorous Approach for Identifying Statistically Significant Frequent Itemsets

As advances in technology allow for the collection, storage, and analysis of vast amounts of data, the task of screening and assessing the significance of discovered patterns is becoming a major challenge in data mining applications. In this work, we ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

June 2009

1426 pages

ISBN:9781605584959

DOI:10.1145/1557019

General Chairs:
John Elder
Elder Research, Inc., USA
,
Françoise Soulié Fogelman
KXEN, France
,
Program Chairs:
Peter Flach
University of Bristol, UK
,
Mohammed Zaki
RPI, USA

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 28 June 2009

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

KDD09

Sponsor:

KDD09: The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

June 28 - July 1, 2009

Paris, France

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Upcoming Conference

KDD '25

Sponsor:
sigkdd
sigkdd

The 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 3 - 7, 2025

Toronto , ON , Canada

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

55
Total Citations
View Citations
1,018
Total Downloads

Downloads (Last 12 months)12
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Delacroix TLenca PLallich S(2022)What to expect from a set of itemsets?Information Sciences: an International Journal10.1016/j.ins.2021.12.115593:C(314-340)Online publication date: 1-May-2022
https://dl.acm.org/doi/10.1016/j.ins.2021.12.115
Galbrun E(2022)The minimum description length principle for pattern mining: a surveyData Mining and Knowledge Discovery10.1007/s10618-022-00846-z36:5(1679-1727)Online publication date: 4-Jul-2022
https://doi.org/10.1007/s10618-022-00846-z
Oikarinen EPuolamäki KKhoshrou SPechenizkiy M(2020)Supervised Human-Guided Data ExplorationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-43823-4_8(85-101)Online publication date: 28-Mar-2020
https://doi.org/10.1007/978-3-030-43823-4_8
Kang BPuolamaki KLijffijt JDe Bie T(2019)A Constrained Randomization Approach to Interactive Visual Data Exploration with Subjective FeedbackIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2019.2907082(1-1)Online publication date: 2019
https://doi.org/10.1109/TKDE.2019.2907082
Puolamäki KOikarinen EKang BLijffijt JDe Bie T(2019)Interactive visual data exploration with subjective feedback: an information-theoretic approachData Mining and Knowledge Discovery10.1007/s10618-019-00655-xOnline publication date: 3-Oct-2019
https://doi.org/10.1007/s10618-019-00655-x
Hämäläinen WWebb G(2019)A tutorial on statistically sound pattern discoveryData Mining and Knowledge Discovery10.1007/s10618-018-0590-x33:2(325-377)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s10618-018-0590-x
Makhalova TKuznetsov SNapoli A(2019)On Coupling FCA and MDL in Pattern MiningFormal Concept Analysis10.1007/978-3-030-21462-3_23(332-340)Online publication date: 23-May-2019
https://doi.org/10.1007/978-3-030-21462-3_23
Henelius AOikarinen EPuolamäki K(2019)Tiler: Software for Human-Guided Data ExplorationMachine Learning and Knowledge Discovery in Databases10.1007/978-3-030-10997-4_49(672-676)Online publication date: 18-Jan-2019
https://doi.org/10.1007/978-3-030-10997-4_49
Wu HSun MMi PTatti NNorth CRamakrishnan N(2018)Interactive Discovery of Coordinated Relationship Chains with Maximum Entropy ModelsACM Transactions on Knowledge Discovery from Data10.1145/304701712:1(1-34)Online publication date: 31-Jan-2018
https://dl.acm.org/doi/10.1145/3047017
Kalofolias JBoley MVreeken J(2017)Efficiently Discovering Locally Exceptional Yet Globally Representative Subgroups2017 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2017.29(197-206)Online publication date: Nov-2017
https://doi.org/10.1109/ICDM.2017.29
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents