[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Finding Robust Itemsets under Subsampling

Published: 07 October 2014 Publication History

Abstract

Mining frequent patterns is plagued by the problem of pattern explosion, making pattern reduction techniques a key challenge in pattern mining. In this article we propose a novel theoretical framework for pattern reduction by measuring the robustness of a property of an itemset such as closedness or nonderivability. The robustness of a property is the probability that this property holds on random subsets of the original data. We study four properties, namely an itemset being closed, free, non-derivable, or totally shattered, and demonstrate how to compute the robustness analytically without actually sampling the data. Our concept of robustness has many advantages: Unlike statistical approaches for reducing patterns, we do not assume a null hypothesis or any noise model and, in contrast to noise-tolerant or approximate patterns, the robust patterns for a given property are always a subset of the patterns with this property. If the underlying property is monotonic then the measure is also monotonic, allowing us to efficiently mine robust itemsets. We further derive a parameter-free technique for ranking itemsets that can be used for top-k approaches. Our experiments demonstrate that we can successfully use the robustness measure to reduce the number of patterns and that ranking yields interesting itemsets.

References

[1]
R. Agrawal, T. Imielinski, and A. N. Swami. 1993. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'93). 207--216.
[2]
A. Asuncion and D. Newman. 2007. UCI machine learning repository. http://mlearn.ics.uci.edu/MLRepository.html.
[3]
J.-F. Boulicaut, A. Bykowski, and C. Rigotti. 2000. Approximation of frequency queries by means of free-sets. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD'00). 75--85.
[4]
J.-F. Boulicaut, A. Bykowski, and C. Rigotti. 2003. Free-sets: A condensed representation of boolean data for the approximation of frequency queries. Data Mining Knowl. Discov. 7, 1, 5--22.
[5]
S. Brin, R. Motwani, and C. Silverstein. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'97). 265--276.
[6]
B. Bringmann and A. Zimmermann. 2009. One in a million: Picking the right patterns. Knowl. Inf. Syst. 18, 1, 61--81.
[7]
T. Calders and B. Goethals. 2007. Non-derivable itemset mining. Data Mining Knowl. Discov. 14, 1, 171--206.
[8]
T. Calders, B. Goethals, and M. Mampaey. 2007. Mining itemsets in the presence of missing values. In Proceedings of the ACM Symposium on Applied Computing (SAC'07). 404--408.
[9]
T. Calders, C. Rigotti, and J.-F. Boulicaut. 2006. A survey on condensed representations for frequent sets. In Proceedings of the European Conference on Constraint-Based Mining and Inductive Databases. 64--80.
[10]
H. Cheng, X. Yan, J. Han, and C. Hsu. 2007. Discriminative frequent pattern analysis for effective classification. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE'07). 716--725.
[11]
H. Cheng, P. S. Yu, and J. Han. 2006. AC-close: Efficiently mining approximate closed itemsets by core pattern recovery. In Proceedings of the 6th International Conference on Data Mining (ICDM'06). 839--844.
[12]
F. Coenen. 2003. The lucs-kdd discretised/normalised arm and carm data library. https://cgi.csc.liv.ac.uk/∼frans/KDD/Software/LUCS-KDD-DN/OLDversion/lucs-kdd_DN.html.
[13]
T. de Bie. 2011. Maximum entropy models and subjective interestingness: An application to tiles in binary databases. Data Mining Knowl. Discov. 23, 3, 407--446.
[14]
A. Gallo, T. de Bie, and N. Cristianini. 2007. Mini: Mining informative non-redundant itemsets. In Proceedings of the 11th Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'07). 438--445.
[15]
F. Geerts, B. Goethals, and T. Mielikäinen. 2004. Tiling databases. In Discovery Science. Springer, 278--289.
[16]
G. Ghoshal and A.-L. Barabási. 2011. Ranking stability and super-stable nodes in complex networks. Nature Comm. 2.
[17]
A. Gionis, H. Mannila, T. Mielikäinen, and P. Tsaparas. 2007. Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data 1, 3.
[18]
B. Goethals and M. Zaki. 2003. FIMI'03, frequent itemset mining implementations. In Proceedings of the 3rd International Conference on Data Ming Workshops on Frequent Itemset Mining Implementations (FIMI'03). 1--13.
[19]
R. Gupta, G. Fang, B. Field, M. Steinbach, and V. Kumar. 2008. Quantitative evaluation of approximate frequent pattern mining algorithms. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08). 301--309.
[20]
S. Hanhijärvi, M. Ojala, N. Vuokko, K. Puolamäki, N. Tatti, and H. Mannila. 2009. Tell me something I don't know: Randomization strategies for iterative data mining. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'09). 379--388.
[21]
J. Hipp, U. Güntzer, and G. Nakhaeizadeh. 2000. Algorithms for association rule mining - A general survey and comparison. SIGKDD Explor. 2, 1, 58--64.
[22]
C. Lucchese, S. Orlando, and R. Perego. 2010. Mining top-k patterns from binary datasets in presence of noise. In Proceedings of the SIAM International Conference on Data Mining (SDM'10).
[23]
T. Mielikäinen. 2005. Transaction databases, frequent itemsets, and their condensed representations. In Proceedings of the 4th International Conference on Knowledge Discovery in Inductive Databases (KDID'05). 139--164.
[24]
G. Misra, B. Golshan, and E. Terzi. 2012. A framework for evaluating the smoothness of data-mining results. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML/PKDD'12). 660--675.
[25]
F. Moerchen, M. Thies, and A. Ultsch. 2010. Efficient mining of all margin-closed itemsets with applications in temporal knowledge discovery and classification by compression. Knowl. Inf. Syst. 29, 1, 55--80.
[26]
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. 1999. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT'99). 398--416.
[27]
J. Pei, J. Han, and L. V. S. Lakshmanan. 2001. Mining frequent itemsets with convertible constraints. In Proceedings of the International Conference on Data Engineering (ICDE'01). 433--442.
[28]
K. Smets and J. Vreeken. 2011. The odd one out: Identifying and characterising anomalies. In Proceedings of the SIAM International Conference on Data Mining (SDM'11).
[29]
N. Tatti. 2008. Maximum entropy based significance of itemsets. Knowl. Inf. Syst. 17, 1, 57--77.
[30]
N. Tatti and F. Moerchen. 2011. Finding robust itemsets under subsampling. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM'11). 705--714.
[31]
T. Uno and H. Arimura. 2007. An efficient polynomial delay algorithm for pseudo frequent itemset mining. In Discovery Science. Springer, 219--230.
[32]
J. Vreeken, M. van Leeuwen, and A. Siebes. 2011. Krimp: Mining itemsets that compress. Data Mining Knowl. Discov. 23, 1, 169--214.
[33]
K. Wang, C. Xu, and B. Liu. 1999. Clustering transactions using large items. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM'99). 483--490.
[34]
G. I. Webb. 2007. Discovering significant patterns. Mach. Learn. 68, 1, 1--33.
[35]
D. Xin, J. Han, X. Yan, and H. Cheng. 2005. Mining compressed frequent-pattern sets. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB'05). 709--720.
[36]
Y. Zhao and G. Karypis. 2002. Evaluation of hierarchical clustering algorithms for document datasets. In Proceedings of the 11th International Conference on Information and Knowledge Management (CIKM'02). 515--524.

Cited By

View all
  • (2024)Mining actionable concepts in concept lattice using Interestingness PropagationJournal of Computational Science10.1016/j.jocs.2023.10219675(102196)Online publication date: Jan-2024
  • (2024)Neural networks for intelligent multilevel control of artificial and natural objects based on data fusion: A surveyInformation Fusion10.1016/j.inffus.2024.102427110(102427)Online publication date: Oct-2024
  • (2022)Knowledge cores in large formal contextsAnnals of Mathematics and Artificial Intelligence10.1007/s10472-022-09790-690:6(537-567)Online publication date: 1-Jun-2022
  • Show More Cited By

Index Terms

  1. Finding Robust Itemsets under Subsampling

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Database Systems
    ACM Transactions on Database Systems  Volume 39, Issue 3
    September 2014
    264 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/2676651
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 07 October 2014
    Accepted: 01 February 2013
    Revised: 01 November 2012
    Received: 01 June 2012
    Published in TODS Volume 39, Issue 3

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Pattern reduction
    2. closed itemsets
    3. free itemsets
    4. non-derivable itemsets
    5. robust itemsets
    6. totally shattered itemsets

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)11
    • Downloads (Last 6 weeks)0
    Reflects downloads up to 17 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Mining actionable concepts in concept lattice using Interestingness PropagationJournal of Computational Science10.1016/j.jocs.2023.10219675(102196)Online publication date: Jan-2024
    • (2024)Neural networks for intelligent multilevel control of artificial and natural objects based on data fusion: A surveyInformation Fusion10.1016/j.inffus.2024.102427110(102427)Online publication date: Oct-2024
    • (2022)Knowledge cores in large formal contextsAnnals of Mathematics and Artificial Intelligence10.1007/s10472-022-09790-690:6(537-567)Online publication date: 1-Jun-2022
    • (2020)Formal Concept Analysis: From Knowledge Discovery to Knowledge ProcessingA Guided Tour of Artificial Intelligence Research10.1007/978-3-030-06167-8_13(411-445)Online publication date: 8-May-2020
    • (2019)Approximating concept stability using variance reduction techniquesDiscrete Applied Mathematics10.1016/j.dam.2019.03.002Online publication date: Mar-2019
    • (2018)On interestingness measures of formal conceptsInformation Sciences: an International Journal10.1016/j.ins.2018.02.032442:C(202-219)Online publication date: 1-May-2018
    • (2017)Efficient Mining of Subsample-Stable Graph Patterns2017 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM.2017.88(757-762)Online publication date: Nov-2017
    • (2016)Interactive data exploration with smart drill-down2016 IEEE 32nd International Conference on Data Engineering (ICDE)10.1109/ICDE.2016.7498300(906-917)Online publication date: May-2016
    • (2015)Fast Generation of Best Interval Patterns for Nonmonotonic ConstraintsMachine Learning and Knowledge Discovery in Databases10.1007/978-3-319-23525-7_10(157-172)Online publication date: 29-Aug-2015

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media