[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
article

Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM)

Published: 01 December 2017 Publication History

Abstract

Discretization plays a major role as a data preprocessing technique used in machine learning and data mining. Recent studies have focused on multivariate discretization that considers relations among attributes. The general goal of this method is to obtain the discrete data, which preserves most of the semantics exhibited by original continuous data. However, many techniques generate the final discrete data that may be less useful with natural groups of data not being maintained. This paper presents a novel graph clustering-based discretization algorithm that encodes different similarity measures into a graph representation of the examined data. The intuition allows more refined data-wise relations to be obtained and used with the effective graph clustering technique based on normalized association to discover nature graphs accurately. The goodness of this approach is empirically demonstrated over 30 standard datasets and 20 imbalanced datasets, compared with 11 well-known discretization algorithms using 4 classifiers. The results suggest the new approach is able to preserve the natural groups and usually achieve the efficiency in terms of classifier performance, and the desired number of intervals than the comparative methods.

References

[1]
Han J, Kamber M, Pei J (2011) Data mining: concepts and techniques, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco
[2]
Sriwanna K, Puntumapon K, Waiyamai K (2012) An enhanced class-attribute interdependence maximization discretization algorithm. Springer, Berlin
[3]
Yang P, Li J-S, Huang Y-X (2011) Hdd: a hypercube division-based algorithm for discretisation. Int J Syst Sci 42(4):557---566
[4]
Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491---512
[5]
de Sá CR, Soares C, Knobbe A (2016) Entropy-based discretization methods for ranking data. Information Sciences 329:921---936 (special issue on Discovery Science)
[6]
Ramírez-Gallego S, García S, Mouriño-Talín H, Martínez-Rego D, Bolón-Canedo V, Alonso-Betanzos A, Benítez JM, Herrera F (2016) Data discretization: taxonomy and big data challenge. Wiley Interdiscip Rev 6(1):5---21
[7]
Garcia S, Luengo J, Sáez JA, López V, Herrera F (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734---750
[8]
Sang Y, Li K (2012) Combining univariate and multivariate bottom-up discretization. Multiple-Valued Logic and Soft Computing 20(1---2):161---187
[9]
Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique. Data Min Knowl Discov 6(4):393---423
[10]
Dougherty J, Kohavi R, Sahami M et al (1995) Supervised and unsupervised discretization of continuous features. In: Machine learning: proceedings of the Twelfth international conference, vol 12, pp 194---202
[11]
Kerber R (1992) Chimerge: discretization of numeric attributes. In: Proceedings of the tenth national conference on artificial intelligence. Aaai Press, San Jose, pp 123---128
[12]
Liu H, Setiono R (1997) Feature selection via discretization. IEEE Trans Knowl Data Eng 9(4):642---645
[13]
Tay FE, Shen L (2002) A modified chi2 algorithm for discretization. IEEE Trans Knowl Data Eng 14(3):666---670
[14]
Sang Y, Qi H, Li K, Jin Y, Yan D, Gao S (2014) An effective discretization method for disposing high-dimensional data. Inf Sci 270:73---91
[15]
Kurgan LA, Cios KJ (2004) Caim discretization algorithm. IEEE Trans Knowl Data Eng 16(2):145---153
[16]
Cano A, Nguyen DT, Ventura S, Cios KJ (2016) ur-caim: improved caim discretization for unbalanced and balanced data. Soft Computing 20(1):173---188
[17]
Ching JY, Wong AK, Chan KCC (1995) Class-dependent discretization for inductive learning from continuous and mixed-mode data. IEEE Trans Pattern Anal Mach Intell 17(7):641---651
[18]
Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th international joint conference on artificial intelligence. Chambéry, France, 28 Aug---3 Sept 1993, pp 1022---1029
[19]
Catlett J (1991) On changing continuous attributes into ordered discrete attributes. In: Kodratoff Y. (eds) Machine Learning -- EWSL-91. EWSL 1991. Lecture notes in computer science (Lecture notes in artificial intelligence), vol 482. Springer, Berlin
[20]
Zeinalkhani M, Eftekhari M (2014) Fuzzy partitioning of continuous attributes through discretization methods to construct fuzzy decision tree classifiers. Inf Sci 278:715---735
[21]
Yang Y, Webb GI (2009) Discretization for naive-bayes learning: managing discretization bias and variance. Mach Learn 74(1):39---74
[22]
Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. Springer, Berlin
[23]
Gupta A, Mehrotra KG, Mohan C (2010) A clustering-based discretization for supervised learning. Stat Probab Lett 80(9):816---824
[24]
Singh GK, Minz S (2007) Discretization using clustering and rough set theory. In: International conference on computing: theory and applications, 2007. ICCTA'07. IEEE, New York, pp 330---336
[25]
Hartigan JA, Wong MA (1979) Algorithm as 136: a k-means clustering algorithm. Appl Stat 28:100---108
[26]
Ertoz L, Steinbach M, Kumar V (2002) A new shared nearest neighbor clustering algorithm and its applications. In: Workshop on clustering high dimensional data and its applications at 2nd SIAM international conference on data mining, pp 105---115
[27]
Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 96:226---231
[28]
Sriwanna K, Boongoen T, Iam-On N (2016) In: Lavangnananda K, Phon-Amnuaisuk S, Engchuan W, Chan JH (eds) An enhanced univariate discretization based on cluster ensembles. Springer, Cham, pp 85---98
[29]
Iam-On N, Boongoen T, Garrett S, Price C (2011) A link-based approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell 33(12):2396---2409
[30]
Huang X, Zheng X, Yuan W, Wang F, Zhu S (2011) Enhanced clustering of biomedical documents using ensemble non-negative matrix factorization. Inf Sci 181(11):2293---2302
[31]
Ramirez-Gallego S, Garcia S, Benitez JM, Herrera F (2016) Multivariate discretization based on evolutionary cut points selection for classification. IEEE Transactions on Cybernetics 46(3):595---608
[32]
Parashar A, Gulati Y (2012) Survey of di erent partition clustering algorithms and their comparative studies. International Journal of Advanced Research in Computer Science 3(3):675---680
[33]
Brandes U, Gaertler M, Wagner D (2007) Engineering graph clustering: models and experimental evaluation. ACM J Exp Algorithm 12(1.1):1---26
[34]
Van Dongen SM (2001) Graph clustering by ow simulation. PhD thesis, University of Utrecht
[35]
Schaeffer SE (2007) Graph clustering. Comput Sci Rev 1(1):27---64
[36]
Foggia P, Percannella G, Sansone C, Vento M (2009) Benchmarking graph-based clustering algorithms. Image Vis Comput 27(7):979---988
[37]
Zhou Y, Cheng H, Yu JX (2009) Graph clustering based on structural/attribute similarities. Proc VLDB Endow 2(1):718---729
[38]
Cheng H, Zhou Y, Yu JX (2011) Clustering large attributed graphs: a balance between structural and attribute similarities. ACM Trans Knowl Discov Data 5(2):12
[39]
Nascimento MC, De Carvalho AC (2011) Spectral methods for graph clustering-a survey. Eur J Oper Res 211(2):221---231
[40]
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888---905
[41]
Foggia P, Percannella G, Sansone C, Vento M (2007) In: Escolano F, Vento M (eds) Assessing the performance of a graph-based clustering algorithm. Springer, Berlin, pp 215---227
[42]
Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30(7):1575---1584
[43]
Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51(3):497---515
[44]
Brandes U, Gaertler M, Wagner D (2003) Experiments on graph clustering algorithms. Springer, Berlin, pp 568---579
[45]
Kong W, Hu S, Zhang J, Dai G (2013) Robust and smart spectral clustering from normalized cut. Neural Comput Appl 23(5):1503---1512
[46]
Sen D, Gupta N, Pal SK (2013) Incorporating local image structure in normalized cut based graph partitioning for grouping of pixels. Inf Sci 248:214---238
[47]
Cha S-H (2007) Comprehensive survey on distance/similarity measures between probability density functions. City 1(2):1
[48]
Everitt B, Landau S, Leese M (1993) Cluster analysis (Edward Arnold, London). ISBN 0-470-22043-0
[49]
Soman KP, Diwakar S, Ajay V (2006) Data mining: theory and practice {with CD}. PHI Learn
[50]
Chapanond A (2007) Application aspects of data mining analysis on evolving graphs. PhD thesis, Troy
[51]
Boutin F, Hascoet M (2004) Cluster validity indices for graph partitioning. In: Proceedings, eighth international conference on information visualisation, 2004. IV 2004. IEEE, New York, pp 376---381
[52]
Dua S, Chowriappa P (2012) Data mining for bioinformatics. CRC Press, Boca Raton
[53]
Görke R, Kappes A, Wagner D (2014) Experiments on density-constrained graph clustering. J Exp Algorithmics 19:6
[54]
Leighton T, Rao S (1988) An approximate max-flow min-cut theorem for uniform multicommodity flow problems with applications to approximation algorithms. In: 29th annual symposium on foundations of computer science, 1988. IEEE, New York, pp 422---431
[55]
Ding CH, He X, Zha H, Gu M, Simon HD (2001) A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings IEEE international conference on data mining, 2001, ICDM 2001. IEEE, New York, pp 107---114
[56]
Mohar B, Alavi Y (1991) The laplacian spectrum of graphs. Graph Theory Comb Appl 2:871---898
[57]
Von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17(4):395---416
[58]
Lichman M (2013) UCI machine learning repository
[59]
Alcalá J, Fernández A, Luengo J, Derrac J, García S, Sánchez L, Herrera F (2010) Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J Mult Valued Log Soft Comput 17(255---287):11
[60]
Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, Fernández JC, Herrera F (2009) Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 13(3):307---318
[61]
Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann Publishers Inc., San Francisco
[62]
Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6(1):37---66
[63]
John GH, Langley P (1995) Estimating continuous distributions in Bayesian classifiers. Proceedings of the eleventh conference on uncertainty in artificial intelligence. UAI'95. Morgan Kaufmann Publishers Inc., San Francisco, pp 338---345
[64]
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273---297
[65]
Wu X, Kumar V (2009) The top ten algorithms in data mining, 1st edn. Chapman & Hall/CRC, Boca Raton
[66]
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1---37
[67]
Kohavi R et al (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. Ijcai 14:1137---1145
[68]
Bradley AP (1997) The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145---1159
[69]
Huang J, Ling CX (2005) Using auc and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng 17(3):299---310
[70]
Ruan J, Jahid MJ, Gu F, Lei C, Huang YW, Hsu YT, Mutch DG, Chen CL, Kirma NB, Huang THM (2016) A novel algorithm for network-based prediction of cancer recurrence. Genomics.
[71]
Lv J, Peng Q, Chen X, Sun Z (2016) A multi-objective heuristic algorithm for gene expression microarray data classification. Expert Syst Appl 59:13---19
[72]
Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675---701
[73]
Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11(1):86---92
[74]
Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1---30
[75]
García S, Fernández A, Luengo J, Herrera F (2010) Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Inf Sci 180(10):2044---2064
[76]
Holm S (1979) A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6:65---70
[77]
Gonzalez-Abril L, Cuberos FJ, Velasco F, Ortega JA (2009) Ameva: an autonomous discretization algorithm. Expert Syst Appl 36(3):5327---5332
[78]
Tsai C-J, Lee C-I, Yang W-P (2008) A discretization algorithm based on class-attribute contingency coefficient. Inf Sci 178(3):714---731
[79]
Eshelman LJ (1991) The CHC adaptive search algorithm: how to have safe search when engaging in nontraditional genetic recombination. Found Genet Algorithms 1:265---283
[80]
Zighed DA, Rabaséda S, Rakotomalala R (1998) Fusinter: a method for discretization of continuous attributes. Int J Uncertain Fuzziness Knowl Based Syst 6(03):307---326
[81]
Wong AKC, Liu TS (1975) Typicality, diversity, and feature pattern of an ensemble. IEEE Trans Comput 100(2):158---181
[82]
Huang W (1997) Discretization of continuous attributes for inductive machine learning. Toledo, Department Computer Science, University of Toledo
[83]
Ho KM, Scott PD (1997) Zeta: a global method for discretization of continuous variables. In: Proceedings of the third international conference knowledge discovery and data mining (KDD97), pp 191---194
[84]
Healey J (2014) Statistics: a tool for social research. Cengage Learn

Cited By

View all
  • (2020)An Expert Approach for Data Flow Prediction: Case Study of Wireless Sensor NetworksWireless Personal Communications: An International Journal10.1007/s11277-020-07028-4112:1(325-352)Online publication date: 20-Jan-2020
  • (2019)Hybrid case-base maintenance approach for modeling large scale case-based reasoning systemsHuman-centric Computing and Information Sciences10.1186/s13673-019-0171-z9:1(1-25)Online publication date: 1-Dec-2019
  • (2019)A semantic approach to improving machine readability of a large-scale attack graphThe Journal of Supercomputing10.1007/s11227-018-2394-675:6(3028-3045)Online publication date: 1-Jun-2019
  • Show More Cited By
  1. Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM)

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image Human-centric Computing and Information Sciences
      Human-centric Computing and Information Sciences  Volume 7, Issue 1
      December 2017
      729 pages
      ISSN:2192-1962
      EISSN:2192-1962
      Issue’s Table of Contents

      Publisher

      Springer-Verlag

      Berlin, Heidelberg

      Publication History

      Published: 01 December 2017

      Author Tags

      1. Data mining
      2. Graph clustering
      3. Multivariate discretization
      4. Normalized association
      5. Normalized cuts

      Qualifiers

      • Article

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)0
      • Downloads (Last 6 weeks)0
      Reflects downloads up to 04 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2020)An Expert Approach for Data Flow Prediction: Case Study of Wireless Sensor NetworksWireless Personal Communications: An International Journal10.1007/s11277-020-07028-4112:1(325-352)Online publication date: 20-Jan-2020
      • (2019)Hybrid case-base maintenance approach for modeling large scale case-based reasoning systemsHuman-centric Computing and Information Sciences10.1186/s13673-019-0171-z9:1(1-25)Online publication date: 1-Dec-2019
      • (2019)A semantic approach to improving machine readability of a large-scale attack graphThe Journal of Supercomputing10.1007/s11227-018-2394-675:6(3028-3045)Online publication date: 1-Jun-2019
      • (2019)Genetic algorithm-based adaptive weight decision method for motion estimation frameworkThe Journal of Supercomputing10.1007/s11227-018-2247-375:4(1909-1921)Online publication date: 1-Apr-2019
      • (2019)Graph clustering-based discretization approach to microarray dataKnowledge and Information Systems10.1007/s10115-018-1249-z60:2(879-906)Online publication date: 1-Aug-2019

      View Options

      View options

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media