Graph clustering-based discretization of splitting and merging methods (GraphS and GraphM)

Authors:

Kittakorn Sriwanna,

Tossapon Boongoen,

Natthakan Iam-On

Human-centric Computing and Information Sciences, Volume 7, Issue 1

Article No.: 103, Pages 1 - 39

https://doi.org/10.1186/s13673-017-0103-8

Published: 01 December 2017

Abstract

Discretization plays a major role as a data preprocessing technique in machine learning and data mining. Recent studies have focused on multivariate discretization, which considers relations among attributes. The general goal of this approach is to obtain discrete data that preserve most of the semantics exhibited by the original continuous data. However, many techniques generate discrete data that may be less useful because the natural groups of the data are not maintained. This paper presents a novel graph clustering-based discretization algorithm that encodes different similarity measures into a graph representation of the examined data. This allows more refined data-wise relations to be obtained and used with an effective graph clustering technique based on normalized association to discover natural groups accurately. The approach is evaluated empirically on 30 standard datasets and 20 imbalanced datasets, and compared with 11 well-known discretization algorithms using 4 classifiers. The results suggest that the new approach is able to preserve the natural groups and usually achieves better classifier performance and a more desirable number of intervals than the comparative methods.
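As a rough illustration of the normalized-association criterion mentioned in the abstract, the sketch below scores each candidate cut point of one attribute by Nassoc(A, B) = assoc(A, A)/assoc(A, V) + assoc(B, B)/assoc(B, V) on a similarity graph built from all attributes, and keeps the best one. This is not the authors' GraphS or GraphM procedure: the Gaussian similarity, the `sigma` parameter, the exhaustive search over midpoints, and every function name here are assumptions made for illustration only.

```python
import math


def gaussian_similarity(rows, sigma=1.0):
    """Dense similarity matrix w[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    The Gaussian kernel is an assumed choice; the paper combines several
    similarity measures that are not reproduced here.
    """
    n = len(rows)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
            w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w


def normalized_association(w, group_a, group_b):
    """Nassoc(A, B) = assoc(A, A)/assoc(A, V) + assoc(B, B)/assoc(B, V)."""
    def assoc(src, dst):
        return sum(w[i][j] for i in src for j in dst)

    everything = group_a + group_b
    return (assoc(group_a, group_a) / assoc(group_a, everything)
            + assoc(group_b, group_b) / assoc(group_b, everything))


def best_cut_point(rows, attr, sigma=1.0):
    """Return (cut, score) for the midpoint of attribute `attr` that
    maximizes normalized association between the two sides of the cut."""
    w = gaussian_similarity(rows, sigma)
    order = sorted(range(len(rows)), key=lambda i: rows[i][attr])
    best_score, best_cut = -1.0, None
    for k in range(1, len(order)):
        low = rows[order[k - 1]][attr]
        high = rows[order[k]][attr]
        if low == high:
            continue  # equal values cannot be separated by a cut point
        score = normalized_association(w, order[:k], order[k:])
        if score > best_score:
            best_score, best_cut = score, (low + high) / 2.0
    return best_cut, best_score


# Toy data (hypothetical): two well-separated groups in two attributes.
rows = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
        (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
cut, score = best_cut_point(rows, attr=0)
print(cut, score)
```

Because the similarity is computed over whole rows rather than the single attribute being cut, the chosen cut point reflects relations among attributes, which is the multivariate idea the abstract describes. The exhaustive search is cubic in the number of rows and is meant only for small examples.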


Index Terms
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
      2. Unsupervised learning


Published In

Human-centric Computing and Information Sciences, Volume 7, Issue 1

December 2017

729 pages

ISSN: 2192-1962

EISSN: 2192-1962

Copyright © 2017 The Author(s).

Publisher

Springer-Verlag, Berlin, Heidelberg


Cited By

Sandhu J, Verma A, Rana P (2020) An Expert Approach for Data Flow Prediction: Case Study of Wireless Sensor Networks. Wireless Personal Communications: An International Journal 112(1):325-352. Online publication date: 20-Jan-2020
https://dl.acm.org/doi/10.1007/s11277-020-07028-4

Khan M, Hayat H, Awan I (2019) Hybrid case-base maintenance approach for modeling large scale case-based reasoning systems. Human-centric Computing and Information Sciences 9(1):1-25. Online publication date: 1-Dec-2019
https://dl.acm.org/doi/10.1186/s13673-019-0171-z

Lee J, Moon D, Kim I, Lee Y (2019) A semantic approach to improving machine readability of a large-scale attack graph. The Journal of Supercomputing 75(6):3028-3045. Online publication date: 1-Jun-2019
https://dl.acm.org/doi/10.1007/s11227-018-2394-6

Chae J, Jin Y, Wen M, Zhang W, Sung Y, Cho K (2019) Genetic algorithm-based adaptive weight decision method for motion estimation framework. The Journal of Supercomputing 75(4):1909-1921. Online publication date: 1-Apr-2019
https://dl.acm.org/doi/10.1007/s11227-018-2247-3

Sriwanna K, Boongoen T, Iam-On N (2019) Graph clustering-based discretization approach to microarray data. Knowledge and Information Systems 60(2):879-906. Online publication date: 1-Aug-2019
https://dl.acm.org/doi/10.1007/s10115-018-1249-z
