Data discretization: taxonomy and big data challenge

Authors: Sergio Ramírez-Gallego, Salvador García, Héctor Mouriño-Talín, David Martínez-Rego, Verónica Bolón-Canedo, Amparo Alonso-Betanzos, José Manuel Benítez, Francisco Herrera

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 6, Issue 1

Pages 5 - 21

https://doi.org/10.1002/widm.1173

Published: 01 January 2016

Abstract

Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise data representations as categories that are adequate for the learning task while retaining as much of the information in the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques in conjunction with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few elementary approaches have been developed in the literature for Big Data. The purpose of this article is twofold: first, to present a comprehensive taxonomy of discretization techniques that helps practitioners use the algorithms; second, to demonstrate that standard discretization methods can be parallelized on Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the best-known discretizers based on information theory, the entropy minimization discretizer proposed by Fayyad and Irani, which obtains better results than the original. Our scheme goes beyond a simple parallelization and is intended to be the first to face the Big Data challenge. WIREs Data Mining Knowl Discov 2016, 6:5-21. doi: 10.1002/widm.1173
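
To make the idea of entropy-based discretization concrete, the following is a minimal single-machine sketch in Python of the core step: choosing the one cut point on a numeric attribute that minimizes the class entropy of the two resulting intervals. It is an illustration only, not the distributed Spark implementation proposed in the article. The function names `entropy` and `best_cut_point` are hypothetical, and the sketch leaves out the recursive splitting and the stopping criterion that a complete discretizer would need.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the best binary split.

    Returns None when no split is possible (all values are equal).
    Hypothetical helper for illustration; not the article's algorithm.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        # Only midpoints between distinct adjacent values are candidates.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Class entropy of the partition, weighted by interval size.
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[1]:
            best = (cut, score)
    return best


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(best_cut_point(values, labels))
```

For clarity the sketch recomputes the entropy of both intervals at every candidate, which is quadratic in the number of points. Keeping running class counts while scanning the sorted values avoids that cost and is the kind of per-attribute work a parallel version could distribute.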

Published In

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery Volume 6, Issue 1

January 2016

45 pages

EISSN: 1942-4795

Publisher

John Wiley & Sons, Inc.

United States
