Data discretization: taxonomy and big data challenge

Published: 01 January 2016

Abstract

Discretization of numerical data is one of the most influential data preprocessing tasks in knowledge discovery and data mining. The purpose of attribute discretization is to find concise representations of the data as categories that are adequate for the learning task while retaining as much of the information in the original continuous attribute as possible. In this article, we present an updated overview of discretization techniques together with a complete taxonomy of the leading discretizers. Despite the great impact of discretization as a data preprocessing technique, few approaches, and only elementary ones, have been developed in the literature for Big Data. The purpose of this article is twofold. First, it presents a comprehensive taxonomy of discretization techniques to help practitioners use the algorithms. Second, it aims to demonstrate that standard discretization methods can be parallelized on Big Data platforms such as Apache Spark, boosting both performance and accuracy. We thus propose a distributed implementation of one of the most well-known discretizers based on Information Theory, obtaining better results than those produced by the entropy minimization discretizer proposed by Fayyad and Irani. Our scheme goes beyond a simple parallelization and is intended to be the first to face the Big Data challenge. WIREs Data Mining Knowl Discov 2016, 6:5-21. doi: 10.1002/widm.1173
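To make the idea of entropy minimization discretization concrete, the following is a minimal single-machine sketch in Python. It is not the distributed Spark implementation proposed in the article. It assumes the usual formulation of the Fayyad and Irani method: pick the cut point that minimizes the class-weighted entropy of the two resulting intervals, accept it only if the information gain passes the MDLP test, and recurse on each side. All function names (`entropy`, `best_cut`, `mdlp_accepts`, `discretize`) are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) of the best binary split, or None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        # Candidate cuts lie only between distinct adjacent values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or weighted < best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint as the cut value
            best = (cut, weighted)
    return best


def mdlp_accepts(values, labels, cut):
    """MDLP test: accept the cut only if the gain outweighs its encoding cost."""
    left = [lab for v, lab in zip(values, labels) if v <= cut]
    right = [lab for v, lab in zip(values, labels) if v > cut]
    n = len(labels)
    ent, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) * ent_l + len(right) * ent_r) / n
    k, k_l, k_r = len(set(labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (log2(n - 1) + delta) / n


def discretize(values, labels):
    """Recursively collect accepted cut points for one attribute, in ascending order."""
    found = best_cut(values, labels)
    if found is None or not mdlp_accepts(values, labels, found[0]):
        return []
    cut = found[0]
    lo = [(v, lab) for v, lab in zip(values, labels) if v <= cut]
    hi = [(v, lab) for v, lab in zip(values, labels) if v > cut]
    return discretize(*zip(*lo)) + [cut] + discretize(*zip(*hi))
```

For an attribute whose values 1 to 20 are labelled `'a'` for the first ten and `'b'` for the rest, `discretize` returns a single cut point between 10 and 11. When the labels alternate, no cut passes the MDLP test and the attribute is left as one interval. This sketch rescans the data for every candidate cut, which is quadratic in the number of examples per attribute; a parallel version would have to distribute both the sorting and the candidate evaluation.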

References

[1]
Liu H, Hussain F, Lim Tan C, Dash M. Discretization: an enabling technique. Data Min Knowl Discov 2002, Volume 6: pp.393-423.
[2]
García S, Luengo J, Antonio Sáez J, López V, Herrera F. A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 2013, Volume 25: pp.734-750.
[3]
García S, Luengo J, Herrera F. Data Preprocessing in Data Mining. Germany: Springer; 2015.
[4]
Wu X, Kumar V, eds. The Top Ten Algorithms in Data Mining. USA: Chapman & Hall/CRC Data Mining and Knowledge Discovery; 2009.
[5]
Ross Quinlan J. C4.5: Programs for Machine Learning. USA: Morgan Kaufmann Publishers Inc.; 1993.
[6]
Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th Very Large Data Bases conference VLDB, Santiago de Chile, Chile, 1994, pages 487-499.
[7]
Yang Y, Webb GI. Discretization for Naïve-Bayes learning: managing discretization bias and variance. Mach Learn 2009, Volume 74: pp.39-74.
[8]
Hu H-W, Chen Y-L, Tang K. A dynamic discretization approach for constructing decision trees with a continuous label. IEEE Trans Knowl Data Eng 2009, Volume 21: pp.1505-1514.
[9]
Yang Y, Webb GI, Wu X. Discretization methods. In: Data Mining and Knowledge Discovery Handbook. Germany: Springer; 2010, pp.101-116.
[10]
Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng 2014, Volume 26: pp.97-107.
[11]
Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. In: San Francisco, CA, OSDI, 2004, pages 137-150.
[12]
Apache Hadoop Project. Apache Hadoop, 2015. {Online "https://hadoop.apache.org/"; Accessed March, 2015}.
[13]
White T. Hadoop, The Definitive Guide. USA: O'Reilly Media, Inc.; 2012.
[14]
Apache Spark: lightning-fast cluster computing. Apache spark, 2015. {Online "http://spark.apache.org/"; Accessed March, 2015}.
[15]
Hamstra M, Karau H, Zaharia M, Konwinski A, Wendell P. Learning Spark: Lightning-Fast Big Data Analytics. USA: O'Reilly Media, Incorporated; 2015.
[16]
Apache Mahout Project. Apache Mahout, 2015. {Online "http://mahout.apache.org/"; Accessed March, 2015}.
[17]
Machine Learning Library MLlib for Spark. Mllib, 2015. {Online "https://spark.apache.org/docs/1.2.0/mllib-guide.html"; Accessed March, 2015}.
[18]
Fayyad UM, Irani KB. Multi-interval discretization of continuous-valued attributes for classification learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence IJCAI, San Francisco, CA, 1993, pages 1022-1029.
[19]
Wong AKC, Chiu DKY. Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Trans Pattern Anal Mach Intell 1987, Volume 9: pp.796-805.
[20]
Chou PA. Optimal partitioning for classification and regression trees. IEEE Trans Pattern Anal Mach Intell 1991, Volume 13: pp.340-354.
[21]
Catlett J. On changing continuous attributes into ordered discrete attributes. In: European Working Session on Learning EWSL. Lecture Notes in Computer Science, vol. 482. Germany: Springer-Verlag; 1991, pp.164-178.
[22]
Kerber R. Chimerge: discretization of numeric attributes. In: National Conference on Artificial Intelligence American Association for Artificial Intelligence AAAI, San Jose, California, 1992, pages 123-128.
[23]
Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn 1993, Volume 11: pp.63-90.
[24]
Ching JY, Wong AKC, Chan KCC. Class-dependent discretization for inductive learning from continuous and mixed-mode data. IEEE Trans Pattern Anal Mach Intell 1995, Volume 17: pp.641-651.
[25]
Pfahringer B. Compression-based discretization of continuous attributes. In: Proceedings of the 12th International Conference on Machine Learning ICML, Tahoe City, California, 1995, pages 456-463.
[26]
Xindong W. A Bayesian discretizer for real-valued attributes. Comput J 1996, Volume 39: pp.688-691.
[27]
Friedman N, Goldszmidt M. Discretizing continuous attributes while learning Bayesian networks. In: Proceedings of the 13th International Conference on Machine Learning ICML, Bari, Italy, 1996, pages 157-165.
[28]
Chmielewski MR, Grzymala-Busse JW. Global discretization of continuous attributes as preprocessing for machine learning. Int J Approx Reason 1996, Volume 15: pp.319-331.
[29]
Ho KM, Scott PD. Zeta: a global method for discretization of continuous variables. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining KDD, Newport Beach, California, 1997, pages 191-194.
[30]
Cerquides J, De Mantaras RL. Proposal and empirical comparison of a parallelizable distance-based discretization method. In: Proceedings of the Third International Conference on Knowledge Discovery and Data Mining KDD, Newport Beach, California, 1997, pages 139-142.
[31]
Liu H, Setiono R. Feature selection via discretization. IEEE Trans Knowl Data Eng 1997, Volume 9: pp.642-645.
[32]
Hong SJ. Use of contextual information for feature ranking and discretization. IEEE Trans Knowl Data Eng 1997, Volume 9: pp.718-730.
[33]
Zighed DA, Rabaséda S, Rakotomalala R. FUSINTER: a method for discretization of continuous attributes. Int J Unc Fuzz Knowl Based Syst 1998, Volume 6: pp.307-326.
[34]
Bay SD. Multivariate discretization for set mining. Knowl Inform Syst 2001, Volume 3: pp.491-512.
[35]
Tay FEH, Shen L. A modified chi2 algorithm for discretization. IEEE Trans Knowl Data Eng 2002, Volume 14: pp.666-670.
[36]
Giráldez R, Aguilar-Ruiz JS, Riquelme JC, Ferrer-Troyano FJ, Rodríguez-Baena DS. Discretization oriented to decision rules generation. In: Frontiers in Artificial Intelligence and Applications, vol. 82. Netherlands: IOS press; 2002, pp.275-279.
[37]
Boulle M. Khiops: a statistical discretization method of continuous attributes. Mach Learn 2004, Volume 55: pp.53-69.
[38]
Kurgan LA, Cios KJ. CAIM discretization algorithm. IEEE Trans Knowl Data Eng 2004, Volume 16: pp.145-153.
[39]
Chao-Ton S, Hsu J-H. An extended chi2 algorithm for discretization of real value attributes. IEEE Trans Knowl Data Eng 2005, Volume 17: pp.437-441.
[40]
Liu X, Wang H. A discretization algorithm based on a heterogeneity criterion. IEEE Trans Knowl Data Eng 2005, Volume 17: pp.1166-1173.
[41]
Mehta S, Parthasarathy S, Yang H. Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 2005, Volume 17: pp.1174-1185.
[42]
Boullé M. MODL: a Bayes optimal discretization method for continuous attributes. Mach Learn 2006, Volume 65: pp.131-165.
[43]
Au W-H, Chan KCC, Wong AKC. A fuzzy approach to partitioning continuous attributes for classification. IEEE Trans Knowl Data Eng 2006, Volume 18: pp.715-719.
[44]
Lee C-H. A Hellinger-based discretization method for numeric attributes in classification learning. Knowl Based Syst 2007, Volume 20: pp.419-425.
[45]
Wu QX, Bell DA, Prasad G, McGinnity TM. A distribution-index-based discretizer for decision-making with symbolic AI approaches. IEEE Trans Knowl Data Eng 2007, Volume 19: pp.17-28.
[46]
Ruiz FJ, Angulo C, Agell N. IDD: a supervised interval Distance-Based method for discretization. IEEE Trans Knowl Data Eng 2008, Volume 20: pp.1230-1238.
[47]
Tsai C-J, Lee C-I, Yang W-P. A discretization algorithm based on class-attribute contingency coefficient. Inform Sci 2008, Volume 178: pp.714-731.
[48]
González-Abril L, Cuberos FJ, Velasco F, Ortega JA. Ameva: an autonomous discretization algorithm. Expert Syst Appl 2009, Volume 36: pp.5327-5332.
[49]
Jin R, Breitbart Y, Muoh C. Data discretization unification. Knowl Inform Syst 2009, Volume 19: pp.1-29.
[50]
Li M, Deng S, Feng S, Fan J. An effective discretization based on class-attribute coherence maximization. Pattern Recognit Lett 2011, Volume 32: pp.1962-1973.
[51]
Gethsiyal Augasta M, Kathirvalavakumar T. A new discretization algorithm based on range coefficient of dispersion and skewness for neural networks classifier. Appl Soft Comput 2012, Volume 12: pp.619-625.
[52]
Shehzad K. EDISC: a class-tailored discretization technique for rule-based classification. IEEE Trans Knowl Data Eng 2012, Volume 24: pp.1435-1447.
[53]
Ferreira AJ, Figueiredo MAT. An unsupervised approach to feature discretization and selection. Pattern Recognit 2012, Volume 45: pp.3048-3060.
[54]
Kurtcephe M, Altay Güvenir H. A discretization method based on maximizing the area under receiver operating characteristic curve. Intern J Pattern Recognit Artif Intell 2013, Volume 27.
[55]
Ferreira AJ, Figueiredo MAT. Incremental filter and wrapper approaches for feature discretization. Neurocomputing 2014, Volume 123: pp.60-74.
[56]
Yan D, Liu D, Sang Y. A new approach for discretizing continuous attributes in learning systems. Neurocomputing 2014, Volume 133: pp.507-511.
[57]
Sang Y, Qi H, Li K, Jin Y, Yan D, Gao S. An effective discretization method for disposing high-dimensional data. Inform Sci 2014, Volume 270: pp.73-91.
[58]
Nguyen H-V, Müller E, Vreeken J, Böhm K. Unsupervised interaction-preserving discretization of multivariate data. Data Min Knowl Discov 2014, Volume 28: pp.1366-1397.
[59]
Jiang F, Sui Y. A novel approach for discretization of continuous attributes in rough set theory. Knowl Based Syst 2015, Volume 73: pp.324-334.
[60]
Moskovitch R, Shahar Y. Classification-driven temporal discretization of multivariate time series. Data Min Knowl Discov 2015, Volume 29: pp.871-913.
[61]
Ramírez-Gallego S, García S, Benítez JM, Herrera F. Multivariate discretization based on evolutionary cut points selection for classification. IEEE Trans Cybern. In press.
[62]
Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mob Comput Commun Rev 2001, Volume 5: pp.3-55.
[63]
Alcalá-Fdez J, Sánchez L, García S, del Jesus MJ, Ventura S, Garrell JM, Otero J, Romero C, Bacardit J, Rivas VM, et al. KEEL: a software tool to assess evolutionary algorithms for data mining problems. Soft Comput 2009, Volume 13: pp.307-318.
[64]
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. On the effectiveness of discretization on gene selection of microarray data. In: International Joint Conference on Neural Networks, IJCNN 2010, Barcelona, Spain, 18-23 July, 2010, pages 1-8.
[65]
Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. Feature selection and classification in multiple class datasets: an application to KDD Cup 99 dataset. Expert Syst Appl May 2011, Volume 38: pp.5947-5957.
[66]
Cano A, Ventura S, Cios KJ. Scalable CAIM discretization on multiple GPUs using concurrent kernels. J Supercomput 2014, Volume 69: pp.273-292.
[67]
Zhang Y, Cheung Y-M. Discretizing numerical attributes in decision tree for big data analysis. In: ICDM Workshops, Shenzhen, China, 2014, pages 1150-1157.
[68]
Beyer MA, Laney D. 3D data management: controlling data volume, velocity and variety, 2001. {Online "http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf"; Accessed March, 2015}.
[69]
Fernández A, del Río S, López V, Bawakid A, del Jesús MJ, Benítez JM, Herrera F. Big data with cloud computing: an insight on the computing environment, mapreduce, and programming frameworks. WIREs Data Min Knowl Discov 2014, Volume 4: pp.380-409.
[70]
Lin J. Mapreduce is good enough? If all you have is a hammer, throw away everything that's not a nail! CoRR 2012, abs/1209.2191.
[71]
Rio S, Lopez V, Benitez JM, Herrera F. On the use of mapreduce for imbalanced big data using random forest. Inform Sci 2014, Volume 285: pp.112-137.
[72]
Chang C-C, Lin C-J. LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2011, Volume 2: pp.1-27. Datasets available at "http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/".
[73]
Duda RO, Hart PE. Pattern Classification and Scene Analysis, vol. 3. New York: John Wiley & Sons; 1973.
[74]
Quinlan JR. Induction of decision trees. In: Shavlik JW, Dietterich TG, eds. Readings in Machine Learning. Burlington, MA: Morgan Kaufmann Publishers; 1990. Originally published in Machine Learning 1:81-106, 1986.


Published In

Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, Volume 6, Issue 1
January 2016
45 pages

Publisher

John Wiley & Sons, Inc.
United States
