Research article
DOI: 10.1145/3419604.3419772

An implementation of a multivariate discretization for supervised learning using Forestdisc

Published: 08 November 2020

Abstract

Discretization is a key pre-processing step in machine learning that transforms continuous attributes into discrete ones through various methods available in the literature. In this regard, this work provides the ForestDisc framework, which discretizes data following a supervised, multivariate, and hybrid approach. It first uses a splitting process relying on a tree learning ensemble to generate a large set of cut points. It then uses a merging process based on moment matching optimization to transform this set into a reduced, representative one. ForestDisc is a non-parametric discretizer in the sense that it does not require the user to provide any initial setting parameters. We implemented the ForestDisc algorithm in the "ForestDisc" R package.
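The two-stage approach can be illustrated with a short R sketch (R being the language of the accompanying package). This is a minimal sketch of the idea only, not the ForestDisc implementation itself: the use of the randomForest package, the Petal.Length attribute, k = 3 cut points, and a squared-error moment loss minimized with Nelder-Mead are all assumptions made for illustration.

## Stage 1: harvest every cut point a random forest uses for one attribute.
library(randomForest)
data(iris)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 50)

harvest_splits <- function(forest, var_name) {
  splits <- c()
  for (k in seq_len(forest$ntree)) {
    tr <- getTree(forest, k = k, labelVar = TRUE)
    splits <- c(splits, tr[tr[, "split var"] %in% var_name, "split point"])
  }
  splits
}
cand <- harvest_splits(rf, "Petal.Length")

## Stage 2: reduce the candidates to k representative cut points by
## matching the mean and the 2nd-4th central moments of the candidate set.
moments <- function(x) c(mean(x), sapply(2:4, function(r) mean((x - mean(x))^r)))
target <- moments(cand)
loss   <- function(cuts) sum((moments(cuts) - target)^2)
k      <- 3
start  <- as.numeric(quantile(cand, probs = (1:k) / (k + 1)))
fit    <- optim(start, loss, method = "Nelder-Mead")
sort(fit$par)   # representative cut points for Petal.Length

The package itself is available on CRAN and can be installed with install.packages("ForestDisc"); its documentation describes the exported interface.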

Cited By

  • (2022) A novel approach for discretizing continuous attributes based on tree ensemble and moment matching optimization. International Journal of Data Science and Analytics 14(1), 45-63. DOI: 10.1007/s41060-022-00316-1. Online publication date: 7-Mar-2022.
  • (2022) Tuning ForestDisc Hyperparameters: A Sensitivity Analysis. Optimization and Learning, 25-36. DOI: 10.1007/978-3-031-22039-5_3. Online publication date: 11-Dec-2022.

    Published In

    SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications
    September 2020
    333 pages
    ISBN:9781450377331
    DOI:10.1145/3419604

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Data Preprocessing
    2. Moment Matching
    3. Multivariate Discretization
    4. Random Forest
    5. Split Points Selection
    6. Tree Ensembles

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SITA'20: 13th International Conference on Intelligent Systems: Theories and Applications
    September 23 - 24, 2020
    Rabat, Morocco
