Research article
DOI: 10.1145/3419604.3419772

An implementation of a multivariate discretization for supervised learning using Forestdisc

Published: 08 November 2020

Abstract

Discretization is a key pre-processing step in machine learning that transforms continuous attributes into discrete ones through various methods available in the literature. In this regard, this work provides the ForestDisc framework, which discretizes data following a supervised, multivariate, and hybrid approach. It first uses a splitting process relying on a tree learning ensemble to generate a large set of cut points. It then uses a merging process based on moment matching optimization to transform this set into a reduced, representative one. ForestDisc is a non-parametric discretizer in the sense that it does not require the user to provide any initial setting parameters. We implemented the ForestDisc algorithm in the "ForestDisc" R package.
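The two-stage approach can be illustrated with a short R sketch (R being the language of the accompanying package). This is a minimal sketch of the idea only, not the ForestDisc implementation itself: the use of the randomForest package, the Petal.Length attribute, k = 3 cut points, and a squared-error moment loss minimized with Nelder-Mead are all assumptions made for illustration.

## Stage 1: harvest every cut point a random forest uses for one attribute.
library(randomForest)
data(iris)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 50)

harvest_splits <- function(forest, var_name) {
  splits <- c()
  for (k in seq_len(forest$ntree)) {
    tr <- getTree(forest, k = k, labelVar = TRUE)
    splits <- c(splits, tr[tr[, "split var"] %in% var_name, "split point"])
  }
  splits
}
cand <- harvest_splits(rf, "Petal.Length")

## Stage 2: reduce the candidates to k representative cut points by
## matching the mean and the 2nd-4th central moments of the candidate set.
moments <- function(x) c(mean(x), sapply(2:4, function(r) mean((x - mean(x))^r)))
target <- moments(cand)
loss   <- function(cuts) sum((moments(cuts) - target)^2)
k      <- 3
start  <- as.numeric(quantile(cand, probs = (1:k) / (k + 1)))
fit    <- optim(start, loss, method = "Nelder-Mead")
sort(fit$par)   # representative cut points for Petal.Length

The package itself is available on CRAN and can be installed with install.packages("ForestDisc"); its documentation describes the exported interface.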

Cited By

  • (2022) A novel approach for discretizing continuous attributes based on tree ensemble and moment matching optimization. International Journal of Data Science and Analytics 14(1), 45-63. DOI: 10.1007/s41060-022-00316-1. Online publication date: 7-Mar-2022.
  • (2022) Tuning ForestDisc Hyperparameters: A Sensitivity Analysis. Optimization and Learning, 25-36. DOI: 10.1007/978-3-031-22039-5_3. Online publication date: 11-Dec-2022.

    Published In

    SITA'20: Proceedings of the 13th International Conference on Intelligent Systems: Theories and Applications
    September 2020
    333 pages
    ISBN:9781450377331
    DOI:10.1145/3419604

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Data Preprocessing
    2. Moment Matching
    3. Multivariate Discretization
    4. Random Forest
    5. Split Points Selection
    6. Tree Ensembles

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    SITA'20: 13th International Conference on Intelligent Systems: Theories and Applications
    September 23 - 24, 2020
    Rabat, Morocco
