research-article

The optimal combination of feature selection and data discretization: An empirical study

Authors:

Chih-Fong Tsai,

Yu-Chi Chen

Volume 505, Issue C

Pages 282 - 293

https://doi.org/10.1016/j.ins.2019.07.091

Published: 01 December 2019

Highlights

• The best combination of feature selection and discretization is identified.
• Two different orderings for combining the two data pre-processing steps are considered.
• Three feature selection methods and four discretizers are used for performance comparison.
• Performing MDLP first and C4.5 second is the better choice for the SVM classifier.
• Performing C4.5 first and MDLP second is recommended for the DT classifier.
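
The two orderings named in the highlights can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' experimental setup: `equal_width` stands in for the discretizer, `class_mean_gap` is a deliberately simple, scale-sensitive filter score standing in for the feature selector, and all function names and the toy data are hypothetical. None of PCA, GA, C4.5, MDLP, or ChiMerge is implemented here.

```python
from statistics import mean


def equal_width(column, n_bins=3):
    """Stand-in discretizer: map a numeric column to equal-width bin indices."""
    lo, hi = min(column), max(column)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 on constant columns
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def class_mean_gap(column, labels):
    """Stand-in filter score: absolute gap between the two class means."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    return abs(mean(pos) - mean(neg))


def select(columns, labels, k):
    """Return the indices of the k highest-scoring columns, in original order."""
    ranked = sorted(
        range(len(columns)),
        key=lambda j: class_mean_gap(columns[j], labels),
        reverse=True,
    )
    return sorted(ranked[:k])


def select_then_discretize(columns, labels, k):
    """Ordering 1: score the raw columns, then bin only the kept ones."""
    kept = select(columns, labels, k)
    return kept, [equal_width(columns[j]) for j in kept]


def discretize_then_select(columns, labels, k):
    """Ordering 2: bin every column, then score the binned values."""
    binned = [equal_width(c) for c in columns]
    kept = select(binned, labels, k)
    return kept, [binned[j] for j in kept]


# Toy data (hypothetical): two features on very different scales, binary labels.
labels = [0, 0, 1, 1]
columns = [
    [100.0, 300.0, 200.0, 400.0],  # large scale, classes overlap
    [0.1, 0.2, 0.8, 0.9],          # small scale, classes well separated
]

kept_a, binned_a = select_then_discretize(columns, labels, k=1)
kept_b, binned_b = discretize_then_select(columns, labels, k=1)
print(kept_a, binned_a)  # [0] [[0, 2, 1, 2]]
print(kept_b, binned_b)  # [1] [[0, 0, 2, 2]]
```

With this toy score, the raw-scale ranking keeps the first feature while the ranking on binned values keeps the second, so the order of the two steps changes the resulting dataset. The second ordering also has to discretize every column before any is discarded, which hints at why computational time can differ between orderings.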

Abstract

Feature selection and data discretization are two important data pre-processing steps in data mining: the former focuses on filtering out unrepresentative features, and the latter on transforming continuous attributes into discrete ones. In the literature, these two problems have often been studied individually. However, the combination of these two steps has not been fully explored, although both feature selection and discretization may be required for some real-world datasets. In this paper, two different combination orders of feature selection and discretization are examined in terms of their classification accuracies and computational times. Specifically, filter, wrapper, and embedded feature selection methods are employed, namely PCA, GA, and C4.5, respectively. For discretization, both supervised and unsupervised discretizers are used, specifically MDLP, ChiMerge, equal frequency binning, and equal width binning. The experimental results, based on 10 UCI datasets, show that, for the SVM classifier, performing MDLP first and C4.5 second outperforms the other combinations: it not only requires less computational time but also provides the highest classification accuracy. For the decision tree classifier, performing C4.5 first and MDLP second is recommended.
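
The two unsupervised discretizers named in the abstract, equal width binning and equal frequency binning, are simple enough to sketch directly. The following is a minimal standard-library illustration, not the implementation used in the paper; the function names, the number of bins, and the sample values are assumptions made for the example. The supervised discretizers (MDLP and ChiMerge) additionally use class labels and are not shown.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points that split [min, max] evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def equal_frequency_edges(values, n_bins):
    """Return interior cut points so each bin holds roughly the same count."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def discretize(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical continuous attribute with two large values at the top.
ages = [18, 19, 20, 21, 22, 23, 40, 80]

width_bins = discretize(ages, equal_width_edges(ages, 4))
freq_bins = discretize(ages, equal_frequency_edges(ages, 4))
print(width_bins)  # [0, 0, 0, 0, 0, 0, 1, 3]
print(freq_bins)   # [0, 0, 1, 1, 2, 2, 3, 3]
```

On skewed data such as this, equal width binning places most values in the first bin and leaves one bin empty, whereas equal frequency binning spreads the values evenly across bins.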


Index Terms

The optimal combination of feature selection and data discretization: An empirical study
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning algorithms
      1. Feature selection
2. Information systems
  1. Information systems applications
    1. Data mining

Index terms have been assigned to the content through auto-classification.

Information & Contributors

Information

Published In

Information Sciences: an International Journal Volume 505, Issue C

Dec 2019

600 pages

ISSN:0020-0255

Elsevier Inc.

Publisher

Elsevier Science Inc.

United States

Qualifiers

Research-article

Bibliometrics

Total Citations: 7
Total Downloads: 0
Downloads (Last 12 months): 0
Downloads (Last 6 weeks): 0

Reflects downloads up to 10 Dec 2024

Citations

Cited By

Tan Y, Zhu H, Wu J, Chai H (2024) DPTVAE, Expert Systems with Applications: An International Journal 241:C, doi:10.1016/j.eswa.2023.122071. Online publication date: 1-May-2024
https://dl.acm.org/doi/10.1016/j.eswa.2023.122071
Fu R, Wu Y, Xu Q, Zhang M (2023) FEAST: A Communication-efficient Federated Feature Selection Framework for Relational Data, Proceedings of the ACM on Management of Data 1:1 (1-28), doi:10.1145/3588961. Online publication date: 30-May-2023
https://dl.acm.org/doi/10.1145/3588961
Said R, Elarbi M, Bechikh S, Coello Coello C, Said L (2023) Discretization-Based Feature Selection as a Bilevel Optimization Problem, IEEE Transactions on Evolutionary Computation 27:4 (893-907), doi:10.1109/TEVC.2022.3192113. Online publication date: 1-Aug-2023
https://dl.acm.org/doi/10.1109/TEVC.2022.3192113
Yang L, Shami A (2023) IoT data analytics in dynamic environments, Engineering Applications of Artificial Intelligence 116:C, doi:10.1016/j.engappai.2022.105366. Online publication date: 20-Jan-2023
https://dl.acm.org/doi/10.1016/j.engappai.2022.105366
Chaieb S, Mrad A, Hnich B (2023) Obsolete personal information update system: towards the prevention of falls in the elderly, Applied Intelligence 53:14 (18061-18084), doi:10.1007/s10489-022-04289-3. Online publication date: 1-Jul-2023
https://dl.acm.org/doi/10.1007/s10489-022-04289-3
Zhang S, Wang Y (2022) An improved software defect prediction model based on grey incidence analysis and Naive Bayes algorithm, Journal of Intelligent & Fuzzy Systems: Applications in Engineering and Technology 43:5 (6047-6060), doi:10.3233/JIFS-213570. Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.3233/JIFS-213570
Hishamuddin M, Hassan M, Mokhtar A (2020) Improving Classification Accuracy of Random Forest Algorithm Using Unsupervised Discretization with Fuzzy Partition and Fuzzy Set Intervals, Proceedings of the 2020 9th International Conference on Software and Computer Applications (99-104), doi:10.1145/3384544.3384590. Online publication date: 18-Feb-2020
https://dl.acm.org/doi/10.1145/3384544.3384590
