Research article · SoICT Conference Proceedings · DOI: 10.1145/3568562.3568604

Empirical Analysis of Filter Feature Selection Criteria on Financial Datasets

Published: 01 December 2022

Abstract

High dimensionality is one of the data quality problems that affect the performance of machine learning models. Feature selection, which aims to identify and remove as many redundant and irrelevant features as possible, boosts the overall performance of the models while reducing the computational cost. However, choosing an appropriate feature selection method remains a challenge, as no single selection criterion fits all datasets. It is therefore essential to comparatively analyze the performance of feature selection criteria according to different characteristics of high-dimensional datasets, particularly large financial datasets whose features are highly correlated and redundant. In this paper, we explore nine feature selection criteria, typically categorized into two classes: (i) information-theoretic criteria and (ii) similarity-based criteria, over seven public financial datasets. To the best of our knowledge, no previous comprehensive empirical investigation has demonstrated the effects of feature selection criteria on financial data. Experimental results indicate that the information-theoretic methods suffer from high computation time on high-dimensional data (i.e., a large number of features), while the similarity-based methods require significant computation on high-volume datasets (i.e., a large number of samples).
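To make the information-theoretic class of filter criteria concrete, the sketch below ranks discrete features by their empirical mutual information with the class label, the quantity underlying criteria such as MIM and mRMR. The function name and the toy dataset are our own illustration, not material from the paper; real criteria in this family add redundancy terms on top of this basic score.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) in nats between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability estimate
            px = np.mean(x == xv)
            py = np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# Toy dataset: feature 0 copies the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y, rng.integers(0, 2, size=200)])

scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]  # indices of features, most informative first
```

A filter method of this kind scores each feature independently of any downstream classifier, which is what makes it cheap per feature but, as the experiments here note, costly overall when the number of features is very large.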


Cited By

  • (2024) A Novel Dynamic Programming Method for Non-parametric Data Discretization. Intelligent Information and Database Systems, 10.1007/978-981-97-4982-9_17, pp. 215–227. Online publication date: 16-Jul-2024.


Published In

SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology
December 2022, 474 pages
ISBN: 9781450397254
DOI: 10.1145/3568562

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. feature scoring
        2. feature selection criteria
        3. financial datasets

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        SoICT 2022

        Acceptance Rates

        Overall Acceptance Rate 147 of 318 submissions, 46%
