Research article · SoICT Conference Proceedings · DOI: 10.1145/3568562.3568604

Empirical Analysis of Filter Feature Selection Criteria on Financial Datasets

Published: 01 December 2022

Abstract

High dimensionality is one of the data quality problems that affect the performance of machine learning models. Feature selection, which aims to identify and remove as many redundant and irrelevant features as possible, boosts the overall performance of the models while reducing the computational cost. However, choosing an appropriate feature selection method remains a challenge, as no single selection criterion fits all datasets. It is therefore essential to comparatively analyze the performance of feature selection criteria according to different characteristics of high-dimensional datasets, particularly large financial datasets whose features are highly correlated and redundant. In this paper, we explore nine feature selection criteria, typically categorized into two classes: (i) information-theoretic criteria and (ii) similarity-based criteria, over seven public financial datasets. To the best of our knowledge, no previous comprehensive empirical investigation has demonstrated the effects of feature selection criteria on financial data. Experimental results indicate that the information-theoretic methods suffer from high computation time on high-dimensional data (i.e., a large number of features), while the similarity-based methods require significant computation on high-volume datasets (i.e., a large number of samples).
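To make the information-theoretic class of filter criteria concrete, the sketch below ranks discrete features by their empirical mutual information with the class label, the quantity underlying criteria such as MIM and mRMR. The function name and the toy dataset are our own illustration, not material from the paper; real criteria in this family add redundancy terms on top of this basic score.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical mutual information I(X; Y) in nats between two discrete arrays."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))  # joint probability estimate
            px = np.mean(x == xv)
            py = np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

# Toy dataset: feature 0 copies the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = np.column_stack([y, rng.integers(0, 2, size=200)])

scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
ranking = np.argsort(scores)[::-1]  # indices of features, most informative first
```

A filter method of this kind scores each feature independently of any downstream classifier, which is what makes it cheap per feature but, as the experiments here note, costly overall when the number of features is very large.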


Cited By

  • (2024) A Novel Dynamic Programming Method for Non-parametric Data Discretization. Intelligent Information and Database Systems, 10.1007/978-981-97-4982-9_17, pp. 215–227. Online publication date: 16-Jul-2024.


Published In

SoICT '22: Proceedings of the 11th International Symposium on Information and Communication Technology
December 2022, 474 pages
ISBN: 9781450397254
DOI: 10.1145/3568562

Publisher

Association for Computing Machinery, New York, NY, United States


        Author Tags

        1. feature scoring
        2. feature selection criteria
        3. financial datasets

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Conference

        SoICT 2022

        Acceptance Rates

        Overall Acceptance Rate 147 of 318 submissions, 46%
