Very large-scale data classification based on K-means clustering and multi-kernel SVM

Published: 01 June 2019

Abstract

Classifying very large-scale data sets poses two major challenges: labeling a sufficient number of training samples is time-consuming and laborious, and it is difficult to train a model that is both time-efficient and highly accurate. A high-accuracy model normally requires a large, representative training set, and a large training set in turn requires significantly more training time. There is thus a trade-off between speed and accuracy in classification training, especially for large-scale data sets. To address this problem, a novel strategy for large-scale data classification is proposed that combines K-means clustering with a multi-kernel support vector machine (SVM). First, K-means clustering is applied to a small portion of the original data set. The clustering stage uses a special strategy to select representative training instances, which reduces the need to create a large training set and the subsequent manual labeling work. K-means clustering has two characteristics: (1) the result is strongly influenced by the cluster number k, and (2) the optimal result is difficult to achieve. The proposed strategy exploits both characteristics to find the most representative instances by defining a relaxed cluster number k and running K-means repeatedly. In each K-means clustering step, both the nearest and the farthest instance to each cluster center are added to a selection set. The selected instances therefore have a distribution representative of the whole original data set, which reduces the need to label the original data set. An outlier detection method is then applied to remove outlier instances according to their outlier scores. Finally, a multi-kernel SVM is trained on the selected instances, yielding a classifier model that can predict subsequent new instances. The evaluation results show that the proposed instance selection method significantly reduces the size of the training data set and the training time while maintaining relatively good accuracy.
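The instance selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`kmeans`, `select_representatives`), the list of candidate cluster numbers `k_values`, the number of `repeats`, and the use of Euclidean distance are all assumptions made for illustration, and the outlier filtering and multi-kernel SVM stages are left out because the abstract does not specify them.

```python
import math
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's algorithm (illustrative); returns (centers, assignment)."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assignment step: attach every point to its closest center.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment so labels agree with the returned centers.
    assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
              for p in points]
    return centers, assign


def select_representatives(points, k_values, repeats=3, seed=0):
    """Return sorted indices of the nearest and farthest member of each cluster.

    k_values and repeats are hypothetical knobs standing in for the
    "relaxed cluster number k" and the repeated K-means runs.
    """
    rng = random.Random(seed)
    selected = set()
    for k in k_values:            # try several cluster counts
        for _ in range(repeats):  # repeat with fresh random starts
            centers, assign = kmeans(points, k, rng)
            for c in range(k):
                members = [i for i, a in enumerate(assign) if a == c]
                if not members:
                    continue

                def dist_to_center(i, c=c):
                    return math.dist(points[i], centers[c])

                selected.add(min(members, key=dist_to_center))  # nearest
                selected.add(max(members, key=dist_to_center))  # farthest
    return sorted(selected)


if __name__ == "__main__":
    rng = random.Random(1)
    # Three synthetic 2-D blobs, 100 points each (made-up demo data).
    data = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in [(0, 0), (5, 5), (0, 5)] for _ in range(100)]
    subset = select_representatives(data, k_values=[3, 4, 5])
    print(len(subset), "of", len(data), "instances kept for labeling")
```

Because every run contributes at most two instances per cluster, the selected set is bounded by twice the total number of clusters across all runs, regardless of how large the input is. Only this subset would then need to be labeled before outlier removal and SVM training.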

        Published In

        Soft Computing - A Fusion of Foundations, Methodologies and Applications  Volume 23, Issue 11
        June 2019
        332 pages
        ISSN:1432-7643
        EISSN:1433-7479

        Publisher

        Springer-Verlag

        Berlin, Heidelberg

        Author Tags

        1. K-means clustering
        2. Multi-kernel SVM
        3. Outlier detection
        4. Very large-scale classification

        Qualifiers

        • Article

        Cited By

        • (2024) Fast and De-noise Instance Selection Method for SVMs Training Based on Clustering and Intuitionistic Fuzzy Number. Advanced Intelligent Computing Technology and Applications, pp. 299-311. DOI: 10.1007/978-981-97-5678-0_26. Online publication date: 5-Aug-2024
        • (2023) Simulation model and fault analysis of air circulation system of the aircraft based on grasshopper optimization algorithm: support vector machine. Soft Computing - A Fusion of Foundations, Methodologies and Applications 27:18, pp. 13269-13284. DOI: 10.1007/s00500-022-07403-2. Online publication date: 1-Sep-2023
        • (2022) Analysis of hyperspectral images for detection of drought stress and recovery in maize plants in a high-throughput phenotyping platform. Computers and Electronics in Agriculture 162:C, pp. 749-758. DOI: 10.1016/j.compag.2019.05.018. Online publication date: 20-Apr-2022
        • (2022) Novel non-Kernel quadratic surface support vector machines based on optimal margin distribution. Soft Computing - A Fusion of Foundations, Methodologies and Applications 26:18, pp. 9215-9227. DOI: 10.1007/s00500-022-07354-8. Online publication date: 1-Sep-2022
        • (2021) Multi-attribute overlapping radar working pattern recognition based on K-NN and SVM-BP. The Journal of Supercomputing 77:9, pp. 9642-9657. DOI: 10.1007/s11227-021-03660-4. Online publication date: 1-Sep-2021
        • (2020) High-Performance Machine Learning for Large-Scale Data Classification considering Class Imbalance. Scientific Programming 2020. DOI: 10.1155/2020/1953461. Online publication date: 18-May-2020
        • (2020) Hybrid machine learning for predicting strength of sustainable concrete. Soft Computing - A Fusion of Foundations, Methodologies and Applications 24:19, pp. 14965-14980. DOI: 10.1007/s00500-020-04848-1. Online publication date: 11-Mar-2020
        • (2019) A Data Compacting Technique to reduce the NetFlow size in Botnet Detection with BotCluster. Proceedings of the 6th IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, pp. 81-84. DOI: 10.1145/3365109.3368778. Online publication date: 2-Dec-2019
        • (2019) Multiobjective evolutionary-based multi-kernel learner for realizing transfer learning in the prediction of HIV-1 protease cleavage sites. Soft Computing - A Fusion of Foundations, Methodologies and Applications 24:13, pp. 9727-9751. DOI: 10.1007/s00500-019-04487-1. Online publication date: 5-Nov-2019
