Keywords

1 Introduction

The goal of the supervised classification is to build a mathematical model of a real-life problem using a labeled dataset. This mathematical model is used to assign the class label to each new recognized object, which, in general, does not belong to the training set. The individual classification model is called a base classifier. Ensemble methods are a vastly used approach to improve the possibilities of base classifiers by building a more stable and accurate ensemble of classifiers (EoC) [23, 28]. In general, the procedure for building EoC consists of three steps: generation, selection, and integration [18]. An imbalanced data problem occurs when the prior probability of classes in a given dataset is very diverse. There are many real-life problems in which we deal with imbalanced data [11, 25], e.g., network intrusion detection [2, 14], source code fault detection [8], or in general fraud detection [1].

There exist two main approaches to solve the problem of imbalanced data: a data-level [9, 26] and an algorithm-level solution [29]. EoC is one of the approaches to solve the imbalanced data classification problem which improve classification measure compared to single models and is highly competitive and robust to imbalanced data [10, 13, 16]. The use of not only voting in the EoC integration phase is one of the directions to solve a problem with imbalanced data [15]. Therefore, this article concerns about calculating the weighted scoring function to be applied in the weighted voting process.

In the process of EoC generation, we use the K-Means clustering algorithm [5] for each class label separately. The base linear classifier – Support Vector Machine [7] – is trained on cluster combination. The weighted scoring function takes into account the distance of a classified object from the decision boundary and cluster centroids used to learn the proper base classifier. Regardless of the number of learning objects in a given cluster, the cluster centroid is always determined. The proposed method for determining the scoring function is, therefore, insensitive to the number of objects defining the cluster. As shown in the article, the proposed approach is useful for imbalanced data.

The main objectives of this work are summarized as follows:

  • A proposal of a new weighted scoring function that uses the location of the cluster centroids and distance to the decision boundary.

  • The proposition of an algorithm that uses clustering and the proposed function.

  • A new experimental setup to compare the proposed method with other algorithms on highly imbalanced datasets.

The paper is structured as follows: Sect. 2 introduces the base concept of EoC and presents the proposed algorithm. In Sect. 3, the experiments that were carried out are presented, while results and the discussion appear in Sect. 4. Finally, we conclude the paper in Sect. 5.

2 Clustering and Weighted Scoring

2.1 Basic Notation

From a probabilistic point of view the recognition algorithm \(\varPsi \) maps the feature space \(\mathcal {X} \subseteq \mathfrak {R}^d\) to the set of class labels \(\varOmega =\{\omega _1,\omega _2,\ldots ,\omega _C\}\) according to the general formula:

$$\begin{aligned} \varPsi :\mathcal {X} \rightarrow \varOmega . \end{aligned}$$
(1)

For the feature vector \(x \in \mathcal {X}\), that represents the recognized object the Eq. (1) can be expressed as:

$$\begin{aligned} \varPsi (x)=\omega _c. \end{aligned}$$
(2)

Let us assume that L different base classifiers \(\varPsi _1,\varPsi _2,\ldots ,\varPsi _L\), are employed to solve the classification task. This set of classifiers defines EoC. One of the most popular methods to integrate outputs of the base classifiers set L is the majority vote rule. In this method, each base model has the same impact on the final decision of EoC. This method allows counting base classifiers outputs as a vote for a class and assigns the input pattern to the class with the greatest count of votes. It is defined as follows:

$$\begin{aligned} \varPsi _{MV}(x)=\arg \max _{\omega _c}\sum _{k=1}^{L}I(\varPsi _k(x)=\omega _c), \end{aligned}$$
(3)

where \(I(\cdot )\) is the indicator function.

In the weighted majority voting rule, the integration phase includes probability estimators or other factors of base models to the final decision of EoC [19], like in Eq. 4:

$$\begin{aligned} \varPsi _{MV}(x)=\arg \max _{\omega _c}\sum _{k=1}^{L}w_kI(\varPsi _k(x)=\omega _c), \end{aligned}$$
(4)

where \(w_k\) is the weight assigned to the classifier \(\varPsi _k\).

Over the last years, the issue of calculating the weights in the voting rule has been considered many times. The article [30] presents an approach in which the weights are combining with local confidence. The classifier trained on a subset of training data should be limited to the area it spans in an impact on the resulting classifier. The problem of generalization of majority voting was studied in [12]. The authors are using a probability estimate calculated as percentage of properly classified validation objects over geometric constraints. Separately, regions that are functionally independent are considered. A significant improvement in the classification quality was observed when using the proposed algorithm, although knowledge of the domain is needed to provide a proper division. The authors are using a retinal image and classify over anatomic regions.

The weights of the base classifier are also considered in the context of the interval-valued fuzzy sets [6]. The upper weight of base the classifier refers to the situation in which the definite base classifier was correct, while the other classifiers proved the correct prediction. The lower weight describes the situation in which the definite base classifier made errors, while the other classifiers didn’t make any errors.

In the paper [22] weights are determined for each label separately over the entire validation dataset. This can lead to the improvement of the performance of the resulting integrated classifier.

The following article is a proposition of an algorithm assigning weights not to base classifiers, but recognized objects. The weight for each object depends on its location in the feature space. Therefore, the weight of an object is determined by the score function calculated in the geometric space.

2.2 Proposed Method

We propose that the score function of the object x depends on its position in the geometric space. In particular, the distance from the decision boundary of the base classifier \(\varPsi _l\) from EoC and the cluster centroids used to learn this classifier are used to calculated the unweighted score function:

$$\begin{aligned} sf_l(x) = \Vert \psi _l(x)\Vert + \sum _{c=1}^{C} d_c, \end{aligned}$$
(5)

where \(d_c\) is the distance from x to cluster centroid in Manhattan metric. This metric was chosen because of the lowest calculation cost among all the considered alternatives. The calculation of the distance occurs between the centroids of all clusters and all tested patterns, so the computational complexity of the prediction procedure is very much dependent on the chosen metric, which is the reason for minimizing its impact.

We propose the following scoring weighting method:

$$\begin{aligned} wsf_l(x) = 1 - \frac{sf_l(x)}{\sum _{l=1}^{L} sf_l(x)}, \end{aligned}$$
(6)

which includes all scoring functions obtained for each classifier from EoC.

Figure 1 shows how to calculate the object’s score function for a linear dichotomic classifier and two cluster centroids. A solid red line marks the decision boundary of the base linear classifier \(\varPsi _l\) constructed for the selected class cluster combination – \(C^{\omega _1}\) and \(C^{\omega _2}\). Blue points are cluster centroids. The sum of the dashed sections indicates the value of the score function for the tested object x. The red dashed line – the distance to decision boundary \(\Vert \psi _l(x)\Vert \), the blue dashed line – the distance to cluster centroids determined by the Manhattan metric \(d_1\) and \(d_2\).

Fig. 1.
figure 1

Schema for calculating of the score function for the object x.

figure a

Algorithm 1 presents the pseudocode of the proposed approach to EoC with clustering and weighted scoring in the geometric space. In addition, Algorithm 1 concerns the dichotomous division of the learning set into class labels. These types of highly imbalanced datasets were used in the experimental research.

3 Experiments Set-Up

The experimental evaluation conducted for the needs of verification of the method proposed in the following work was based of 30 highly imbalanced datasets contained in the keel repository [3]. Datasets selected for the study are characterized by an imbalance ratio (the proportion between minority and majority class) ranging from 1:9 up to 1:40. Besides, due to the preliminary nature of the conducted research, the pool of datasets includes only binary classification problems.

The basis of the used division methodology was Stratified K-Fold Crossvalidation with \(k=5\), necessary to ensure the presence of minority class patterns in each of the analyzed training subsets. Statistical tests, for both pair and rank tests, were carried out using the Wilcoxon test with the significance level \(\alpha =0.05\) [4].

The analysis was conducted following the four classification approaches:

  • (svc) Support Vector Machine—the base experimental model with the scaled gamma and linear kernel [21].

  • (cws) Clustering and Weighted ScoringEoC with the pool diversified by pairs of clusters and integrated geometrically by the rules introduced in Sect. 2.

  • (cmv) Clustering and Majority VoteEoC identical with cws but integrated using the majority vote [24].

  • (csa) Clustering and Support AccumulationEoC identical with cws and cmv but integrated using the support accumulation rule [27].

Table 1. Results achieved by the analyzed method for the balanced accuracy score metric.

In the construction of each EoC, in order to limit the number of the presented tables and readability of the analysis, each time we construct the ensemble by dividing classes into two clusters, thus building a pool of four members. In the case of data as strongly imbalanced as those from the selected databases, often only a few (literally four or five) minority class objects remain in a single fold so that a more substantial number would treat almost every minority object as a separate cluster.

The whole experimental evaluation was performed in Python, using the scikit-learn api [20] to implement the cws method and is publicly available on the git repositoryFootnote 1. As metrics for the conducted analysis, due to the imbalanced nature of the classification problem, three aggregate measures (balanced accuracy score, F1-score, and G-mean) and three base measures constituting their calculation (precision, recall, and specificity) were applied, using their implementation included in the stream-learn package [17].

4 Experimental Evaluation

For the readability of the analysis, the full results of the experiment, along with the presentation of the relation between the algorithms resulting from the conducted paired tests, are presented only for the balanced accuracy score (Table 1) and recall (Table 2) metrics.

As may be observed, for aggregate metrics (results are consistent for both balanced accuracy and G-mean, only in F1-score presenting a slightly smaller scale of differences) the use of majority voting (cmv) for EoC diversified with clustering, often leads to deterioration of the classification quality even concerning a single base classifier. Integration by support accumulation (csa) performs slightly better, due to taking into consideration the certainty (support) of the decisions of each classifier but ignoring their areas of competence. The use of areas of competence present in the cws method allows for substantial improvement in classification results, often leading to a statistically significant advantage. The primary source of advantage in results is a significant improvement in the recall metric in this approach.

These observations become even clearer when we look at the results of the ranking tests presented in Table 3. In the case of balanced accuracy and G-mean, the cws method is statistically significantly better than in all other cases. For the F1-score metric, despite the numerical advantage, the statistical significance of the base method disappears, which is due to the symmetry of the F1-score metric relative to problem classes, which accepts the equal cost of a wrong decision to the minority and majority class.

Table 2. Results achieved by the analyzed method for the recall metric.
Table 3. Results for mean ranks achieved by analyzed methods with all metrics included in the evaluation.

The design of the recognition algorithm dedicated to imbalanced data is almost always based on the calibration of factors measurable by the base classification metrics. As in the case of the cws method, we try to increase the recall so that the inevitable reduction of precision or specificity will further give us a significant statistical advantage in the chosen aggregate metric, selected to define the cost of the incorrect classification that is relevant to us. In this case, the cws method turns out to be much better than the other methods of EoC integration with a pool diversified by clusters and allows a statistically significant improvement of the base method in the case of highly imbalanced data.

5 Conclusions

This paper presented a new clustering and weighted scoring algorithm dedicated to constructing EoC. We proposed that the scoring function should take into account the distance from the decision boundary of each base classifier and the cluster centroids necessary to learn this classifier. In the proposed weighting scoring function the distance to the decision boundary and sum of the distances to the centroids have the same weight. The proposed approach applies to imbalanced datasets because each cluster centroid can be calculated regardless of the number of objects in this cluster.

Comprehensive experiments are presented on thirty examples of highly imbalanced datasets. The obtained results show that the proposed algorithm is better than other algorithms in the context of statistical tests and some performance measures. In particular, in the case of the balanced accuracy and G-mean classification measures, the proposed in this paper method is statistically significantly better than all others methods used in the experiments.

A possible future work is to be considered: other distance measures to calculate the distance to cluster centroids, the impact of the use of heterogeneous base classifiers in the proposed method of building EoC or another scoring weighting method. In particular, we suggest that the distance from the decision boundary can be scaled or weighting factors regarding the distance of the object from the decision boundary and the distance from the cluster centroids can be introduced. Additionally, we can assign weights for cluster centroids depending on the number of objects that were used to determine these centroids.