Abdoulaye SAKHO1, 2, Emmanuel MALHERBE1, Carl-Erik GAUTHIER3, Erwan SCORNET2
1 Artefact Research Center,
2 LPSM - Sorbonne Université, 3 Société Générale
Preprint.
[Full Paper]
Abstract: This study investigates rare event detection on tabular data within binary classification. Many real-world classification tasks, such as those in the banking sector, involve mixed (continuous and categorical) features, which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest.
This repository contains the code to reproduce the paper's experiments, as well as an efficient implementation of our new strategy that you can use in your own projects.
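To make the continuous part of the abstract more concrete, here is a minimal NumPy sketch of one way to read "a kernel density estimator with locally estimated covariances": pick a minority point at random, estimate a mean and covariance from its nearest minority neighbors, and draw a new point from the corresponding Gaussian. This is an illustration under our own assumptions, not the package's implementation; the function name `sample_local_gaussian` and its arguments are hypothetical, and the categorical step (generalized random forest) is not covered.

```python
import numpy as np


def sample_local_gaussian(X_min, n_new, K, rng):
    """Draw n_new synthetic points from Gaussians fitted on local neighborhoods.

    X_min: (n, d) array holding the continuous features of the minority class.
    K: number of neighbors used to estimate each local mean and covariance.
    Hypothetical helper for illustration only.
    """
    n, d = X_min.shape
    # Pairwise squared Euclidean distances between minority points.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    sq_dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # For each point: indices of itself and its K nearest neighbors.
    neighbors = np.argsort(sq_dists, axis=1)[:, : K + 1]

    new_points = np.empty((n_new, d))
    seeds = rng.integers(0, n, size=n_new)  # central points, drawn uniformly
    for i, seed in enumerate(seeds):
        local = X_min[neighbors[seed]]
        mu = local.mean(axis=0)
        # Empirical covariance of the neighborhood (rows are observations).
        cov = np.atleast_2d(np.cov(local, rowvar=False))
        new_points[i] = rng.multivariate_normal(mu, cov)
    return new_points


rng = np.random.default_rng(0)
X_min = rng.normal(size=(30, 2))  # toy minority samples with 2 continuous features
synthetic = sample_local_gaussian(X_min, n_new=50, K=2, rng=rng)
print(synthetic.shape)  # (50, 2)
```

A covariance estimated from K + 1 points has rank at most K, so in this sketch K must be at least the number of continuous features for the local covariance to be full rank.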
Here is a short example on how to use MGS-GRF:
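from collections import Counter
import lightgbm as lgb
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from mgs_grf import MGSGRFOverSampler
# X_train, y_train, numeric_features and categorical_features (column indices)
# are assumed to be defined beforehand.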
## Apply MGS-GRF procedure to oversample the data
mgs_grf = MGSGRFOverSampler(K=len(numeric_features),categorical_features=categorical_features,random_state=0)
balanced_X, balanced_y = mgs_grf.fit_resample(X_train,y_train)
print("Augmented data : ", Counter(balanced_y))
## Encode the categorical variables
enc = OneHotEncoder(handle_unknown='ignore',sparse_output=False)
balanced_X_encoded = enc.fit_transform(balanced_X[:,categorical_features])
balanced_X_final = np.hstack((balanced_X[:,numeric_features],balanced_X_encoded))
# Fit the final classifier on the augmented data
clf_mgs = lgb.LGBMClassifier(n_estimators=100,verbosity=-1, random_state=0)
clf_mgs.fit(balanced_X_final, balanced_y)
A more detailed notebook example is available here.
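The short example above relies on `X_train`, `y_train`, `numeric_features` and `categorical_features` already being defined. As a rough sketch of inputs with a compatible shape, the toy data below builds an imbalanced NumPy array with two continuous columns and one categorical column, plus the two lists of column indices. The data is synthetic and purely illustrative; in particular, encoding the categories as integers stored in a float array is an assumption, so check the detailed notebook for the format the package actually expects.

```python
import numpy as np

rng = np.random.default_rng(0)
n_majority, n_minority = 950, 50
n_total = n_majority + n_minority

# Two continuous columns and one integer-coded categorical column (3 levels).
X_num = rng.normal(size=(n_total, 2))
X_cat = rng.integers(0, 3, size=(n_total, 1))
X_train = np.hstack((X_num, X_cat))

# Binary target: 0 for the majority class, 1 for the rare class.
y_train = np.array([0] * n_majority + [1] * n_minority)

# Column positions, matching the indexing used in the example above.
numeric_features = [0, 1]
categorical_features = [2]

print(X_train.shape)  # (1000, 3)
```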
If you want to reproduce our paper experiments:
- Section 4.2: the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
- Section 4.3: the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
- Section 5: the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
The data sets used in our article should be downloaded into the data/externals folder. They are available at the following addresses:
This work was done through a partnership between Artefact Research Center and the Laboratoire de Probabilités Statistiques et Modélisation (LPSM) of Sorbonne University.
If you find the code useful, please consider citing us:
@article{sakho2025harnessing,
title={Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring},
author={Sakho, Abdoulaye and Malherbe, Emmanuel and Gauthier, Carl-Erik and Scornet, Erwan},
journal={arXiv preprint arXiv:2503.22730},
year={2025}
}