Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring

Abdoulaye SAKHO¹,², Emmanuel MALHERBE¹, Carl-Erik GAUTHIER³, Erwan SCORNET²
¹ Artefact Research Center, ² LPSM - Sorbonne Université, ³ Société Générale

Preprint.
[Full Paper]

Abstract: This study investigates rare event detection on tabular data within binary classification. Many real-world classification tasks, such as those in the banking sector, involve mixed features (continuous and categorical), which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. The method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical features are drawn from the original samples through a generalized random forest.
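
To make the continuous part of the abstract more concrete, here is a minimal sketch of sampling from Gaussians whose covariances are estimated on local neighborhoods. It is not the MGS-GRF implementation from this repository: the function name `sample_local_gaussian`, the neighborhood size `k`, the use of the local mean as the Gaussian center, and the small ridge term are all assumptions made for illustration, and the categorical step (generalized random forest) is not covered.

```python
import numpy as np


def sample_local_gaussian(X_min, k=5, n_new=10, seed=0):
    """Draw synthetic continuous points from Gaussians fitted on local neighborhoods.

    Illustrative sketch only (hypothetical helper, not part of mgs_grf).
    """
    rng = np.random.default_rng(seed)
    n, d = X_min.shape
    new_points = np.empty((n_new, d))
    for i in range(n_new):
        # Pick a random minority sample as the center of the neighborhood.
        center = X_min[rng.integers(n)]
        # Keep the center and its k nearest neighbors (Euclidean distance).
        dist = np.linalg.norm(X_min - center, axis=1)
        neighborhood = X_min[np.argsort(dist)[: k + 1]]
        # Local mean and full empirical covariance of the neighborhood.
        mu = neighborhood.mean(axis=0)
        cov = np.atleast_2d(np.cov(neighborhood, rowvar=False))
        # Small ridge (an assumption of this sketch) to keep the covariance full rank.
        cov = cov + 1e-6 * np.eye(d)
        new_points[i] = rng.multivariate_normal(mu, cov)
    return new_points


# Toy minority class: 30 samples with 3 continuous features.
rng = np.random.default_rng(42)
X_min = rng.normal(size=(30, 3))
synthetic = sample_local_gaussian(X_min, k=5, n_new=8)
print(synthetic.shape)
```

Using a full covariance matrix, rather than a diagonal or isotropic one, lets the synthetic points follow the local correlations between continuous features.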

This repository contains the code to reproduce the paper's experiments, as well as an efficient implementation of our new strategy that you can use in your own projects.

⭐ How to use the MGS-GRF Algorithm to learn on imbalanced data

Here is a short example of how to use MGS-GRF:

from collections import Counter

import lightgbm as lgb
import numpy as np
from sklearn.preprocessing import OneHotEncoder

from mgs_grf import MGSGRFOverSampler

# X_train, y_train, numeric_features and categorical_features are assumed to be defined beforehand.

# Apply the MGS-GRF procedure to oversample the data
mgs_grf = MGSGRFOverSampler(K=len(numeric_features), categorical_features=categorical_features, random_state=0)
balanced_X, balanced_y = mgs_grf.fit_resample(X_train, y_train)
print("Augmented data : ", Counter(balanced_y))

# Encode the categorical variables
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
balanced_X_encoded = enc.fit_transform(balanced_X[:, categorical_features])
balanced_X_final = np.hstack((balanced_X[:, numeric_features], balanced_X_encoded))

# Fit the final classifier on the augmented data
clf_mgs = lgb.LGBMClassifier(n_estimators=100, verbosity=-1, random_state=0)
clf_mgs.fit(balanced_X_final, balanced_y)

A more detailed notebook example is available here.
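
The short example relies on `X_train`, `y_train`, `numeric_features` and `categorical_features` being defined already. The sketch below shows one way such inputs might look on toy data. It assumes, based only on the indexing `balanced_X[:, categorical_features]` used in the example, that the features live in a single 2-D NumPy array and that the two feature lists hold integer column indices. The column meanings, the integer coding of the categories and the 10% positive rate are hypothetical; check the notebook for the input format the sampler actually expects.

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
n_samples = 200

# Two continuous columns (hypothetical toy features).
X_num = rng.normal(size=(n_samples, 2))
# One categorical column stored as integer codes 0, 1, 2 (hypothetical).
X_cat = rng.integers(0, 3, size=(n_samples, 1))

# Single 2-D array: columns 0 and 1 are numeric, column 2 is categorical.
X_train = np.hstack((X_num, X_cat))
numeric_features = [0, 1]
categorical_features = [2]

# Imbalanced binary target with 10% positives, shuffled in place.
y_train = np.zeros(n_samples, dtype=int)
y_train[: n_samples // 10] = 1
rng.shuffle(y_train)

print("Original data : ", Counter(y_train.tolist()))
```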

⭐ Reproducing the paper experiments

If you want to reproduce our paper experiments:

Section 4.2: the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
Section 4.3: the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
Section 5: the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.

⭐ Data sets

The data sets used in our article should be downloaded into the data/externals folder. They are available at the following addresses:

⭐ Acknowledgements

This work was done through a partnership between Artefact Research Center and the Laboratoire de Probabilités Statistiques et Modélisation (LPSM) of Sorbonne University.

⭐ Citation

If you find the code useful, please consider citing us:

@article{sakho2025harnessing,
  title={Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring},
  author={Sakho, Abdoulaye and Malherbe, Emmanuel and Gauthier, Carl-Erik and Scornet, Erwan},
  journal={arXiv preprint arXiv:2503.22730},
  year={2025}
}
