
artefactory/mgs-grf


Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring

Abdoulaye SAKHO¹,², Emmanuel MALHERBE¹, Carl-Erik GAUTHIER³, Erwan SCORNET²
¹ Artefact Research Center, ² LPSM - Sorbonne Université, ³ Société Générale

Preprint.
[Full Paper]

Abstract: This study investigates rare event detection on tabular data within binary classification. Many real-world classification tasks, such as those in the banking sector, involve mixed (continuous and categorical) features, which have a significant impact on predictive performance. To this end, we introduce MGS-GRF, an oversampling strategy designed for mixed features. This method uses a kernel density estimator with locally estimated full-rank covariances to generate continuous features, while categorical ones are drawn from the original samples through a generalized random forest.
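
To make the continuous-feature step of the abstract more concrete, here is a minimal NumPy sketch of sampling from Gaussians with locally estimated covariances. It is an illustration under stated assumptions, not the implementation shipped in this repository: the function name `sample_local_gaussian`, the choice of the neighbourhood mean as the Gaussian centre, and the use of the biased empirical covariance are all hypothetical. The categorical step (generalized random forest) is not covered.

```python
import numpy as np


def sample_local_gaussian(X_min, K, n_new, rng):
    """Draw n_new synthetic points, each from a Gaussian fitted to a randomly
    chosen minority sample and its K nearest minority neighbours.

    Illustrative only: names and modelling choices are assumptions.
    """
    n, d = X_min.shape
    # Pairwise squared Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.einsum("ijk,ijk->ij", diffs, diffs)
    # For each sample: itself plus its K nearest neighbours (K + 1 indices).
    neigh = np.argsort(dists, axis=1)[:, : K + 1]

    out = np.empty((n_new, d))
    for i in range(n_new):
        centre = rng.integers(n)          # pick a central minority sample
        local = X_min[neigh[centre]]      # its local neighbourhood
        mu = local.mean(axis=0)           # local mean
        # Local empirical covariance; generically full rank when K >= d.
        cov = np.atleast_2d(np.cov(local, rowvar=False, bias=True))
        out[i] = rng.multivariate_normal(mu, cov)
    return out


rng = np.random.default_rng(0)
X_min = rng.normal(size=(50, 3))          # toy minority-class samples
X_new = sample_local_gaussian(X_min, K=3, n_new=20, rng=rng)
print(X_new.shape)  # (20, 3)
```

With K + 1 points per neighbourhood, the centred covariance has rank at most K, so K must be at least the number of continuous features for it to be full rank. This is consistent with the `K=len(numeric_features)` setting used in the usage example of this README.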

This repository contains the code to reproduce the paper's experiments, as well as an implementation of MGS-GRF that you can use in your own projects.

⭐ How to use the MGS-GRF Algorithm to learn on imbalanced data

Here is a short example of how to use MGS-GRF:

from collections import Counter

import lightgbm as lgb
import numpy as np
from sklearn.preprocessing import OneHotEncoder

from mgs_grf import MGSGRFOverSampler

# X_train, y_train, numeric_features and categorical_features must be defined
# beforehand; the feature lists are used below as column indices of X_train.

# Apply the MGS-GRF procedure to oversample the data
mgs_grf = MGSGRFOverSampler(K=len(numeric_features), categorical_features=categorical_features, random_state=0)
balanced_X, balanced_y = mgs_grf.fit_resample(X_train, y_train)
print("Augmented data : ", Counter(balanced_y))

# One-hot encode the categorical variables and stack them after the numeric ones
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
balanced_X_encoded = enc.fit_transform(balanced_X[:, categorical_features])
balanced_X_final = np.hstack((balanced_X[:, numeric_features], balanced_X_encoded))

# Fit the final classifier on the augmented data
clf_mgs = lgb.LGBMClassifier(n_estimators=100, verbosity=-1, random_state=0)
clf_mgs.fit(balanced_X_final, balanced_y)

A more detailed notebook example is available here.
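
The usage example relies on `X_train`, `y_train`, `numeric_features` and `categorical_features` being defined beforehand. The sketch below builds toy versions of them with NumPy only. It assumes that the two feature lists hold integer column positions, which is how the example indexes `balanced_X`, and that categories are integer-coded. The data and the split are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two continuous columns and one integer-coded categorical column.
X_num = rng.normal(size=(n, 2))
X_cat = rng.integers(0, 3, size=(n, 1))
X = np.hstack((X_num, X_cat))

# Rare positive class: each sample is positive with probability 0.05.
y = (rng.random(n) < 0.05).astype(int)

# Column positions of each feature type (assumed convention).
numeric_features = [0, 1]
categorical_features = [2]

# Simple random 80/20 holdout split.
perm = rng.permutation(n)
train_idx, test_idx = perm[:800], perm[800:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

For a data set this imbalanced, a stratified split is usually preferable, so that the training and test sets keep the same proportion of rare events.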

⭐ Reproducing the paper experiments

If you want to reproduce our paper experiments:

  • Section 4.2 : the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
  • Section 4.3 : the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.
  • Section 5 : the Python file reproduces the experiments (data sets, oversampling and training). The results can then be analyzed with this notebook.

⭐ Data sets

The data sets used for our article should be downloaded into the data/externals folder. The data sets are available at the following addresses:

⭐ Acknowledgements

This work was done through a partnership between Artefact Research Center and the Laboratoire de Probabilités Statistiques et Modélisation (LPSM) of Sorbonne University.

   

⭐ Citation

If you find the code useful, please consider citing us:

@article{sakho2025harnessing,
  title={Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring},
  author={Sakho, Abdoulaye and Malherbe, Emmanuel and Gauthier, Carl-Erik and Scornet, Erwan},
  journal={arXiv preprint arXiv:2503.22730},
  year={2025}
}
