German Credit Risk & Fairness Analysis πŸ§ πŸ“Š

πŸ€– Automated credit scoring using H2O AutoML with fairness evaluation by gender.
Includes data preprocessing, SHAP explainability, parity metrics, and visual risk profiling.
The project is fully written in R and structured for reproducibility.


πŸ“ Project Structure

  • data/ – raw input data (.csv)
  • output/ – generated outputs (plots, predictions)
  • src/ – core analysis script
  • README.md – project documentation

πŸ”’ Block 1: Load required packages and get script path

πŸ“„ src/analysis.R
This block loads required libraries and dynamically detects the working directory of the script to ensure all relative paths work properly.

# Install rstudioapi on first run, then load it
if (!require("rstudioapi")) install.packages("rstudioapi")
library(rstudioapi)

# Directory of the currently open script (works inside RStudio)
script.dir <- dirname(rstudioapi::getSourceEditorContext()$path)

βœ… This makes the script self-contained and portable β€” no hardcoded file paths required.
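One caveat: getSourceEditorContext() only works inside RStudio. A hedged fallback for running the script via Rscript as well (the helper name is illustrative, not part of the original script):

# Fallback: resolve the script directory outside RStudio too
get_script_dir <- function() {
  if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
    return(dirname(rstudioapi::getSourceEditorContext()$path))
  }
  # Rscript passes the script path as a --file= argument
  file_arg <- grep("^--file=", commandArgs(trailingOnly = FALSE), value = TRUE)
  if (length(file_arg) > 0) return(dirname(sub("^--file=", "", file_arg[1])))
  getwd()  # last resort: current working directory
}
script.dir <- get_script_dir()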

πŸ”’ Block 2: Load and prepare the dataset

πŸ“„ src/analysis.R
πŸ“₯ german_credit_data_small.csv
This block loads the CSV file containing credit data and drops the unnecessary ID column.

data <- read.csv(file.path(script.dir, "german_credit_data_small.csv"))
data <- data[, 2:11]  # Drop ID column

πŸ“ The dataset contains information on credit applicants, including age, sex, job type, housing status, account balances, loan amount, duration, purpose, and risk classification. Removing the ID column helps ensure that modeling is not biased by an arbitrary identifier.

πŸ”’ Block 3: Recode categorical variables

πŸ“„ src/analysis.R
This block transforms categorical variables into numerical or factor encodings to prepare them for modeling.

library(dplyr)

# Recode labels to compact codes, then convert to factors.
# Note: recent dplyr versions require character replacements when recoding
# a character column, so the codes are quoted; as.factor() keeps them as levels.
data$Sex <- as.factor(recode(data$Sex, 'male' = '1', 'female' = '2'))
data$Job <- as.factor(data$Job)
data$Housing <- as.factor(recode(data$Housing, 'free' = '1', 'rent' = '2', 'own' = '3'))
data$Saving.accounts <- as.factor(recode(data$Saving.accounts, 'little' = '1', 'moderate' = '2',
                                         'quite rich' = '3', 'rich' = '4'))
data$Checking.account <- as.factor(recode(data$Checking.account, 'little' = '1', 'moderate' = '2',
                                          'rich' = '3'))  # 'rich' also occurs in this column
data$Credit.amount <- as.numeric(data$Credit.amount)
data$Duration <- as.numeric(data$Duration)
data$Purpose <- as.factor(recode(data$Purpose, 'vacation/others' = '1', 'radio/TV' = '2',
                                 'domestic appliances' = '3', 'furniture/equipment' = '4',
                                 'repairs' = '5', 'car' = '6', 'education' = '7', 'business' = '8'))
data$Risk <- as.factor(recode(data$Risk, 'good' = '0', 'bad' = '1'))

🎯 Variables like Sex, Housing, Purpose, Saving.accounts, and Risk are transformed into factors with consistent numeric codes. This ensures compatibility with machine learning algorithms and fairness analysis later in the script.
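Before moving on, it is worth confirming that the classes and levels actually look as intended; a quick check:

# Verify the recoded types and the Risk levels used downstream
sapply(data, class)
levels(data$Risk)   # expected: "0" (good), "1" (bad)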

πŸ”’ Block 4: Handle missing values

πŸ“„ src/analysis.R
This block checks for missing values and imputes them using the MICE (Multivariate Imputation by Chained Equations) method.

library(mice)

# Check proportion of missing values
sum(is.na(data)) / (nrow(data) * ncol(data))

# Perform imputation
data <- complete(mice(data))

# Confirm that missing values are handled
sum(is.na(data)) / (nrow(data) * ncol(data))

πŸ“Š The script calculates the percentage of missing data and uses iterative chained equations to impute values. This helps maintain statistical integrity and prevents bias due to incomplete records.
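mice() is stochastic, so repeated runs can impute different values. A hedged, reproducible variant of the same step (the seed and m values are illustrative choices, not from the original script):

# Reproducible imputation: fixed seed, 5 imputed datasets, quiet iteration log
imp <- mice(data, m = 5, seed = 123, printFlag = FALSE)
data <- complete(imp)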

πŸ”’ Block 5: Create and save scatter plot

πŸ“„ src/analysis.R
πŸ“ˆ dotplot_germancredit.png
This block creates a scatter plot visualizing the relationship between credit amount and duration, colored by credit risk and shaped by gender.

library(ggplot2)

ggplot(data = data) +
  geom_point(aes(x = Duration, y = Credit.amount, color = as.factor(Risk), shape = as.factor(Sex)), size = 2) +
  theme(text = element_text(size = 15))

ggsave(filename = file.path(script.dir, "dotplot_germancredit.png"), width = 5, height = 4.5, dpi = 100)

πŸ–ΌοΈ The plot helps visually identify how loan characteristics relate to credit risk and gender, highlighting potential biases or patterns in approvals.

πŸ”’ Block 6: Export processed dataset

πŸ“„ src/analysis.R
πŸ“€ german_credit_data_small_processed.csv
This block saves the cleaned and transformed dataset to a new CSV file for later modeling steps.

write.csv(data, file.path(script.dir, "german_credit_data_small_processed.csv"))

πŸ’Ύ This file serves as the final input for the AutoML pipeline. Exporting it separately ensures a clean separation between data preprocessing and model training logic.
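One caveat: write.csv() prepends a row-name column by default, so the processed file gains an extra unnamed column. If the downstream pipeline should see exactly the modeling columns, a variant:

# Same export without the extra row-name column
write.csv(data, file.path(script.dir, "german_credit_data_small_processed.csv"),
          row.names = FALSE)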

πŸ”’ Block 7: Start H2O and load data

πŸ“„ src/analysis.R
This block initializes the H2O framework, converts the dataset into an H2O frame, and splits it into training and testing subsets.

library(h2o)
h2o.init()

# Convert data to H2O frame
frame <- as.h2o(data)

# Split into training (80%) and test (20%) sets
splits <- h2o.splitFrame(frame, ratios = 0.8)
train <- splits[[1]]
test <- splits[[2]]

πŸš€ The model will be trained on 80% of the data and validated on the remaining 20%. This split is essential for evaluating how well the model generalizes to unseen data.
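As written, the split is re-randomized on every run. h2o.splitFrame() accepts a seed, so a hedged, repeatable variant (reusing the seed from the AutoML call below) is:

# Repeatable 80/20 split; the train/test extraction stays the same
splits <- h2o.splitFrame(frame, ratios = 0.8, seed = 123)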

πŸ”’ Block 8: Train model using H2O AutoML

πŸ“„ src/analysis.R
This block runs automated machine learning (AutoML) using H2O, training multiple models and selecting the best one based on performance.

y <- "Risk"

aml <- h2o.automl(
  y = y,
  training_frame = train,
  max_runtime_secs = 120,
  seed = 123,
  exclude_algos = "StackedEnsemble"
)

🧠 This greatly simplifies model selection and tuning by letting the AutoML engine explore different algorithms and hyperparameters. The best-performing model is saved as aml@leader.
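To see what AutoML actually tried, the leaderboard can be printed; a short check (the exact column set depends on the H2O version):

# Inspect the ranked models; the top row corresponds to aml@leader
lb <- h2o.get_leaderboard(aml, extra_columns = "ALL")
print(lb, n = 10)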

πŸ”’ Block 9: Explain model behavior with SHAP

πŸ“„ src/analysis.R
This block explains the output of the model using SHAP (SHapley Additive exPlanations) values, which reveal the contribution of each feature to individual predictions.

# Full explanation reports (assign, then print the objects to render them)
exa <- h2o.explain(aml, test)          # explanations across all AutoML models
exm <- h2o.explain(aml@leader, test)   # explanations for the leader model only

# SHAP explanation for a single observation
h2o.shap_explain_row_plot(aml@leader, test, row_index = 1)

πŸ” SHAP values provide transparency by quantifying the effect each feature has on a prediction. This is crucial in high-stakes areas like credit scoring, where model decisions must be explainable.

πŸ”’ Block 10: Fairness analysis by gender

πŸ“„ src/analysis.R
This final block evaluates the model's fairness using multiple bias criteria, comparing prediction outcomes between male and female applicants.

# Generate predictions (keep only the predicted class column)
test2 <- test
test2$predict <- h2o.predict(aml@leader, newdata = test)$predict

# Criterion 1: Statistical (demographic) parity -- predicted bad-risk rate per group
# (Sex levels are the factor codes "1" = male, "2" = female from Block 3)
mean(as.numeric(test2[test2$Sex == "1", ]$predict) - 1)  # Male rejection rate
mean(as.numeric(test2[test2$Sex == "2", ]$predict) - 1)  # Female rejection rate

# Criterion 2: Rate ratio (disparate impact) -- male vs. female rejection rates
(
  mean(as.numeric(test2[test2$Sex == "1", ]$predict) - 1) /
  mean(as.numeric(test2[test2$Sex == "2", ]$predict) - 1)
)

# Criterion 3: Equal opportunity -- true positive rate among truly bad-risk applicants
mean(as.numeric(test2[test2$Sex == "1" & test2$Risk == "1", ]$predict) - 1)
mean(as.numeric(test2[test2$Sex == "2" & test2$Risk == "1", ]$predict) - 1)

# Criterion 4: Accuracy equality -- per-group prediction accuracy
mean(as.numeric(test2[test2$Sex == "1", ]$Risk == test2[test2$Sex == "1", ]$predict))
mean(as.numeric(test2[test2$Sex == "2", ]$Risk == test2[test2$Sex == "2", ]$predict))

βš–οΈ Fairness criteria computed:

Statistical (demographic) parity – equal rejection rates across genders

Odds ratio – ratio of rejection probabilities

Predictive parity – equal true positive rates

Accuracy equality – equal prediction accuracy between groups

βœ… This analysis helps detect potential discrimination or unfair treatment based on gender.
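The same criteria can also be collected into a single per-group table. A minimal sketch, assuming predict and Risk hold the 0/1 factor codes created above; it pulls the frame into local R memory, so it is meant for modest test sets:

# Hedged sketch: fairness metrics per gender in one data frame
df <- as.data.frame(test2)
df$pred <- as.numeric(as.character(df$predict))
df$risk <- as.numeric(as.character(df$Risk))

fairness_by_sex <- do.call(rbind, lapply(split(df, df$Sex), function(g) {
  data.frame(
    n              = nrow(g),
    rejection_rate = mean(g$pred),               # demographic parity
    tpr            = mean(g$pred[g$risk == 1]),  # equal opportunity
    accuracy       = mean(g$pred == g$risk)      # accuracy equality
  )
}))
fairness_by_sex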

πŸ“š Dataset Source

The dataset used in this project is a cleaned and preprocessed version of the original German Credit Risk dataset, published on Kaggle:

πŸ”— German Credit Risk Dataset – Kaggle
πŸ“Œ Originally sourced from the UCI Machine Learning Repository – Statlog (German Credit Data)

This dataset contains anonymized information on 1,000 credit applicants, including demographic and financial variables such as age, sex, job type, account status, loan purpose, and risk classification (good or bad). It is widely used in credit scoring research and fairness-aware machine learning.
