Fraud Detection

This project implements a machine learning-based fraud detection system using the Credit Card Fraud Detection Dataset. The system uses logistic regression and random forest classifiers to identify fraudulent transactions. The models are optimized using grid search with cross-validation and evaluated using various performance metrics.

Features

Two models: Logistic Regression and Random Forest classifiers.
Hyperparameter tuning: Uses GridSearchCV to optimize model parameters.
Feature Engineering: Adds new features based on transaction amounts and time-based features.
Model Evaluation: Precision, recall, F1-score, and confusion matrix to assess model performance.
Class Imbalance Handling: Ability to handle imbalanced data with techniques like class_weight='balanced' and SMOTE oversampling.
Interactive Model Selection: Users can select the model (Logistic Regression or Random Forest) when running the script.
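The feature-engineering step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name add_engineered_features and the two derived columns (log_amount, hour) are assumptions. It relies only on the Time and Amount columns that the Kaggle dataset provides.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with illustrative amount- and time-based features."""
    out = df.copy()
    # Transaction amounts are heavily right-skewed; log1p compresses the tail
    # and is defined at zero.
    out["log_amount"] = np.log1p(out["Amount"])
    # "Time" is seconds elapsed since the first transaction in the dataset,
    # so this is an hour within a 24-hour cycle, not a wall-clock hour.
    out["hour"] = (out["Time"] // 3600) % 24
    return out


# Tiny made-up frame, for illustration only.
sample = pd.DataFrame({"Time": [0.0, 3600.0, 90000.0], "Amount": [0.0, 9.0, 99.0]})
features = add_engineered_features(sample)
print(features[["log_amount", "hour"]])
```

Returning a copy keeps the raw frame untouched, so the same function can be applied to the training and test splits independently.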

Installation

Prerequisites

Python 3.8+
Libraries:
pandas
numpy
scikit-learn
matplotlib
joblib
imbalanced-learn (imported as imblearn; provides SMOTE, if using oversampling)
xgboost (optional)

Setting up the environment

Clone the repository:

git clone git@github.com:fnyamweya/fraud-detection.git
cd fraud-detection

Install dependencies: It's recommended to use a virtual environment.

python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt

Download the dataset: You can download the dataset from Kaggle's Credit Card Fraud Detection dataset, or if you have KaggleHub set up, use:

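import kagglehub
# dataset_download returns the local directory that holds the downloaded files
path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")
print("Dataset files are in:", path)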

Place the dataset in the data/ folder:

mkdir -p data
mv path/to/downloaded/creditcard.csv data/

Usage

Training and Evaluating Models

You can train and evaluate either the Logistic Regression or Random Forest model by running the script:

python main.py

Once the script starts, it will prompt you to choose which model to train:

Choose a model to train:
1. Random Forest
2. Logistic Regression
Enter 1 or 2:

After training, the script prints evaluation metrics for the selected model, including accuracy, precision, recall, F1-score, and the confusion matrix.
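As a rough sketch of what the tuning step for the Logistic Regression option might look like, the example below runs GridSearchCV over a small grid. The synthetic data, the pipeline layout, the parameter grid, and the choice of F1 as the scoring metric are all assumptions made for illustration; the grid used by main.py may differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real data (about 5% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is refit on each training fold.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Hypothetical grid: only the regularization strength C is searched.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",  # accuracy would be dominated by the majority class
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Stratified folds keep the fraud ratio roughly constant across splits, which matters when positives are rare.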

Adjusting Model Sensitivity

To make the model more sensitive to fraud (higher recall, so more fraud cases are caught), you can lower the decision threshold. By default, scikit-learn classifiers predict the positive class when its estimated probability exceeds 0.5. Lowering the threshold flags more transactions as fraud, which typically raises recall at the cost of more false positives.

# Example of evaluating the model with a threshold of 0.4
evaluate_model_with_threshold(model, X_test, y_test, threshold=0.4)
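The implementation of evaluate_model_with_threshold is not shown here, so the following is a hypothetical version that matches the call above. It assumes the model exposes predict_proba and that class 1 is fraud; the returned dictionary and the synthetic data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_model_with_threshold(model, X_test, y_test, threshold=0.5):
    """Hypothetical helper: score predictions made at a custom threshold."""
    # Column 1 of predict_proba is the probability of the positive class.
    fraud_probability = model.predict_proba(X_test)[:, 1]
    y_pred = (fraud_probability >= threshold).astype(int)
    return {
        "threshold": threshold,
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
    }


# Synthetic, imbalanced stand-in for the real data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

default_result = evaluate_model_with_threshold(model, X_test, y_test, threshold=0.5)
sensitive_result = evaluate_model_with_threshold(model, X_test, y_test, threshold=0.4)
print(default_result)
print(sensitive_result)
```

Every transaction flagged at 0.5 is also flagged at 0.4, so recall at the lower threshold can never be smaller; precision usually moves the other way.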

Handling Class Imbalance

Class Weighting: You can train the models with class_weight='balanced' to handle class imbalance automatically.
Oversampling with SMOTE:

To resample the training data using SMOTE, use the following code before training the model:

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
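For the class-weighting option, it can help to see what class_weight='balanced' actually computes. The sketch below uses scikit-learn's compute_class_weight on made-up labels; the 98-to-2 split is an assumption chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 98 legitimate transactions (0) and 2 frauds (1).
y_train = np.array([0] * 98 + [1] * 2)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
# "balanced" means n_samples / (n_classes * count_per_class).
manual = len(y_train) / (2 * np.bincount(y_train))

print(dict(zip([0, 1], weights)))
```

Here each fraud sample is weighted 25.0 and each legitimate sample about 0.51, so both classes contribute equally to the loss in total. With either approach, only the training split should be reweighted or resampled; the test set should keep its original class ratio so that the reported metrics reflect real conditions.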

Confusion Matrix

The confusion matrix will be plotted after the model evaluation. It shows the true positive, false positive, true negative, and false negative counts, helping you visualize the model’s performance on fraud detection.

The four cells are:

True Negatives: Correctly predicted non-fraud cases.
False Positives: Non-fraud cases incorrectly flagged as fraud.
False Negatives: Missed fraud cases.
True Positives: Correctly predicted fraud cases.
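The four counts can be read directly from scikit-learn's confusion_matrix. The labels below are made up for illustration, with 1 standing for fraud.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = fraud, 0 = non-fraud.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels [0, 1], rows are actual classes and columns are predictions,
# so ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")  # prints "TN=6 FP=1 FN=1 TP=2"
```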

Models

Logistic Regression: A linear model that is simple but effective for fraud detection when well-tuned.
Random Forest: A non-linear ensemble model that typically offers more flexibility and better performance for detecting complex patterns.

Both models are saved in the models/ directory after training:

models/logistic_regression_model.pkl
models/random_forest_model.pkl
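A saved model can be reloaded for later scoring without retraining. The sketch below assumes the .pkl files are written with joblib, which is listed in the prerequisites; that is an assumption, not something confirmed here. A temporary directory and a small synthetic model stand in for models/ and the real classifier.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# A temporary directory stands in for models/ so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "random_forest_model.pkl"
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match:", same_predictions)
```

Pickle-based files can execute arbitrary code when loaded, so only load model files from sources you trust.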

Evaluation Metrics

The models are evaluated using the following metrics:

Precision: Measures how many of the predicted frauds are actually fraud.
Recall: Measures how many of the actual frauds were detected by the model.
F1-score: Harmonic mean of precision and recall, providing a single number that balances the two.
Confusion Matrix: Visualizes the performance by showing true positives, false positives, true negatives, and false negatives.
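A small worked example shows why these metrics matter more than accuracy on imbalanced data. The labels are made up: 95 legitimate transactions and 5 frauds.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 95 legitimate transactions (0) and 5 frauds (1).
y_true = [0] * 95 + [1] * 5

# A "model" that never flags fraud still scores 95% accuracy.
always_legit = [0] * 100
baseline_accuracy = accuracy_score(y_true, always_legit)
baseline_recall = recall_score(y_true, always_legit, zero_division=0)

# A model that catches 4 of 5 frauds at the cost of 4 false alarms.
y_pred = [0] * 91 + [1] * 4 + [1] * 4 + [0] * 1
precision = precision_score(y_true, y_pred)  # 4 / (4 + 4)
recall = recall_score(y_true, y_pred)        # 4 / (4 + 1)
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R)

print(baseline_accuracy, baseline_recall)
print(precision, recall, round(f1, 3))
```

The always-legitimate baseline reaches 0.95 accuracy with zero recall, while the second model has precision 0.5, recall 0.8, and an F1-score of about 0.615.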

License

This project is licensed under the MIT License - see the LICENSE file for details.
