This project implements a machine learning-based fraud detection system using the Credit Card Fraud Detection Dataset. The system uses logistic regression and random forest classifiers to identify fraudulent transactions. The models are optimized using grid search with cross-validation and evaluated using various performance metrics.
- Features
- Installation
- Usage
- Models
- Evaluation Metrics
- Adjusting Model Sensitivity
- Handling Class Imbalance
- Confusion Matrix
- License
- Two models: Logistic Regression and Random Forest classifiers.
- Hyperparameter tuning: Uses GridSearchCV to optimize model parameters.
- Feature Engineering: Adds new features based on transaction amounts and time-based features.
- Model Evaluation: Precision, recall, F1-score, and confusion matrix to assess model performance.
- Class Imbalance Handling: Handles imbalanced data with techniques such as class_weight='balanced' and SMOTE oversampling.
- Interactive Model Selection: Users can select the model (Logistic Regression or Random Forest) when running the script.
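The feature engineering step above can be illustrated with a minimal sketch. The two derived columns here (`log_amount` and `hour`) are hypothetical examples, not necessarily the features the project creates; the sketch only assumes the dataset's `Time` column (seconds elapsed since the first transaction) and `Amount` column.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for creditcard.csv; only the Time and Amount columns are used.
df = pd.DataFrame(
    {
        "Time": [0.0, 3600.0, 90000.0],   # seconds since the first transaction
        "Amount": [0.0, 9.0, 99.0],
    }
)

# Hypothetical amount-based feature: log1p compresses the long right tail
# of transaction amounts and is defined for zero-valued amounts.
df["log_amount"] = np.log1p(df["Amount"])

# Hypothetical time-based feature: hour of day, wrapping around every 24 hours.
df["hour"] = ((df["Time"] // 3600) % 24).astype(int)

print(df)
```

Tree-based models such as Random Forest are largely insensitive to monotonic transforms like the logarithm, so a feature of this kind mainly helps the logistic regression model.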
- Python 3.8+
- Libraries:
  - pandas
  - numpy
  - scikit-learn
  - matplotlib
  - joblib
  - imblearn (for SMOTE, if using oversampling; installed from PyPI as imbalanced-learn)
  - xgboost (optional)
- Clone the repository:
git clone git@github.com:fnyamweya/fraud-detection.git
cd fraud-detection
- Install dependencies: It's recommended to use a virtual environment.
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
- Download the dataset: You can download the dataset from Kaggle's Credit Card Fraud Detection dataset, or if you have KaggleHub set up, use:
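import kagglehub

# dataset_download returns the local path of the downloaded dataset files
path = kagglehub.dataset_download("mlg-ulb/creditcardfraud")
print("Path to dataset files:", path)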
- Place the dataset in the data/ folder:
mv path/to/downloaded/creditcard.csv data/
You can train and evaluate either the Logistic Regression or Random Forest model by running the script:
python main.py
Once the script starts, it will prompt you to choose which model to train:
Choose a model to train:
1. Random Forest
2. Logistic Regression
Enter 1 or 2:
After training, the script prints the evaluation metrics: accuracy, precision, recall, F1-score, and the confusion matrix.
To make the model more sensitive to fraud (higher recall, usually at the cost of more false positives), you can manually lower the decision threshold. By default, logistic regression uses a 0.5 threshold: a transaction is flagged as fraud when its predicted fraud probability is at least 0.5. A lower threshold flags more transactions.
# Example of evaluating the model with a threshold of 0.4
evaluate_model_with_threshold(model, X_test, y_test, threshold=0.4)
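One way such a helper could work is sketched below. This is an illustrative guess at the behavior of `evaluate_model_with_threshold`, not the project's actual implementation: it assumes the model exposes `predict_proba`, that fraud is labeled 1, and that returning a dictionary of scores is acceptable. The synthetic data is a stand-in for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def evaluate_model_with_threshold(model, X_test, y_test, threshold=0.5):
    """Score a fitted classifier at a custom decision threshold (sketch)."""
    # Column 1 holds the probability of the positive class (fraud, label 1).
    fraud_proba = model.predict_proba(X_test)[:, 1]
    y_pred = (fraud_proba >= threshold).astype(int)
    return {
        "threshold": threshold,
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic imbalanced data (about 5% positives) stands in for creditcard.csv.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

default_scores = evaluate_model_with_threshold(model, X_test, y_test, threshold=0.5)
sensitive_scores = evaluate_model_with_threshold(model, X_test, y_test, threshold=0.4)
print(default_scores)
print(sensitive_scores)
```

Lowering the threshold can only add predicted fraud cases, so recall never decreases, while precision may drop.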
- Class Weighting: You can train the models with class_weight='balanced' to handle class imbalance automatically.
- Oversampling with SMOTE: To resample the training data using SMOTE, use the following code before training the model. Resample only the training split so that the test set keeps the original class distribution.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
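For the class weighting option, the following sketch shows what class_weight='balanced' computes. The label array is made up for illustration (90 non-fraud, 10 fraud); scikit-learn weights each class by n_samples / (n_classes * class_count), so the rarer class receives the larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical training labels: 90 non-fraud (0) and 10 fraud (1).
y_train = np.array([0] * 90 + [1] * 10)

# Same weights that class_weight='balanced' applies inside the estimator.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_train)
class_weights = dict(zip([0, 1], weights))

# Class 0: 100 / (2 * 90), class 1: 100 / (2 * 10)
print(class_weights)
```

Class weighting changes the loss function rather than the data, so unlike SMOTE it adds no synthetic rows and does not increase training time through a larger dataset.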
The confusion matrix will be plotted after the model evaluation. It shows the true positive, false positive, true negative, and false negative counts, helping you visualize the model’s performance on fraud detection.
The four cells are:
- True Negatives: Correctly predicted non-fraud cases.
- False Positives: Incorrectly predicted fraud cases when they are actually non-fraud.
- False Negatives: Missed fraud cases.
- True Positives: Correctly predicted fraud cases.
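The four counts can be read directly from scikit-learn's confusion matrix, as in this small sketch. The labels and predictions are invented for illustration, with 1 meaning fraud.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (1 = fraud, 0 = non-fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# With labels=[0, 1], rows are actual classes and columns are predicted
# classes, so ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```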
- Logistic Regression: A linear model that is simple but effective for fraud detection when well-tuned.
- Random Forest: A non-linear ensemble model that typically offers more flexibility and better performance for detecting complex patterns.
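Tuning either model with GridSearchCV might look like the sketch below. The parameter grid, the F1 scoring choice, the fold count, and the synthetic data are all assumptions made for illustration; the grid used in main.py may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (about 5% positives) stands in for creditcard.csv.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical, deliberately small grid to keep the search fast.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="f1",  # accuracy is misleading when fraud cases are rare
    cv=3,          # stratified 3-fold cross-validation for classifiers
)
search.fit(X_train, y_train)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refit on the full training split with the best parameters.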
Both models are saved in the models/ directory after training:
models/logistic_regression_model.pkl
models/random_forest_model.pkl
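A saved model can be reloaded for later predictions. The sketch below assumes the .pkl files are written with joblib (it is listed among the dependencies, but the saving code is not shown here), and it uses a temporary directory and a small synthetic model so that it runs without the real files.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model standing in for the trained fraud classifier.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    # In the project this path would be models/random_forest_model.pkl.
    path = Path(tmp) / "random_forest_model.pkl"
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Load pickled models only from sources you trust, and with a scikit-learn version compatible with the one used for training.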
The models are evaluated using the following metrics:
- Precision: Measures how many of the predicted frauds are actually fraud.
- Recall: Measures how many of the actual frauds were detected by the model.
- F1-score: Harmonic mean of precision and recall, providing a single score that balances the two.
- Confusion Matrix: Visualizes the performance by showing true positives, false positives, true negatives, and false negatives.
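The metric definitions above reduce to simple arithmetic on the confusion matrix counts. The counts in this sketch are hypothetical and chosen only to make the numbers easy to follow.

```python
# Hypothetical confusion matrix counts for the fraud class.
tp = 80   # fraud correctly flagged
fp = 20   # legitimate transactions flagged as fraud
fn = 40   # fraud that was missed

# Precision: share of predicted frauds that are actually fraud.
precision = tp / (tp + fp)

# Recall: share of actual frauds that the model detected.
recall = tp / (tp + fn)

# F1-score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

True negatives do not appear in any of these formulas, which is why the three metrics stay informative even when non-fraud transactions vastly outnumber fraud.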
This project is licensed under the MIT License - see the LICENSE file for details.