A Probabilistic Approach to Diabetes Risk Assessment Using Bayesian Networks

This project applies Bayesian Networks (BNs) to diabetes risk assessment. Several network structures are developed and compared to find the most effective one for prediction: Naïve Bayes, structures learned by Hill Climbing (with BIC, K2, and BDeu scores) and by Simulated Annealing, and a domain knowledge-based model.

Dataset

The dataset used in this study can be accessed here: Diabetes Health Indicators Dataset.

Methodology

Feature Selection: SHAP values from a trained XGBoost classifier are used to identify important features. Features are also ranked by Mutual Information (MI) as a second, model-independent measure of importance.
Structure Learning: Bayesian Networks are constructed using both data-driven and domain knowledge-based approaches.
Parameter Estimation: Maximum Likelihood Estimation (MLE) is applied to estimate Conditional Probability Distributions (CPDs).
Evaluation: Models are compared using the AUC-ROC metric to identify the top performers.
Inference: The top-performing Bayesian Networks are employed for probabilistic reasoning, enabling predictions and risk assessments based on given evidence, with Variable Elimination used as the inference method.
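
The parameter estimation, evaluation, and inference steps above can be sketched without any dependencies. This is a minimal illustration, not the project's code: the three-node structure, the variable names (HighBP, HighBMI, Diabetes), and the 20 records are hypothetical. The posterior is computed by brute-force enumeration, which gives the same answer as Variable Elimination on a network this small but lacks its efficiency, and the AUC is computed on the training records only to keep the example short.

```python
from collections import Counter
from itertools import product

# Hypothetical toy structure (child -> parents); NOT the project's network.
STRUCTURE = {"HighBP": [], "HighBMI": [], "Diabetes": ["HighBP", "HighBMI"]}
STATES = (0, 1)

def fit_cpds(records, structure):
    """MLE: P(child | parents) = count(child, parents) / count(parents)."""
    cpds = {}
    for child, parents in structure.items():
        joint = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
        marginal = Counter(tuple(r[p] for p in parents) for r in records)
        cpds[child] = {
            (pa, s): joint[(pa, s)] / marginal[pa]
            for pa in marginal for s in STATES
        }
    return cpds

def joint_probability(assignment, cpds, structure):
    """Chain rule of the network: product of each node's CPD entry."""
    prob = 1.0
    for child, parents in structure.items():
        pa = tuple(assignment[p] for p in parents)
        prob *= cpds[child].get((pa, assignment[child]), 0.0)
    return prob

def query(target, evidence, cpds, structure):
    """P(target | evidence) by summing the joint over unobserved variables."""
    hidden = [v for v in structure if v != target and v not in evidence]
    scores = {}
    for s in STATES:
        total = 0.0
        for values in product(STATES, repeat=len(hidden)):
            assignment = {**evidence, target: s, **dict(zip(hidden, values))}
            total += joint_probability(assignment, cpds, structure)
        scores[s] = total
    norm = sum(scores.values())
    return {s: p / norm for s, p in scores.items()}

def auc_roc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up records, for illustration only.
records = (
    [{"HighBP": 1, "HighBMI": 1, "Diabetes": 1}] * 6
    + [{"HighBP": 1, "HighBMI": 1, "Diabetes": 0}] * 2
    + [{"HighBP": 1, "HighBMI": 0, "Diabetes": 1}] * 2
    + [{"HighBP": 1, "HighBMI": 0, "Diabetes": 0}] * 2
    + [{"HighBP": 0, "HighBMI": 1, "Diabetes": 1}] * 1
    + [{"HighBP": 0, "HighBMI": 1, "Diabetes": 0}] * 3
    + [{"HighBP": 0, "HighBMI": 0, "Diabetes": 0}] * 4
)

cpds = fit_cpds(records, STRUCTURE)

# Risk given partial evidence; HighBMI is summed out.
posterior = query("Diabetes", {"HighBP": 1}, cpds, STRUCTURE)

# Score every record with its predicted risk, then compute AUC-ROC.
labels = [r["Diabetes"] for r in records]
scores = [
    query("Diabetes", {"HighBP": r["HighBP"], "HighBMI": r["HighBMI"]}, cpds, STRUCTURE)[1]
    for r in records
]
auc = auc_roc(labels, scores)
print(posterior, auc)
```

In a real evaluation the AUC would be computed on held-out data, and a library such as pgmpy would replace the enumeration with Variable Elimination so that inference scales to larger networks.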

Proposed Bayesian Network Structure

Top-Performing Bayesian Network Structure Learned via AI Techniques

Hill Climbing Search with K2 Scoring Method
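
To give a feel for what hill climbing optimizes here, the following is a minimal sketch of the textbook K2 (Cooper-Herskovits) score in log form. Hill climbing repeatedly adds, removes, or reverses a single edge and keeps the change that raises the score most. The variable names X and Y and the records are hypothetical, and this sketch is not claimed to match pgmpy's implementation detail for detail.

```python
from collections import Counter
from math import lgamma

def k2_local_score(records, child, parents, n_states=2):
    """Log K2 score of one node given its parents.

    For each observed parent configuration j with counts N_ijk:
      log[(r - 1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
    written with lgamma, since lgamma(n + 1) == log(n!).
    Unobserved configurations contribute 0 and can be skipped.
    """
    counts = Counter((tuple(r[p] for p in parents), r[child]) for r in records)
    parent_totals = Counter(tuple(r[p] for p in parents) for r in records)
    score = 0.0
    for pa, n_ij in parent_totals.items():
        score += lgamma(n_states) - lgamma(n_ij + n_states)
        score += sum(lgamma(counts[(pa, s)] + 1) for s in range(n_states))
    return score

def k2_score(records, structure):
    """The score decomposes into a sum of per-node local scores."""
    return sum(k2_local_score(records, c, ps) for c, ps in structure.items())

# Made-up data in which Y always equals X.
records = [{"X": 0, "Y": 0}] * 10 + [{"X": 1, "Y": 1}] * 10

without_edge = k2_score(records, {"X": [], "Y": []})
with_edge = k2_score(records, {"X": [], "Y": ["X"]})

# A hill-climbing step would accept the edge X -> Y because it raises the score.
print(with_edge > without_edge)
```

Because the score is a sum over nodes, a single edge change only requires rescoring the affected child, which is what makes local search over structures practical.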

Repository Structure

Folders

feature selection
- shap_and_mi.py: Trains an XGBoost classifier on the dataset and analyzes feature importance using SHAP values with TreeExplainer. It then ranks features with scikit-learn's mutual information and finally examines the correlation between the top 10 features identified by each method.
bayesian network modeling
- ai_based_structure_learning.py: Data-driven Bayesian Network construction using various AI techniques.
- domain_knowledge_driven.py: Manually designs the Bayesian Network structure using insights from the research literature.
- Both Bayesian modeling approaches use the pgmpy library.
evaluation & inference
- evaluation_and_inference.py: Performs parameter estimation using MLE, evaluates the models using the AUC-ROC metric, and then performs probabilistic inference using Variable Elimination.
models
- Contains saved checkpoints (.pkl) of the learned models required for the diabetes_risk_analysis_(lightweight).ipynb notebook.
dataset
- diabetes_indicators.csv: Includes the cleaned dataset before feature selection.
- df_selected.csv: Provides the dataset containing only the selected features.
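
The mutual information ranking described for shap_and_mi.py can be illustrated with a small dependency-free sketch. The repository relies on scikit-learn for this step. The version below is a plain plug-in estimator for discrete columns, and the feature names and values are made up for illustration.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X; Y) in nats for two discrete columns, estimated from counts."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x, y) * log(p(x, y) / (p(x) * p(y))), with probabilities as count ratios
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Hypothetical binary target and three hypothetical features.
target = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "Noise":      [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the target
    "Partial":    [1, 1, 1, 0, 0, 0, 0, 1],  # agrees with the target 6 times out of 8
    "PerfectCue": [1, 1, 1, 1, 0, 0, 0, 0],  # identical to the target
}

mi = {name: mutual_information(col, target) for name, col in features.items()}
ranking = sorted(mi, key=mi.get, reverse=True)
print(ranking)
```

Unlike SHAP values, this ranking does not depend on a trained model, which is why comparing the two top-10 lists is informative.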

Usage

Install Dependencies

Clone the repository and install the required packages:

git clone https://github.com/faezesarlakifar/Unibo-FAIKR-M3-project
cd "Unibo-FAIKR-M3-project"
pip install -r requirements.txt

Run Inference

cd "evaluation & inference"
python inference.py

Notebooks

You can directly access the main notebook with all the experiment results here: diabetes_risk_analysis_(lightweight).ipynb
Diabetes_Risk_Prediction.ipynb is the full notebook containing every evaluation step (each Variable Elimination query). Because that file is large, the lightweight version is provided for easier exploration.

Cool insights

SHAP Analysis Results

Top features are selected based on this result.

SHAP dependence plot for Education vs GenHealth

This plot suggests that individuals with higher education levels tend to have better general health. (Education ranges from 1, the lowest level, to 6, the highest, while GenHealth is in reverse order: 1 indicates the best health and 5 the poorest.) More insightful plots are available in the feature engineering section of the main notebook. 🙂
