Details and write-up about this project are available here.
- Files.xlsx
- Lists the matched indexes of the NHANES files across the years (sorted alphabetically), indicates which features are used for analysis, and records how features were renamed across the years for consistency.
- run_notebooks.py
- Script to batch run Jupyter Notebooks
- The notebook output from running this script is available in the /output folder within each respective folder
- /NHANES-Downloader
- Scripts that download NHANES files and convert them to CSV files
- /Data Cleaning
- Notebooks that clean, rename, and recategorize the NHANES data files, remove missing values, and upload the results to a local NoSQL database.
- /Data Upload
- Notebooks that merge appropriate files for analysis, recategorize labels, one-hot encode features, and upload data to a local NoSQL database.
- /Data Analysis
- Notebooks for exploratory data analysis, fitting random forest and XGBoost models to the data, evaluating model performance, and identifying risk factors for hospital utilization and major diseases.
- /Prevalence
- Notebooks to generate data (.csv) files for prevalence plots
- R Notebooks to generate prevalence plots
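As an illustration of the batch-run step, here is a minimal sketch of how a script like run_notebooks.py could execute notebooks and save results to an /output folder next to each one. This is not the repository's actual implementation: the function names `expand_patterns`, `build_command`, and `run_all`, the 1800-second per-cell timeout, and the choice to shell out to `jupyter nbconvert` are all assumptions made for the example.

```python
import glob
import os
import subprocess
import sys


def expand_patterns(patterns):
    """Expand glob patterns into a sorted, de-duplicated list of .ipynb paths."""
    paths = set()
    for pattern in patterns:
        paths.update(glob.glob(pattern))
    return sorted(p for p in paths if p.endswith(".ipynb"))


def build_command(notebook, timeout=1800):
    """Build an nbconvert command that executes one notebook.

    Assumption: results go to an 'output' folder beside the notebook,
    mirroring the /output folders described in this README.
    """
    output_dir = os.path.join(os.path.dirname(notebook), "output")
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        f"--ExecutePreprocessor.timeout={timeout}",  # seconds allowed per cell
        "--output-dir", output_dir,
        notebook,
    ]


def run_all(patterns):
    """Execute every matching notebook in order, stopping at the first failure."""
    for notebook in expand_patterns(patterns):
        print(f"Running {notebook}")
        subprocess.run(build_command(notebook), check=True)


if __name__ == "__main__":
    run_all(sys.argv[1:])
```

Saved under a hypothetical name such as `run_notebooks_sketch.py`, it would be invoked the same way as the real script, with one or more notebook paths or glob patterns as arguments. Passing `check=True` halts the batch on the first failing notebook, which avoids uploading or analyzing data produced by a broken cleaning step.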
- Download files from NHANES using NHANES-Downloader scripts
- Navigate to the NHANES-Downloader folder and follow the README.
- In a terminal, go to the NHANES-Downloader folder and run (macOS):
$ ./get_data.py
$ ./raw_to_csv.py
- Run Jupyter Notebooks to clean and upload data:
- Navigate to the root folder and, in a terminal, run:
$ python run_notebooks.py ./Data\ Cleaning/*.ipynb ./Data\ Upload/*.ipynb
- Run Jupyter Notebooks to analyze data (this may take a while, about 15 minutes):
- Navigate to the root folder and, in a terminal, run:
$ python run_notebooks.py ./Data\ Analysis/*.ipynb
- Run Jupyter Notebooks to generate data for prevalence plots:
- Navigate to the root folder and, in a terminal, run:
$ python run_notebooks.py ./Prevalence/*.ipynb
- Run R Markdown files in Q1 and Q2 folders individually to generate prevalence plots:
- Make sure to set the appropriate working directory in each file before running all chunks.
- Q1 - Hospital Utilization.Rmd
- Q2 - 1. Heart Disease.Rmd
- Q2 - 2. Cancer.Rmd
- Q2 - 3. Chronic Lung Disease.Rmd
- Q2 - 4. Diabetes.Rmd