This project is for a competition from drivendata.org.
Using data on water pumps in communities across Tanzania, can we predict which pumps are functional, which need repairs, and which don't work at all?
The data: Provided by the drivendata.org competition. It contains a mix of continuous and categorical variables describing what kind of pump is operating, when it was installed, how it was managed, etc.
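As a rough illustration of what handling a mix of continuous and categorical variables can involve, here is a minimal sketch that fills missing numeric values with the column median and one-hot encodes categorical values. It is not taken from preprocessing.py; the function name `preprocess` and the columns `pump_age` and `pump_type` are hypothetical and chosen only for the example.

```python
from statistics import median


def preprocess(rows, numeric_cols, categorical_cols):
    """Median-impute numeric columns and one-hot encode categorical ones."""
    # Median of the observed (non-missing) values for each numeric column.
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in numeric_cols
    }
    # Sorted set of categories seen in each categorical column.
    levels = {col: sorted({r[col] for r in rows}) for col in categorical_cols}

    encoded_rows = []
    for r in rows:
        features = {}
        for col in numeric_cols:
            # Keep the observed value, otherwise fall back to the median.
            features[col] = r[col] if r[col] is not None else medians[col]
        for col in categorical_cols:
            # One indicator feature per category level.
            for level in levels[col]:
                features[f"{col}={level}"] = 1 if r[col] == level else 0
        encoded_rows.append(features)
    return encoded_rows


# Hypothetical rows; the column names are illustrative, not the real schema.
rows = [
    {"pump_age": 10, "pump_type": "hand"},
    {"pump_age": None, "pump_type": "motor"},
    {"pump_age": 30, "pump_type": "hand"},
]
encoded = preprocess(rows, ["pump_age"], ["pump_type"])
```

In this sketch the second row's missing `pump_age` is replaced by the median of the other two values, and `pump_type` becomes two indicator features.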
preprocessing.py contains functions for preprocessing data.
WaterPumps.ipynb goes through exploratory analysis and data preprocessing.
WaterPumpsII.ipynb has model training and evaluation.
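For a three-class problem like this one, evaluation can be as simple as comparing predicted labels with true labels. The sketch below is an illustration only, not the code from WaterPumpsII.ipynb: it assumes plain accuracy is the quantity of interest, and the helper names and label strings are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Illustrative labels only; they stand in for the three pump states.
y_true = ["functional", "needs repair", "non functional", "functional"]
y_pred = ["functional", "functional", "non functional", "functional"]

score = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
```

The pair counts show which classes get confused with each other, which a single accuracy number hides when one class is much rarer than the others.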
For a full description of the project, see the blog post here.