This project uses the 1.88 Million US Wildfires dataset from Kaggle. Using the size, time, and location of each wildfire, we train statistical models to predict the cause of the fire. There are 13 cause classes in the dataset, but two of them (missing and unidentified) carry no usable label and are removed, leaving 11 classes. This is a multi-class classification problem that lends itself to tree-based and/or boosting approaches. After exploratory data analysis we fit a number of models, some of which give decent results. In general, the models predict fires from the more frequent classes accurately but struggle with the less common ones.
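One way to see the gap between frequent and rare classes is to look at recall per class next to each class's support, rather than at overall accuracy alone. The snippet below is a minimal sketch of that check using only the Python standard library. The helper `per_class_recall` is a hypothetical name, and the labels and predictions are made-up toy values for illustration, not results from the report.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: (support, recall)} for every label present in y_true."""
    support = Counter(y_true)
    # Count, per true label, how many predictions matched it.
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: (n, correct[label] / n) for label, n in support.items()}


# Toy data (assumption: invented for illustration only).
y_true = ["Lightning"] * 6 + ["Debris Burning"] * 3 + ["Fireworks"] * 1
y_pred = (
    ["Lightning"] * 6
    + ["Debris Burning", "Debris Burning", "Lightning"]
    + ["Lightning"]
)

recall = per_class_recall(y_true, y_pred)

# Print classes from most to least frequent.
for label, (n, r) in sorted(recall.items(), key=lambda kv: -kv[1][0]):
    print(f"{label:15s} support={n:2d} recall={r:.2f}")
```

In this toy example the overall accuracy is 80%, yet the rarest class is never predicted correctly, which is the kind of pattern a per-class breakdown makes visible.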
Since the dataset contains 1.88 million entries, model fitting took a very long time. Because the R Markdown document took more than 10 minutes to knit, we chose to write the report in a Jupyter notebook, where the code does not all need to be re-run every time the document is rendered.
The report was also exported as an HTML document, which can be previewed here: