This project aims to scrape data about past and present Premier League players from the Premier League website, and further augments it by scraping more information about the players from Wikipedia and WikiData.
The files in this project have been organized as follows. All Python scripts are in the `scripts` folder. Within the `scripts` folder, there are 2 subfolders:
- `notebooks`, which contains Jupyter notebooks that were used for experimenting and data visualization. They aren't needed for running this project, but have been included for completeness.
- `data`, which contains additional data used by the scripts, such as a set of stopwords used to clean the parsed Wikipedia articles.
The `plots` folder contains the plots output by the visualization notebooks.
The `outputs` folder contains output files from most of the scripts. Due to the serial nature of our pipeline, the output of one script is generally fed into the next script as an input, so most of the inputs are also taken from this directory.
- `name_scanner.py`: Scans all URLs from 0 to `ID_MAX` (set to 100,000 by default) and checks if there is a valid player page there. If there is a page, it saves the `<ID>: <Name>` pair of that player to `outputs/player_names_ids.txt`. This file contains 6587 such pairs (a sketch of this scan loop is shown after this list).
- `pl_scrape.py`: Takes the ID-Name pairs from `outputs/player_names_ids.txt` and scrapes each player's Overview and Stats pages for various attributes. Once it has scraped all the pages and organized their contents, it creates a pandas DataFrame from the data and writes it to `outputs/uncleaned_pl_scrape.csv`. We've chosen CSV as the intermediate format here since this scrape has a lot of numerical data, and it is easier to understand this data in a tabular format, which is easy to generate from a CSV.
- `clean_and_stats.py`: Takes `outputs/uncleaned_pl_scrape.csv` and `outputs/uncleaned_wiki_scrape.json`, performs quantitative analysis of the data obtained, and then cleans it. The plots generated can be saved if required. This script outputs one new JSON file for each of the input data files: `outputs/cleaned_pl_scrape.json` and `outputs/cleaned_wiki_scrape.json`.
- `wikidata_extact.sh`: Chooses attributes based on a sample of 200 players, using their Wikipedia entries to determine how relevant each attribute is. Following that, the selected attributes are obtained for all players.
- `merge.py`: Takes the two cleaned JSONs, `outputs/cleaned_pl_scrape.json` and `outputs/cleaned_wiki_scrape.json`, and merges their attributes player-wise to generate a dictionary with all the cumulative data from both scrapes. It then writes this to a file, generating our desired output: `outputs/final.json` (see the merge sketch after this list).
- `notebooks/`: As mentioned earlier, this directory contains notebooks that were used for experimenting, intermediate steps, and data visualization. The two `clean_*.ipynb` files have been merged into the `clean_and_stats.py` script, but `analysis.ipynb` is a standalone notebook that was used for the data analysis and visualization in the Report.
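For illustration, here is a minimal sketch of the scan loop in `name_scanner.py`. The URL pattern and the name-extraction helper are assumptions made for the sake of the example, not the script's exact logic:

```python
import re
import requests

ID_MAX = 100_000  # default upper bound, as described above

def extract_name(html):
    # Hypothetical helper: pull the player's name out of the <title> tag.
    match = re.search(r"<title>([^|<]+)", html)
    return match.group(1).strip() if match else None

with open("outputs/player_names_ids.txt", "w") as out:
    for player_id in range(ID_MAX):
        # Hypothetical URL pattern for a player's overview page.
        url = f"https://www.premierleague.com/players/{player_id}/player/overview"
        resp = requests.get(url)
        if resp.status_code != 200:
            continue  # no valid player page at this ID
        name = extract_name(resp.text)
        if name:
            out.write(f"{player_id}: {name}\n")
```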
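Similarly, the player-wise merge in `merge.py` boils down to combining two dictionaries per player. A rough sketch, assuming both cleaned files are JSON objects keyed by the same player identifier (the actual key scheme may differ):

```python
import json

with open("outputs/cleaned_pl_scrape.json") as f:
    pl_data = json.load(f)
with open("outputs/cleaned_wiki_scrape.json") as f:
    wiki_data = json.load(f)

merged = {}
for player in pl_data.keys() | wiki_data.keys():
    # Collect the attributes from both scrapes for this player.
    merged[player] = {**pl_data.get(player, {}), **wiki_data.get(player, {})}

with open("outputs/final.json", "w") as f:
    json.dump(merged, f, indent=2)
```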
The explanation of these files has already been provided in the above section, but here is a quick overview:
- `player_names_ids.txt`: Stores `<ID>: <Name>` pairs for all the players to be scraped (see the parsing sketch after this list).
- `uncleaned_pl_scrape.csv`: Stores the raw data obtained by scraping the Premier League website.
- `uncleaned_wiki_scrape.json`: Stores the raw data obtained by scraping Wikipedia and WikiData sites.
- `cleaned_pl_scrape.json`: Stores the cleaned data scraped from the Premier League website.
- `data_200.json`: Stores tokenised Wikipedia articles for 200 sample players.
- `wikidata.json`: Stores WikiData attributes for all players.
- `wikipedia_links.json`: Stores Wikipedia article links for all players.
- `wikidata_map.json`: Stores mappings of Wikipedia articles to WikiData IDs for all players.
- `attr.json`: Stores the list of relevant WikiData attributes.
- `cleaned_wiki_scrape.json`: Stores the cleaned data scraped from Wikipedia and WikiData.
- `final.json`: Stores the cumulative information about all the players from both scrapes. This is the final output file.
The `plots` folder contains plots generated for the analysis of the data:
- `PL_Attribute_Dist_*.png`: Histograms showing the distribution of values across players for each numerical attribute obtained from the Premier League website. The figure has been split into 2 parts for better formatting in the Report.
- `PL_Attribute_NaN_Percentage.png`: Bar chart showing the percentage of entries that are NaN for each attribute in the data scraped from the Premier League website (see the sketch after this list).
- `Wiki_Attribute_NaN_Percentage.png`: Bar chart showing the percentage of entries that are NaN for each attribute in the data scraped from Wikipedia and WikiData.
- `Attribute_Corr_Matrix.png`: A heatmap representing the correlation matrix between all the numerical attributes in the total data collected.
- You can run `pip install -r requirements.txt` to install all the pip dependencies.
- After that, you would need to install the BLINK library from here. This is required to fetch the data from Wikipedia.
Since all of our scripts work in a sequential manner, with the output of one feeding into the input of another, we have written a simple bash script that executes the required Python scripts in order. To run the entire pipeline, just do:
```bash
chmod +x ./run.sh
./run.sh
```
This will run all the scripts and copy `final.json` to the root directory of the project.
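For reference, `run.sh` is essentially just the scripts above chained together. A minimal sketch of what such a driver looks like; the exact contents and ordering in the project's `run.sh` may differ:

```bash
#!/bin/bash
set -e        # abort the pipeline if any step fails
cd scripts    # the scripts expect to be run from scripts/ (see below)

python3 name_scanner.py
python3 pl_scrape.py
./wikidata_extact.sh
python3 clean_and_stats.py
python3 merge.py

cd ..
cp outputs/final.json .   # assumption about where outputs/ lives
```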
Note: The entire pipeline may take a few hours to run, as the scripts need to make a lot of web requests to fetch all the required data.
Note: `run.sh` won't generate the plot files, as those files have been manually saved from the notebooks. It will, however, show the plots in a window, so you can choose to save them manually if you like.
You can run individual scripts as well if you want; however, it is important that you run them from the `scripts/` directory, as the paths for the data files have been written relative to the `scripts/` directory. For example:
❌ Incorrect:
```bash
python3 scripts/pl_scrape.py
```
✔️ Correct:
```bash
cd scripts
python3 pl_scrape.py
```
`run.sh` already takes this into account, so it can be run directly from the root directory of the project.