The goal of this project is to apply what we have learned in the Data Engineering Zoomcamp to build an end-to-end data pipeline. It was done as the capstone project of the 2024 Zoomcamp cohort.
The dataset used for this project comes from the COVID Tracking Project, which collected and published the most complete testing data available for US states and territories. As the name suggests, it tracked COVID statistics at both the state and national level across the US.
The data definitions are provided in detail here: https://covidtracking.com/about-data/data-definitions. They are also included in the metadata key of API responses. The categories offered in the dataset are:
- Cases
- PCR Tests
- Antibody Tests
- Antigen Tests
- Hospitalization
- Outcomes
- State Metadata (Note: FIPS codes are numbers which uniquely identify geographic areas.)
These are also explained in dbt SQL files.
The data is available both via an API and in CSV format. The API is available in two versions, and a mapping of fields between the two versions is provided.
In this project, the data is collected via the API; all requests were built and tested in Postman and Python.
Notice: The COVID Tracking Project has ended all data collection as of March 7, 2021. The existing API will continue to work until May 2021, but will only include data up to March 7, 2021.
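As a quick illustration (not the exact code used in the pipelines), a request like the ones tested in Postman can also be made from Python; the v1 national daily endpoint below is assumed from the public API documentation:

```python
# Minimal sketch: query the (now frozen) COVID Tracking Project API from Python.
# The v1 national daily endpoint is an assumption based on the public docs;
# the pipelines in this repo work with the state-level endpoints instead.
import requests

url = "https://api.covidtracking.com/v1/us/daily.json"
response = requests.get(url, timeout=30)
response.raise_for_status()

daily = response.json()  # a list of daily national records
print(f"Fetched {len(daily)} daily US records")
```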
The data pipeline has multiple steps and it runs in batch mode.
- The data is fetched from the API. (Python is used here.)
- It is cleansed (at a basic level) and transferred to the data lake (Postgres is used as the data lake); a sketch of this step is shown after this list.
- The data is loaded from the data lake to the data warehouse. (GCP is used as the cloud solution and BigQuery is the data warehousing tool.)
All the above steps are orchestrated as a workflow in Mage; Mage, Postgres and pgAdmin run in Docker.
- The transformations and data modeling are done with dbt via dbt Cloud, and the transformed data is loaded back into BigQuery.
- The data is visualized in Looker Studio.
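As a rough illustration of the basic cleansing step mentioned above, a Mage transformer block could look like the sketch below; the column names and cleaning rules are illustrative assumptions, not the exact logic in this repo's `transformers` folder:

```python
# A rough sketch of a Mage transformer block for the basic cleansing step.
# Column names and cleaning rules here are illustrative assumptions, not the
# exact logic in the repo's transformers folder.
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def clean_daily_results(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Parse dates and drop rows that have no date or state code.
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    df = df.dropna(subset=['date', 'state'])
    # Fill missing numeric counts with 0 so downstream aggregations are stable.
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    return df
```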
After cloning the repo, change the directory to the `mage` folder:

```bash
cd mage
```
After creating the GCP account here and creating a service account in the IAM & Admin section, copy the JSON key file for the GCP service account into the folder above. (GCP account setup is also explained in this repo.)
Using the `Dockerfile` and `docker-compose.yml`, the containers for the 3 images (Mage, Postgres and pgAdmin) are built and started:

```bash
docker compose up -d
```
The volume for pgAdmin can also be mounted in the `docker-compose.yml` file, like this:

```yaml
volumes:
  - ./data_pgadmin:/var/lib/pgadmin
```
After running the containers, the Mage service is available on port `6789` and pgAdmin on port `8080`. In pgAdmin, the server can be created based on the settings in the `docker-compose.yml` file and the config which can be created in the `.env` file.
The pipelines can be found in the `mage/magic-zoomcamp/pipelines` folder, and the names of the blocks are listed in the `metadata.yaml` file. Also, the `.py` files which form the pipelines in Mage are located in the `data_loaders`, `transformers` and `data_exporters` folders.
First, the `covid_api_metadata_to_postgres` pipeline fetches the state metadata (the `states` table in Postgres) from the API and loads it into Postgres.
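A simplified sketch of what this data loader block does is shown below; the v1 `states/info.json` endpoint and the function name are assumptions, and the actual block lives in the `data_loaders` folder:

```python
# Sketch of a Mage data loader fetching the state metadata from the API.
# The endpoint is assumed from the public v1 API docs; see the data_loaders
# folder for the real block.
import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_states_metadata(*args, **kwargs) -> pd.DataFrame:
    url = 'https://api.covidtracking.com/v1/states/info.json'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # The endpoint returns one record per state/territory (name, FIPS, etc.).
    return pd.DataFrame(response.json())
```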
Then, the `covid_api_data_to_postgres` pipeline fetches the COVID data (the `dailytestresults` table in Postgres) from the API and loads it into Postgres. Note: the data for each state is extracted separately and the results are concatenated, as sketched below.
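The per-state extraction and concatenation can be sketched roughly as follows (the endpoint pattern and the abbreviated state list are illustrative assumptions):

```python
# Sketch of fetching daily results state by state and concatenating them.
# The v1 endpoint pattern and the short state list are assumptions; the actual
# loader iterates over all states and territories.
import pandas as pd
import requests

STATES = ['ca', 'ny', 'tx']  # illustrative subset


def fetch_state_daily(state: str) -> pd.DataFrame:
    url = f'https://api.covidtracking.com/v1/states/{state}/daily.json'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


daily_test_results = pd.concat(
    [fetch_state_daily(s) for s in STATES],
    ignore_index=True,
)
```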
Finally, the `load_data_from_postgres_to_bigquery` pipeline transfers the data from Postgres (the data lake) into BigQuery (the data warehouse).
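The export side of this pipeline can be sketched with Mage's standard BigQuery exporter template; the table ID and config profile below are placeholders, not the values used in this repo:

```python
# Sketch of a Mage data exporter writing a DataFrame to BigQuery, based on
# Mage's standard template. The table ID is a placeholder.
from os import path

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    table_id = 'your-gcp-project.your_dataset.dailytestresults'  # placeholder
    config_path = path.join(get_repo_path(), 'io_config.yaml')

    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # or 'append', depending on the load strategy
    )
```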
All the steps, as depicted below, exist in the `dbt_covid/models` folder.
The whole model and the dependent `.sql` files can be built and run with `dbt build` and `dbt run`, respectively.
After the transformations in dbt, the `fact_test_result` table is connected to Looker Studio via the Google BigQuery connector, and the dashboard is designed around various metrics.