The goal of this project is to apply what we have learned in the Data Engineering Zoomcamp to build an end-to-end data pipeline. It was done as the capstone project of the 2024 Zoomcamp cohort.
The dataset used for this project comes from the COVID Tracking Project, which collected and published the most complete testing data available for US states and territories. As the name suggests, it tracked COVID statistics at both the state and national level across the US.
The data definitions are provided in detail here: https://covidtracking.com/about-data/data-definitions. They are also included in the metadata key of API responses. The categories offered in the dataset are:
- Cases
- PCR Tests
- Antibody Tests
- Antigen Tests
- Hospitalization
- Outcomes
- State Metadata (Note: FIPS codes are numbers which uniquely identify geographic areas.)
These are also explained in dbt SQL files.
The data is available both via an API and in CSV format. The API is available in two versions, and a mapping of fields between the two versions is provided.
In this project, the data is collected via the API; all requests were built and tested in Postman and Python.
Notice: The COVID Tracking Project has ended all data collection as of March 7, 2021. The existing API will continue to work until May 2021, but will only include data up to March 7, 2021.
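As a quick illustration (not the exact code used in the pipelines), a request like the ones tested in Postman can also be made from Python; the v1 national daily endpoint below is assumed from the public API documentation:

```python
# Minimal sketch: query the (now frozen) COVID Tracking Project API from Python.
# The v1 national daily endpoint is an assumption based on the public docs;
# the pipelines in this repo work with the state-level endpoints instead.
import requests

url = "https://api.covidtracking.com/v1/us/daily.json"
response = requests.get(url, timeout=30)
response.raise_for_status()

daily = response.json()  # a list of daily national records
print(f"Fetched {len(daily)} daily US records")
```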
The data pipeline has multiple steps and it runs in batch mode.
- The data is fetched from the API. (Python is used here.)
- It is cleansed (at a basic level) and transferred to the data lake (Postgres is used as the data lake); a sketch of this step is shown after this list.
- The data is loaded from the data lake to the data warehouse. (GCP is used as the cloud solution and BigQuery is the data warehousing tool.)
All the above steps are orchestrated as a workflow in Mage; Mage, Postgres and pgAdmin run in Docker.
- The transformations and data modeling are done with dbt via dbt Cloud, and the transformed data is loaded back into BigQuery.
- The data is visualized in Looker Studio.
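As a rough illustration of the basic cleansing step mentioned above, a Mage transformer block could look like the sketch below; the column names and cleaning rules are illustrative assumptions, not the exact logic in this repo's `transformers` folder:

```python
# A rough sketch of a Mage transformer block for the basic cleansing step.
# Column names and cleaning rules here are illustrative assumptions, not the
# exact logic in the repo's transformers folder.
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def clean_daily_results(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
    # Parse dates and drop rows that have no date or state code.
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
    df = df.dropna(subset=['date', 'state'])
    # Fill missing numeric counts with 0 so downstream aggregations are stable.
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = df[numeric_cols].fillna(0)
    return df
```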
After cloning the repo, change the directory to the `mage` folder:

```bash
cd mage
```
After creating the GCP account here and creating a service account in the IAM & Admin section, copy the JSON key file for the GCP service account into the folder above. (GCP account setup is also explained in this repo.)
Using the `Dockerfile` and `docker-compose.yml`, the containers for the 3 images (Mage, Postgres and pgAdmin) are built and started:

```bash
docker compose up -d
```
The volume for pgAdmin can also be mounted in the `docker-compose.yml` file, like this:

```yaml
volumes:
  - ./data_pgadmin:/var/lib/pgadmin
```
After running the containers, the Mage service is available on port `6789` and pgAdmin on port `8080`. In pgAdmin, the server can be created based on the settings in the `docker-compose.yml` file and the config which can be created in the `.env` file.
The pipelines can be found in the `mage/magic-zoomcamp/pipelines` folder, and the names of the blocks are listed in the `metadata.yaml` file. Also, the `.py` files which form the pipelines in Mage are located in the `data_loaders`, `transformers` and `data_exporters` folders.
First, the `covid_api_metadata_to_postgres` pipeline fetches the state metadata (the `states` table in Postgres) from the API and loads it into Postgres.
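A simplified sketch of what this data loader block does is shown below; the v1 `states/info.json` endpoint and the function name are assumptions, and the actual block lives in the `data_loaders` folder:

```python
# Sketch of a Mage data loader fetching the state metadata from the API.
# The endpoint is assumed from the public v1 API docs; see the data_loaders
# folder for the real block.
import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_states_metadata(*args, **kwargs) -> pd.DataFrame:
    url = 'https://api.covidtracking.com/v1/states/info.json'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # The endpoint returns one record per state/territory (name, FIPS, etc.).
    return pd.DataFrame(response.json())
```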
Then, the `covid_api_data_to_postgres` pipeline fetches the COVID data (the `dailytestresults` table in Postgres) from the API and loads it into Postgres. Note: the data for each state is extracted separately and the results are concatenated, as sketched below.
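The per-state extraction and concatenation can be sketched roughly as follows (the endpoint pattern and the abbreviated state list are illustrative assumptions):

```python
# Sketch of fetching daily results state by state and concatenating them.
# The v1 endpoint pattern and the short state list are assumptions; the actual
# loader iterates over all states and territories.
import pandas as pd
import requests

STATES = ['ca', 'ny', 'tx']  # illustrative subset


def fetch_state_daily(state: str) -> pd.DataFrame:
    url = f'https://api.covidtracking.com/v1/states/{state}/daily.json'
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


daily_test_results = pd.concat(
    [fetch_state_daily(s) for s in STATES],
    ignore_index=True,
)
```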
Finally, the `load_data_from_postgres_to_bigquery` pipeline transfers the data from Postgres (the data lake) into BigQuery (the data warehouse).
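The export side of this pipeline can be sketched with Mage's standard BigQuery exporter template; the table ID and config profile below are placeholders, not the values used in this repo:

```python
# Sketch of a Mage data exporter writing a DataFrame to BigQuery, based on
# Mage's standard template. The table ID is a placeholder.
from os import path

from mage_ai.data_preparation.repo_manager import get_repo_path
from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from pandas import DataFrame

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    table_id = 'your-gcp-project.your_dataset.dailytestresults'  # placeholder
    config_path = path.join(get_repo_path(), 'io_config.yaml')

    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # or 'append', depending on the load strategy
    )
```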
All the steps, as depicted below, exist in the `dbt_covid/models` folder.
The whole model and the dependent `.sql` files can be built and run with `dbt build` and `dbt run`, respectively.
After the transformations in dbt, the `fact_test_result` table is connected to Looker Studio via the Google BigQuery connector, and the dashboard is designed around various metrics.