Money Diaries is a blog created and hosted by Refinery29. The blog's first post came out in 2016, and each week three new anonymous contributors post about their spending habits. All posts have a similar structure:
- The first section has information on the contributor's occupation, the industry they work in, their age, location, salary, net worth, debt, paycheck amount and cadence, and their pronouns
- The second section covers their monthly expenses
- The third section is a list of background questions
- The fourth section is the diary, where the contributor tracks and shares their spending for a week
- The fifth and final section is the breakdown, where all their spending is aggregated into six categories: Food & Drink, Home & Health, Clothes & Beauty, Entertainment, Transport, and Other
The blog is a rich source of data for anyone who wants to learn about how money shapes our lives. I built a dashboard that summarizes some of the information in the blog, such as the salary distribution, prices over time, and debt vs. net worth. The dashboard can be accessed here.
When running the project with BigQuery, the data flow works as follows:
- Terraform is used to deploy two resources: a Cloud Run service and a BigQuery dataset
- The Cloud Run service deploys a Docker image that contains Mage.ai and dlt
- The data is ingested from the blog using dlt and stored in BigQuery (a minimal sketch follows this list)
- dbt is used to optimize and transform the data and to create three reporting views that will be used by the dashboard
- The dashboard is built with Streamlit. It queries the reporting tables and visualizes the data
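For orientation, a minimal dlt-to-BigQuery pipeline looks roughly like the sketch below. The resource here is a stand-in; the real ingestion logic lives in the Mage.ai pipeline described later in this README.

```python
import dlt


@dlt.resource(name="diary_posts", write_disposition="append")
def diary_posts():
    # Stand-in resource: the real data is scraped from the blog
    # inside the Mage.ai pipeline.
    yield {"url": "https://www.refinery29.com/...", "title": "A Week In ..."}


# The bigquery destination picks up GCP credentials from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS) or from .dlt/secrets.toml.
pipeline = dlt.pipeline(
    pipeline_name="money_diaries",
    destination="bigquery",
    dataset_name="money_diaries",
)
print(pipeline.run(diary_posts()))
```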
When running the project with PostgreSQL on NEON, the data flow works as follows:
- The data is ingested from the blog using dlt and stored in PostgreSQL (see the destination sketch after this list)
- dbt is used to optimize and transform the data and to create three reporting views that will be used by the dashboard
- The dashboard is built with Streamlit. It queries the reporting tables and visualizes the data
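Relative to the BigQuery flow, only the dlt destination changes. A sketch, assuming the Neon connection string is exposed through an environment variable of my choosing:

```python
import os

import dlt

# Assumption: NEON_DATABASE_URL holds the Neon connection string, e.g.
# "postgresql://user:password@host/dbname". dlt reads destination
# credentials from environment variables with this naming scheme
# (or from .dlt/secrets.toml).
os.environ["DESTINATION__POSTGRES__CREDENTIALS"] = os.environ["NEON_DATABASE_URL"]

pipeline = dlt.pipeline(
    pipeline_name="money_diaries",
    destination="postgres",
    dataset_name="money_diaries",
)
```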
The following cloud resources are provisioned using Terraform:
- Mage.ai with dlthub
- BigQuery dataset
See instructions in the Reproducibility section for details on how to deploy and run these resources.
Data is ingested from the blog using a pipeline that is created in Mage.ai. The DAG steps are:
- Scrape the blog's home page* and load its JSON objects into BigQuery using dlt (sketched below)
- Extract the blog post URLs from the `diary_links__rows__entities` table
- Scrape the blog posts and load their JSON contents into BigQuery using dlt
* The Refinery29 website's robots.txt file has no explicit restrictions on scraping the Money Diaries blog posts. You can check it here
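As an illustration of the first step, the sketch below scrapes the home page and hands its embedded JSON to dlt, which normalizes nested objects into child tables; that naming scheme is where tables like `diary_links__rows__entities` come from. The URL and the assumption that the page embeds its data in `<script type="application/json">` tags are mine, not taken from the actual Mage blocks:

```python
import json

import requests
from bs4 import BeautifulSoup

import dlt

HOME_URL = "https://www.refinery29.com/en-us/money-diary"  # assumed hub URL


@dlt.resource(name="diary_links", write_disposition="replace")
def diary_links():
    # Assumption: the home page embeds its content as JSON inside
    # <script type="application/json"> tags; the real Mage block may
    # parse a different structure.
    soup = BeautifulSoup(requests.get(HOME_URL, timeout=30).text, "html.parser")
    for script in soup.find_all("script", type="application/json"):
        if script.string:
            yield json.loads(script.string)


# dlt flattens the nested JSON into child tables, producing names such
# as diary_links__rows__entities.
pipeline = dlt.pipeline(
    pipeline_name="money_diaries",
    destination="bigquery",
    dataset_name="money_diaries",
)
pipeline.run(diary_links())
```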
The ingested data is stored in a BigQuery data warehouse. The dbt transformations and optimizations read from this warehouse, as does the Streamlit dashboard.
dbt is used to create optimized tables that hold the data used by the dashboard. For more details on how to partition and cluster tables in dbt, check the README under the dbt directory.
dbt also creates three reporting views that are used by the dashboard.
The dbt pipeline involves three stages:
- Staging: The four tables that will stage the data for the dashboard are created and optimized (using clustering and partitioning)
- Core: The tables are joined together to create one source of truth table that will be used for reporting
- Reporting: Three views are created to serve data to the dashboard charts
The dashboard has the following charts (a condensed Streamlit sketch follows the list):
- Metric overlays showing average prices per category for the most recent month, with the delta from the month prior
- Salary distribution histogram
- A line chart showing the average category prices per month over time
- Debt-Net worth diverging bar chart
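The sketch below shows how two of these charts could be served from a reporting view, assuming BigQuery as the backend; the view name `reporting_category_prices` and its columns are illustrative, not the project's actual names:

```python
import streamlit as st
from google.cloud import bigquery

client = bigquery.Client()  # uses ambient GCP credentials

# Hypothetical reporting view and column names.
QUERY = """
    SELECT month, category, avg_price
    FROM `my-project.money_diaries.reporting_category_prices`
    ORDER BY month
"""
df = client.query(QUERY).to_dataframe()

months = sorted(df["month"].unique())
latest, prior = months[-1], months[-2]

# Metric overlays: latest month's average price per category,
# with the delta from the prior month.
for col, (category, group) in zip(
    st.columns(df["category"].nunique()), df.groupby("category")
):
    now = group.loc[group["month"] == latest, "avg_price"].mean()
    before = group.loc[group["month"] == prior, "avg_price"].mean()
    col.metric(category, f"${now:,.2f}", delta=f"{now - before:+,.2f}")

# Average category prices per month over time.
st.line_chart(df.pivot(index="month", columns="category", values="avg_price"))
```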
To recreate this dashboard, follow the instructions here
Future improvements:
- Create a streaming pipeline that extracts new posts
- Create unit tests for all steps
- Build a second pipeline that uses OCR to extract breakdown data (the scraped data has a lot of null values)
- Add filters and slicers to the dashboard
- Use the stored data to build a financial RAG
- Store reactions to enrich the data
TODO:
- Review the numbers
- OCR on Mage
- Move CPI to Mage
- Notes on Mage debugging
- Seed official inflation records
- Use the Google OCR API to extract data from images when it is not available in the scraped data
- Modify the dbt models to accommodate image extractions and inflation seeds
- Data governance in dbt: 1. add metadata to enhance discoverability
- Data governance in dbt: 2. add data quality checks to enhance data quality and integrity
- Fix the net worth regex on Mage
- Net worth calculation: use net worth if it exists, else use assets to calculate it (see the sketch after this list)
- Move dbt to Mage
- Use the CPI code to calculate oral inflation
- Move the debt-wealth chart to fig 1
- Add an official-vs-oral inflation chart
- Run the project over a longer period and ensure that it works smoothly
- Fix the net worth regex
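For the net worth item above, the intended fallback is presumably something like this hypothetical helper (assuming net worth is derived as assets minus debt when not reported):

```python
from typing import Optional


def compute_net_worth(reported: Optional[float], assets: float, debt: float) -> float:
    # Hypothetical helper: prefer the contributor's reported net worth;
    # otherwise derive it from assets minus debt.
    return reported if reported is not None else assets - debt
```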
Known issue: the pipeline fails when run with a trigger but succeeds when executed step by step, and the error message is unclear:

```
Pipeline execution failed at stage sync with exception:

<class 'dlt.destinations.exceptions.DatabaseTransientException'>
SSL SYSCALL error: EOF detected
```