On-prem DB to Azure Cloud Pipeline with Data Factory, Lake Storage, Spark, Databricks, Synapse, PowerBI
- Project Overview
- Key Insights
- Project Architecture
3.1. Data Ingestion
3.2. Data Transformation
3.3. Data Loading
3.4. Data Reporting - Credits
- Contact
This an end-to-end data engineering project on the Azure cloud. Where I did data ingestion from a on-premise SQL Server to Azure Data Lake using Data Factory to transformation using Databricks and Spark, loading to Synapse, and reporting using PowerBI. Also, I used Azure Active Directory (AAD) and Azure Key Vault for the data monitoring and governance purpose.
AdventureWorks is a database provided by Microsoft for free on online platforms. It is a product sample database originally published by Microsoft to demonstrate the supposed design of a SQL server database using SQL server 2008. Here are some key points to know about AdventureWorks:
- AdventureWorks database supports a manufacturing MNC named Adventure Works Cycles.
- It is a sample Online Transaction Processing (or OLTP) database, which is a type of data processing where multiple transactions occur concurrently. These are shipped by Microsoft with all of their SQL server products.
For this project I used the Lightweight (LT) data: a lightweight and pared down version of the OLTP sample. Download here
- Establish a connection between on-premise SQL server and Azure cloud.
- Ingest tables into the Azure Data Lake.
- Apply data cleaning and transformation using Azure Databricks.
- Utilize Azure Synapse Analytics for loading clean data.
- Create interactive data visualizations and reports with Microsoft Power BI.
- Implement Azure Active Directory (AAD) and Azure Key Vault for monitoring and governance.
-
πΈ Total Revenue by Product Category
- Touring Bikes is the top 1 category generating revenue with 32% followed by Road Bikes with 26% and Mountain Bikes with 24%.
-
π Sales by Country
- NΒ°1: The United Kingdom (UK) have the most total sales with 278 and $572,000 of total revenue.
- NΒ°2: The United States of America (USA) is second with total sales with 264 and $383,810 of total revenue.
-
π» Revenue by Gender
- 81% of the revenue is generated by Male customers
- 19% of the revenue is generated by Female customers
This can be explained by males have more interest in doing outdoor activites with the different categories of Bikes than females.
You can find the detailed information on the diagram below:
- Connected the on-premise SQL Server with Azure using Microsoft Integration Runtime.
- Setup the Resource group with needed services (Key Vault, Storage Account, Data Factory, Databricks, Synapse Analytics)
- Migrated the tables from on-premise SQL Server to Azure Data Lake Storage Gen2.
- Mounted Azure Blob Storage to Databricks to retrieve raw data from the Data Lake.
- Used Spark Cluster in Azure Databricks to clean and refine the raw data.
- Saved the cleaned data in a Delta format; optimized for further analysis.
- Used Azure Synapse Analytics to load the refined data efficiently.
- Created SQL database and connected it to the data lake.
- Connected Microsoft Power BI to Azure Synapse, and used the Views of the DB to create interactive and insightful data visualizations.
- Data Source: SQL Server
- Orchestration: Azure Data Factory
- Ingestion: Azure Data Lake Gen2
- Storage: Azure Synapse Analytics
- Authentication and Secrets Management: Azure Active Directory and Azure Key Vault
- Data Visualization: PowerBI
- This Project is inspired by the video of the YouTube Channel "Mr. K Talks Tech"