This long article will help you prepare for the DP-200 exam. The first part covers all the steps to build an analytics solution from scratch. As a bonus, you will also get a cool demo to showcase to your customers.
The second part of this article provides some tips and tricks to help you pass the exam. I highly recommend that you “follow the white rabbit”.
To build our solution, we will use several services from the Azure data platform:
- Azure Data Lake Storage Gen2, for storage
- Azure Data Factory, for orchestrating the production pipeline
- Azure Databricks, for experimentation, innovation and transformations
- Azure SQL Database, for the data presentation layer
- Power BI Desktop, for data exploration and reporting
Of course we will add some infrastructure components as well:
- Azure Active Directory
- Azure Key Vault
Before we begin, let's do a quick review of Azure Data Factory and Azure Databricks.
Azure Data Factory v2 is a fully managed cloud service that enables the orchestration of data movement and transformation processes. Fully integrated with Azure, it can invoke other services during data movement, such as Azure Databricks or Logic Apps, to transform and enrich your data. Azure Data Factory v2 also integrates with Azure DevOps for continuous integration and continuous delivery. In addition, the ability to export pipelines as ARM templates makes it very easy to deploy them to other environments.
Based on Spark, Azure Databricks is a workspace that fosters collaboration between the company's stakeholders to bring innovation through continuous experimentation. Being integrated with Azure DevOps, Azure Databricks fits neatly into a CI/CD process, as illustrated in Benjamin Leroux's post. Azure Databricks enables collaboration around notebooks, which support several languages such as Python, Scala and SQL, and allows these notebooks to be run both manually and automatically, via "jobs" orchestrated by Azure Data Factory.
The notebook used for this example will do the following:
- Read files deposited by Azure Data Factory into the data lake (Azure Data Lake Storage Gen2)
- Process files to extract information
- Create an archive file in the data lake
- Insert the results into a SQL Database table for visualization
The notebook used for this example is available here or here
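To give you an idea of what the notebook does before you open it, here is a minimal PySpark sketch of the same flow. This is not the original notebook: the storage account, container, SQL server, login, secret scope ("KeyVaultScope"), secret name ("sql-password") and column names below are placeholders or assumptions on my part, and the ADLS Gen2 access configuration shown later in this article must already be in place.

# Minimal sketch of the notebook flow (spark and dbutils are provided by the Databricks runtime).
from pyspark.sql.functions import split, col, lit

year, month, day = "2019", "03", "16"          # in the real notebook these come from the base parameters
base = "abfss://<container>@<storage-account>.dfs.core.windows.net"   # placeholder data lake path

# 1 - Read the raw pageview files deposited by Azure Data Factory (gzip is handled transparently)
raw = spark.read.text(f"{base}/demo_datasets/")

# 2 - Extract the information (each line is "project page_title view_count bytes")
parts = split(col("value"), " ")
logs = (raw.select(parts.getItem(1).alias("Article"),
                   parts.getItem(0).alias("Language"),
                   parts.getItem(2).cast("int").alias("NbHit"))
           .withColumn("Year", lit(year))
           .withColumn("Month", lit(month))
           .withColumn("Day", lit(day)))

# 3 - Create an archive of the results in the data lake
logs.write.mode("overwrite").csv(f"{base}/wikipedia_results/")

# 4 - Insert the results into a SQL Database table for visualization
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-database>"
(logs.write.format("jdbc")
     .option("url", jdbc_url)
     .option("dbtable", "WikipediaLogs")
     .option("user", "<sql-user>")
     .option("password", dbutils.secrets.get(scope="KeyVaultScope", key="sql-password"))  # assumed secret
     .mode("append")
     .save())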
In modern data platform architectures, it is important to separate the storage layer from the data processing engine, and that is exactly what we will do in this example. We will organize our data lake into several zones: one to preserve the original state of the data, one staging (temporary) area, and one to store the end result of the transformations, as shown in the illustration below.
By the end of this article we will have implemented the following architecture. Notice the separation of storage from the compute engine.
A resource group is a logical grouping of Azure services. For this article, we will group all the services used to build our solution in one resource group. Nothing prevents you from organizing your Azure services in a different way.
From the Azure portal, on the left, click on "Resource groups" then on the "Add" button.
Complete the creation form and click on "Review + Create". For this article, we will name the resource group "Paisley-Park".
In the previously created resource group, we will add a storage service. For this example we will choose the new service "Data Lake Store Gen 2".
From the Azure portal, click on "Create a resource", "Storage" and then "Storage account"
Then fill out the creation form. Here, I decide to create my data lake in the "Paisley-Park" resource group. For this article, I will name my data lake "thevaultgen2".
Then click on the "Advanced" tab, then in the "Data Lake Storage gen2" section, click on "Enabled". This will enable new features related to the new version of our data lake.
If necessary you can add tags by clicking on the tab "Tags". Tags are useful for administering and tracking the costs of your Azure services.
When done, click on the "Review + create" button
Then on the "Create" button
From the Azure portal, click on "Create a resource", "Database", then "SQL Database"
Fill out the database creation form and click on "Review + Create". For this article, I will call the database "Musicology", and I will keep the pricing tier at Basic. Nothing prevents us from changing this performance level later according to our needs (the principle of elasticity). This change can also be done programmatically with the SERVICE_OBJECTIVE argument, a point we will come back to later in the article.
On the "Review + create" page, click on the "Create" button
Once the database is created, consider configuring the server firewall if you want to connect with SQL Server Management Studio from a remote machine. Refer to the documentation.
From the portal, click on "Create a resource", "Integration" and then "Data Factory"
Fill out the creation form and click on the "Create" button
From the Azure portal, click "Create a resource", "Analytics" and "Azure Databricks"
Fill out the creation form. We will choose the Premium tier for the integration with Azure Key Vault. The Premium tier also supports role-based access control; refer to the pricing tiers.
As recommended in the Databricks documentation: "Your account must have the Azure Databricks Premium Plan for you to be able to select Creator. This is the recommended approach: grant MANAGE permission to the Creator when you create the secret scope, and then assign more granular access permissions after you have tested the scope. For an example workflow, see Secret Workflow Example."
For this article, I will name my Azure Databricks workspace "CrystalBall". Click on the "Create" button
From the Azure portal, click on "Create a resource", then in the search bar enter "Key Vault"
Choose "Key Vault" from the Marketplace and click on the "Create" button.
Fill out the creation form and click on the "Create" button
If you have placed all your resources in the same resource group, you will see a result similar to the screenshot below:
## Creation of an application and a service principal
In this section of the article, we will create the security elements that allow Azure Databricks to access our Data Lake gen2. The general procedure can be found here.
In our Azure Active Directory, we will create and register an application.
From the Azure portal, click on "Azure Active Directory" and then on "App registration".
Click on the "New registration" button
Fill out the registration form and click on the "register" button
The application is now created. You should see a screen similar to the one below.
**WARNING!** Note your "Application (client) ID" as well as your "Directory (tenant) ID" (which I have hidden in my screenshot). You will need them later.
Click on "Certificates & secrets" and then on "New client secret". This client secret is also referred to in this article as the "authentication-id".
Fill out the creation form and click on the "Add" button
Copy the previously generated secret by clicking on the copy icon
Return to your resource group and select your storage account
Then click on "Access Control (IAM)" (Actually, we are in the process to define access management for Cloud resources. To manage access control, we use what we called Role-based access control (RBAC). This article will give you a good overview of what RBAC is. )
In the "Add a role assignment" box, click on the "Add" button
In the "Add role assignment" window, select the following items:
- Role: Storage Blob Data Contributor
- Select: The name of the application created previously. Here, "FranmerTheVaultGen2".
Then click on the "Save" button
In order for Azure Databricks to have access to Data Lake Gen2, we will need the following information:
- storage account name
- Application-id
- authentication-id
- tenant-id
As data engineers and data scientists work in notebooks, the best practice is to avoid having secrets appear in clear text in the code. This is where Azure Key Vault comes into play to protect passwords and logins. For this example we will store only the different IDs.
Click on your Azure Key Vault:
Click on "Secrets" then on the "Generate / Import" button
We will create the secrets for our 3 IDs. Below is an example for "application-id".
Click on the "Create" button. Repeat these steps for the other 2 IDs.
After creating your 3 secrets, you should have a result like the screenshot below:
A little further down in this article, we'll add a secret for the integration between Azure Data Factory and Azure Databricks
We will now configure Azure Databricks so that it can use the secrets stored in Azure Key Vault. For information, the complete documentation is here.
Return to your resource group and click on your Azure Databricks service:
Click on the button "Launch Workspace"
Once in your workspace, note the URL of your Azure Databricks:
Then compose a new URL by adding "#secrets/createScope" (note the capital "S"!). In my case the URL will be: ("https://eastus2.azuredatabricks.net#secrets/createScope")
The following window opens; this is where we will create references to the secrets stored in our Azure Key Vault
To do this, we will need the following information from our Azure Key Vault:
- DNS Name
- Resource ID
This information can be found in the properties of Azure Key Vault (see screenshot below)
Once you have this information, create the "Secret Scope" as shown below. After entering all the required information, click on the "Create" button
If all goes well, you should see the following message. Click on the "OK" button to validate.
In this next section, we will download the notebook (DatabricksNotebook_Wikipedia_ADLSGen2_Generic.dbc) which is at the following address: https://1drv.ms/u/s!Am-C-ktMH9lgg9MCqqG5dS8XKFKxDA or https://github.com/franmer2/HelpForDP200/tree/master/resources.
Note that Databricks notebooks have the DBC file extension.
You will then have to modify the notebook to add your own information: the application ID, the data lake, and your SQL Database.
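For reference, the part of the notebook you will adapt is the ADLS Gen2 access configuration, which typically looks like the sketch below. The secret scope name ("KeyVaultScope") is an assumption; the secret names are the ones created earlier in this article.

# Sketch of the ADLS Gen2 access configuration with a service principal (OAuth).
application_id = dbutils.secrets.get(scope="KeyVaultScope", key="application-id")
authentication_id = dbutils.secrets.get(scope="KeyVaultScope", key="authentication-id")
tenant_id = dbutils.secrets.get(scope="KeyVaultScope", key="tenant-id")

storage_account = "thevaultgen2"   # the storage account created earlier

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", application_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", authentication_id)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")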
From your Azure Databricks workspace, navigate to where you want to import the notebook. Click on the down arrow to bring up the context menu and click on "Import".
Check the "File" box, enter the path to the notebook and click on the "Import" button
If all goes well, the notebook should be in your workspace as shown below
========================================
Later in this article, we will see how to use this notebook with an interactive cluster. It is a good idea to test the notebook this way before running it in production through Azure Data Factory, especially if you changed the names of the Azure resources, the data lake zones/layout, or the Key Vault secrets.
========================================
We will create the access tokens for Power BI and Azure Data Factory. At the top right of your workspace, click on the user icon and then on "User Settings"
Click on the "Generate New Token" button
Give a description to the Token, then indicate its lifetime. Click on the "Generate" button.
WARNING! Copy your token somewhere safe, because after pressing the "Done" button it will no longer be accessible.
Repeat the operation to create a token for Power BI. You should have a result similar to the screenshot below:
Note: the token for Azure Databricks needs to be added to Azure Key Vault. We will add it as the secret "Databricks"
=============================================================
Below is a screenshot after adding the "Databricks" secret in Azure Key Vault:
=============================================================
We will now organize our data lake, with an area for the raw data from Wikipedia and an area for the results of the analyses
The "demo_datasets" zone will be created manually. The "wikipedia_results" zone will be created automatically by Azure Databricks. To interact with Azure Data Lake Storage Gen2, you can use Azure Storage Explorer. From Azure Storage Explorer, sign in to your Azure account, find your data lake, right-click on it, then create a container named "wikipedia".
You should obtain a result similar to the screenshot below:
OPTIONAL (Azure Data Factory can create the folder automatically if it is not already present). In your "wikipedia" container, create the directory "demo_datasets".
Click on the "New Folder" button
You should obtain the result below:
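If you prefer to script this step rather than use Azure Storage Explorer, below is a minimal sketch using the azure-storage-file-datalake Python package (my choice of tool here; it is not used elsewhere in this article). Authenticating with the storage account key keeps the example short; in real life, keep the key in Azure Key Vault.

# Sketch: create the "wikipedia" container and the "demo_datasets" folder programmatically.
# pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

account_name = "thevaultgen2"
account_key = "<storage-account-key>"   # placeholder

service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=account_key,
)

file_system = service.create_file_system(file_system="wikipedia")
file_system.create_directory("demo_datasets")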
The pipeline is quite simple for this example and consists of the following sequence of activities, driven by input parameters (parameterization being, by the way, one of the elements of DataOps):
- Test the presence of the "demo_datasets" folder in the storage account
- Delete files from the raw area (demo_datasets) of the data lake
- Download log files (depending on user settings)
- Invoke an Azure Databricks notebook to transform the data and write the results to another area of the data lake and to the SQL Database.
In the end, we will get a pipeline like the one shown below:
From the Azure portal, connect to your Azure Data Factory service
Click on "Author & Monitor"
(If you are using an MSDN subscription and are facing an infinite white screen, try opening your browser in private mode.)
We will create 4 Linked Services for:
- Azure Data Lake storage Gen2
- The Wikipedia log site
- Azure Key Vault
- Azure Databricks
In this section of the article, we will create 2 Linked Services. The other 2 will be created during dataset creation, to illustrate another way to create them.
We will start by creating a Linked Service to the "Azure Key Vault" service.
From the authoring interface, click on the pencil on the left, then on "Connections"
Click on the "New" button
Search and select "Azure Key Vault". Click on the "Continue" button
Enter information about your Azure Key Vault service. Click on "Test connection" then on "Finish"
=============================================================
WARNING! You may have noticed the "Edit Key vault" message just below the "Azure key vault name" field. You need to add the Data Factory's managed identity (the "Service identity application" ID) to the "Access policies" section of the Azure Key Vault service.
In your Azure Key Vault service, click "Access policies" and then the "Add new" button to search for the service (using the Service identity application ID) and give it access rights. The "Get" right on "Secret permissions" is sufficient
=============================================================
After creating your linked service, you should get a result similar to the one shown below:
Click on the "Publish all" button
We will repeat the operation to create the Linked Service to Azure Databricks
Fill in the login information for Azure Databricks. Use the secret we created earlier in Azure Key Vault
To illustrate another way of creating linked services, we will create the other 2 at the same time as the datasets, in the next section.
We will now create 3 "datasets":
- 1 to retrieve Wikipedia logs (http), also with parameters
- 1 for the data lake
- 1 for the same data lake but with parameters
The data type for all these files is binary.
Click on the "+" sign, then on "Dataset"
In the "New Dataset" section, look for "http", then click on the "http" icon and then on the "Finish" button
In the "General" tab, define a name for your dataset.
Then click on "Parameters".Click on the "New" button each time to add a parameter
Enter the following parameters in "String" format:
- YearDS
- MonthDS
- DayDS
- HourDS
Click on the "Connection" tab
Here we will see how to create a Linked Service from the creation phase of the dataset. In the "Connection" section click on "New" to create a linked service to the Wikipedia logs site
Give a name to your linked service. In the field, base URL, enter the following URL: https://dumps.wikimedia.org/other/pageviews/.
Click on "Disable" and select "Anonymous" in the "Authentication type" fieldTest your connection and click on the "Finish" button
Once the linked service is created, you return to creating your dataset.
Click in the field "Relative URL" then on "Add dynamic content"
Copy the string below into the "Add Dynamic Content" field and click on "Finish" (if you get an error after the copy/paste, double-check the single quotes):
@concat(dataset().YearDS,'/',dataset().YearDS,'-',dataset().MonthDS,'/pageviews-',dataset().YearDS,dataset().MonthDS,dataset().DayDS,if(less(int(dataset().HourDS),10),'-0','-'),dataset().HourDS,'0000.gz')
Then check the "Binary copy" box
Click on the "Publish All" button
Now, let's create a second Dataset for our data lake. Repeat the creation procedure as previously seen, then select "Azure Data Lake Storage Gen2"
Give a name to this new dataset.
Click on "Connection" and then on the "New" button to create a linked service to our data lake.
In the "New Linked Service" window, select your storage account from your Azure subscription:
- The "Tenant" field should automatically be filled.
- The "Service Principal ID" is the one we created at the beginning of the article
- Select "Azure Key Vault"
- Select your "Azure Key Vault" service
- Indicate the name of the secret that protects your service principal's key (in my example: "authentication-id")
Click on the "Finish" button
Now that the linked service to our data lake has been created, we will continue configuring our dataset. In the "File path" field, we will enter the path "wikipedia/demo_datasets".
Check the "Binary copy" box
We will repeat the following procedure to create a dataset on the same data lake, but by adding parameters. We will use the same linked service. On the second dataset of the data lake add the following parameters:
- YearDS
- MonthDS
- DayDS
- HourDS
Then in the "Connection" tab, define the path as shown below by adding the following expression in the "File" field:
@concat('pageviews-',dataset().YearDS,dataset().MonthDS,dataset().DayDS,if(less(int(dataset().HourDS),10),'-0','-'),dataset().HourDS,'0000.gz')
At the top of the screen, click on the "Publish all" button.
Click on the "+" sign and then on "Pipeline"
In order to make our pipeline more agile, we will add parameters, which will allow us, for example, to process the logs according to the date and times that interest us.
Click in a free space in the Pipeline Editor, then click on "Parameters" then on "New" (click 4 times on "New" to add 4 parameters)
Add the following parameters:
Name | Type |
---|---|
Year | String |
Month | String |
Day | String |
Hours | Array |
In the example, I added some default values (notably for the Array parameter), but these default values are not required.
As the "Delete" activity did not yet exist in the list of available activities, we will create it in code. (Edit: the Delete activity is now available through the UI, but you can still do it via the code, just for practice.)
Click on the "Code" button to display the code editor
Copy and paste the script below into the editor. Remember to change the name of the dataset (in my example it is "AzureDataLakeStorageFile1", the data lake dataset without parameters).
Click on the "Finish" button
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Delete Activity",
"type": "Delete",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "<Your Dataset Name>",
"type": "DatasetReference"
},
"recursive": true,
"enableLogging": false,
"maxConcurrentConnections": 1
}
}
]
}
}
You should see a new activity in your pipeline:
The screenshot below shows how to create the "Get Metadata" activity.
Add the "Exists" argument; it will be tested by another activity.
Drag and drop the "If Condition" activity from the left menu to the pipeline. Add a link between the "Get Metadata" activity with the "If Condition" activity.
Click on the "If Condition" activity. In "settings" tab, clic on "Add dynamic content"
Add the following content:
@activity('Get Metadata1').output.Exists
Right click on the "Delete" activity and click on "Copy"
Click again on the "If Condition" activity and then on the "Activities" tab.Then click on the "Add if True Activity" button
Right click in the central part to copy the activity. Click on "Paste"
The "Delete" activity is copied to your "If condition" activity.
Click on the "Pipeline1" (or another name if you renamed your pipeline) link to return to the top level of your pipeline and remove the "Delete1" activity.
We will now create a loop to download the files corresponding to the number of hours that we want to recover. This number will be defined via the parameters.
From the list of activities, on the left, drag and drop the activity "ForEach" in the editor. Then link the "If condition" activity to the "ForEach" activity.
Click on the "ForEach" activity and then "Settings". Click in the "Items" field then click on "Add dynamic content"
Add the following content:
@pipeline().parameters.Hours
Then click on the "Finish" button
The "Hours" parameter is now available in the "Item" section of the "ForEach" activity. This loop will now be repeated according to the number of hours that will be defined in the "Hours" parameter.
Double-click on the "ForEach" activity to add activities inside the loop. Add the "Copy Data" activity inside the "ForEach" loop
Click on "Source" and select the Dataset "Logs_Wikipedia". Pass the settings as shown in the screenshot below.
- YearDS : @pipeline().parameters.Year
- MonthDS : @pipeline().parameters.Month
- DayDS : @pipeline().parameters.Day
- HourDS : @item()
(For the hours, the value actually comes from the "ForEach" activity's current item.)
Click on "Sink" tab. In the "Sink Dataset" field, choose the data lake dataset with the parameters.
Then fill in the parameters with the values as shown in the screenshot below.
Once the changes are complete, click on the "Publish All" button
Now we are going to add the Databricks Notebook activity, as shown below
Once the activity has been added, link it to the output of the "ForEach" activity
Click on the "Notebook" activity, then on "Azure Databricks". Choose the linked service created previously and click on "Test connection":
Then click on "Settings" and then on the "Browse" button
Then click on "Base Parameters" to create the parameters that will be passed to your Azure Databricks Notebook.
Use the following names exactly; if you want to change the names of the base parameters, you will have to edit the notebook accordingly (a sketch of how the notebook can read them follows the table):
Base parameter name |
---|
yearWiki |
monthWiki |
dayWiki |
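Inside the notebook, these base parameters are exposed as Databricks widgets. Below is a minimal sketch of how a notebook typically reads them; the default values are arbitrary and only the widget names matter (they must match the table above).

# Sketch: reading the base parameters passed by Azure Data Factory through widgets.
dbutils.widgets.text("yearWiki", "2019")
dbutils.widgets.text("monthWiki", "03")
dbutils.widgets.text("dayWiki", "16")

year = dbutils.widgets.get("yearWiki")
month = dbutils.widgets.get("monthWiki")
day = dbutils.widgets.get("dayWiki")

print(f"Processing Wikipedia pageviews for {year}-{month}-{day}")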
Once the settings are complete, click on "Publish All"
Now we will test our pipeline. Click on the "Debug" button:
The parameter page opens. Fill in the fields with consistent values and click on the "Finish" button:
Execution can be monitored in the "Output" part of the pipeline:
On the Azure Databricks side, the notebook is run via a job
Hopefully, on the Azure Data Factory side, your pipeline should show "Succeeded" in the "Status" column
We will now use Power BI Desktop to connect to our SQL Database. At the time of writing this article (17/03/2019), the Power BI connector for Azure Data Lake Storage Gen2 is not yet available. From Power BI Desktop, click the "Get Data" button and select "SQL Server"
Fill in the information from your database server, which can be found on the Azure portal as shown below:
Choose whether you want to import the data or use DirectQuery mode. Then click on the "Ok" button.
Choose how you want to connect to your database. In this example, I chose a connection with a "Database" account.
Click on the "Connect" button
Select the table "WikipediaLogs", then click on the "Load" button
The fields will appear on the right side of Power BI Desktop.
You now have all the tools to create a nice report on the results of the Wikipedia log analysis. Here is an example of a report:
It would also be good to have the report updated automatically, with a daily refresh of the Wikipedia data. We can achieve that in 2 steps:
- Schedule a daily Azure Data Factory pipeline execution
- Schedule a daily automatic refresh of the Power BI dataset
Wouldn't it be nice to have our pipeline run every day to feed our database and keep the report up to date? Azure Data Factory lets you define triggers that run pipelines according to rules and schedules you define.
Go back to Azure Data Factory and jump into your pipeline.
Click on "Add trigger" and "New/Edit"
A new pane appears on the right. Click on the little arrow to open the drop-down list and click on "+New"
Define the trigger parameters to fire once a day, as shown in the screenshot below. Then click on the "Next" button:
In the Edit Trigger pane, enter the parameters as shown below and click "Finish"
Parameter | Value |
---|---|
Year | @formatDateTime(utcnow(),'yyyy') |
Month | @formatDateTime(utcnow(),'MM') |
Day | @adddays(utcnow(),-1,'dd') |
Hours | [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] |
(The pipeline may fail at the beginning of each month: Year and Month are taken from the current date while Day is taken from the previous day, so on the first day of a month the computed file name does not exist. One possible fix, not tested here, would be to derive Year and Month from adddays(utcnow(),-1) as well, for example @formatDateTime(adddays(utcnow(),-1),'yyyy'). There is also the delay between log availability on the Wikipedia site and UTC time to take into account.)
Now, your trigger is created and can be monitored in the Azure Data Factory "Monitor" section
If you need more analytics capability, you can also use the "Azure Data Factory Analytics" feature (still in preview at the time of writing, July 29, 2019)
Here is a sample of my pipeline being monitored with Log Analytics
If I click on the error indicator, I can see clearly when and where the error occurs
I highly recommend having a look at this article.
Concerning our Data Lake Gen2 storage, we will secure it by using the Advanced Threat Protection feature.
From the Azure portal, click on your Data Lake Gen2 resource and then click on "Advanced security". Then click on "Enable Advanced Threat Protection"
A new window appears and will give you recommendations and security alerts. You can also get all recommendations by clicking on the "View all recommendations in Security Center" button.
To secure your SQL database, you can use the features listed below:
- Always Encrypted (we will see this feature later in the article)
- Dynamic Data Masking
- Row Level Security
- Advanced Data Security
This article will give you a good overview of what Dynamic Data Masking is.
You can setup dynamic data masking through Azure portal. From your SQL Database resource, click on "Dynamic Data Masking"
Click on "Add mask" button
Choose a column and define a mask and click on the "Add" button
The new field is ready to be masked. To validate, click on the "Save" button.
You can also set up dynamic data masking with a T-SQL script. More info in this article
You can also set up dynamic data masking with PowerShell or the REST API. More details in this article
Whichever option you pick to implement masking, it is very important to understand the distinctions between the different masking functions, as explained in this article
Row-Level Security enables you to use group membership or execution context to control access to rows in a database table.
This article will give you more details on this feature
You can try that article's examples against the SQL database created earlier in this walkthrough.
Through the Azure portal, you can leverage Advanced Data Security capabilities, at the database level, for:
- Data Discovery and Classification
- Vulnerability Assessment
- Advanced Threat Protection
At the Azure SQL server level, you can also define periodic recurring scans and get the report by email
Below is a screenshot of an email I receive each week:
You can activate the auditing feature. You can choose the log destination between:
- Storage account
- Log Analytics
- Event Hub (for real time log analysis scenario, for instance)
If you choose Log Analytics, below is a sample of what you can get via Azure Monitor
The solution we created is based on a high-level pattern that can be adapted to the project requirements. For example, what if we had to process massive amounts of data in our SQL database? To help you with DP-200, let's change the architecture a little to introduce Azure SQL Data Warehouse with PolyBase. Pay attention to the sentences with the white rabbit; my advice would be "Follow the white rabbit" ;)
[Azure SQL Data Warehouse](https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-what-is) (our Massively Parallel Processing (MPP) enterprise data warehouse) has a cool concept called PolyBase. PolyBase allows you to create external tables that hold only the schema, while the data stays in its original storage. In our case, we will create an Azure SQL Data Warehouse service and create an external table with the Wikipedia data schema that points to our data lake (to the results zone).
From the Azure portal, click on the plus sign, then "Databases", then "SQL Data Warehouse"
Fill out the form to create the data warehouse. Choose the SQL server you created previously and set the performance level to the minimum; it will be enough for our test. Click on the "Review + create" button.
It should take a few minutes. When done, click on the "Go to resource" button.
Now, we will configure PolyBase to access the data stored in the data lake
You can connect to your data warehouse with several tools, such as:
- SQL Server Management Studio (SSMS)
- Visual Studio
- Azure Data Studio
In our case, we will use SSMS. The connection window is shown below. (If you can't connect to your server, check the Azure SQL server firewall settings.)
In the server name field, enter the name of the SQL server on which you deployed your data warehouse. In our case, the server is called "fallinlove2nite.database.windows.net". For authentication, use the credentials you created during the server creation step above.
===================================================================
You can get the server name from the portal, in the SQL data warehouse resource "Overview" screen
===================================================================
Below is the sequence of steps to configure PolyBase:
- Optional: Create a user for loading data
- Create a Master Key
- Create a database scoped credential
- Create an external data source
- Create External File Format
- Optional: Create schema for external table
- Create external table
- Optional: Create statistics
All the steps above are well detailed in this article
If not yet created in your SQL Server, create a new master key
CREATE MASTER KEY;
At the time of writing this article (July 2019), PolyBase can connect to Data Lake Storage Gen2 only with a storage account key.
You can use whatever you want for IDENTITY in the script below:
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = 'user',
    SECRET = '<azure_storage_account_key>';
Now we need to create an external data source pointing to our Data Lake Gen2. Below is generic code to connect to ADLS Gen2:
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    -- Please note the abfss endpoint for when your account has secure transfer enabled
    LOCATION = 'abfs[s]://<container>@<AzureDataLake account_name>.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);
In our case, the T-SQL script looks like:
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://wikipedia@thevaultgen2.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);
To be able to query our data, we also need to specify the external file format:
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
In our case:
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
Now we will create the external table to query our data stored in the data lake. In our case this is the T-SQL script:
CREATE EXTERNAL TABLE [dbo].[Wikipedia_external] (
    [Article] [nvarchar](500) NULL,
    [Language] [nvarchar](10) NULL,
    [Nb Hit] [int] NULL,
    [Year] [nvarchar](10) NULL,
    [Month] [nvarchar](10) NULL,
    [Day] [nvarchar](10) NULL
)
WITH (
    LOCATION = '/wikipedia_results/',
    DATA_SOURCE = AzureDataLakeStorage,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = Percentage,
    REJECT_VALUE = 90,
    REJECT_SAMPLE_VALUE = 200
);
Now you can query your external table. Below is a very simple query that gets the results directly from our data lake:
SELECT * FROM [Wikipedia_external];
To round out your PolyBase skills, I highly recommend reading this article
It is a good idea to monitor your Azure SQL Data Warehouse, especially its cache usage. For example, you can track the "DWU used" metric to know whether your data warehouse needs more compute capacity.
This is an article I highly recommend reading.
And find room in your brain for the table in that article covering cache hit percentage and cache used percentage, and the recommended action for each combination.
There are several ways to back up SQL data, but 2 of them deserve our attention: BACPAC and DACPAC. Below is a quick description of each.
- A BACPAC is a Windows file with a .bacpac extension that encapsulates a database's schema and data. The primary use case for a BACPAC is to move a database from one server to another, to migrate a database from a local server to the cloud, or to archive an existing database in an open format.
- A DACPAC is focused on capturing and deploying schema, including upgrading an existing database. The primary use case for a DACPAC is to deploy a tightly defined schema to development, test, and then production environments, and also the reverse: capturing production's schema and applying it back to test and development environments. In other words, a DACPAC only contains the schema, not the data.
You may need to get information about your SQL database edition and service objective.
To get this information, you can use the following T-SQL script in SSMS:
SELECT DATABASEPROPERTYEX('Musicology', 'Edition') AS Edition,
       DATABASEPROPERTYEX('Musicology', 'ServiceObjective') AS ServiceObjective;
You may also need to modify the service objective via T-SQL; in that case, use the SERVICE_OBJECTIVE argument as in the script below:
ALTER DATABASE [Musicology] MODIFY(EDITION = 'standard', MAXSIZE = 100 MB, SERVICE_OBJECTIVE = 'S0');
(You may have to wait a few minutes for the change to take effect.)
If you run the first T-SQL script again, you will see the new edition and service objective
This article gives a good definition of what Always Encrypted is:
"Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases. Always Encrypted allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to the Database Engine ( SQL Database or SQL Server). As a result, Always Encrypted provides a separation between those who own the data (and can view it) and those who manage the data (but should have no access). By ensuring on-premises database administrators, cloud database operators, or other high-privileged, but unauthorized users, cannot access the encrypted data, Always Encrypted enables customers to confidently store sensitive data outside of their direct control."
During the process of encrypting your data, you will have to choose between deterministic and randomized encryption.
- Deterministic encryption always generates the same encrypted value for any given plain text value. Using deterministic encryption allows point lookups, equality joins, grouping and indexing on encrypted columns. However, it may also allow unauthorized users to guess information about encrypted values by examining patterns in the encrypted column, especially if there is a small set of possible encrypted values, such as True/False, or North/South/East/West region. Deterministic encryption must use a column collation with a binary2 sort order for character columns.
- Randomized encryption uses a method that encrypts data in a less predictable manner. Randomized encryption is more secure, but prevents searching, grouping, indexing, and joining on encrypted columns.
- Use deterministic encryption for columns that will be used as search or grouping parameters, for example a government ID number.
- Use randomized encryption for data such as confidential investigation comments, which are not grouped with other records and are not used to join tables.
From SSMS, right click the WikipediaLogs table and select "Encrypt Columns..."
The introduction window appears; click on Next
Choose the columns you want to encrypt and the encryption type. Click on the Next button
Depending on your security policies, you can store your key either in the Windows certificate store or in Azure Key Vault. Click on the Next button
On the Run Settings window, click on the Next button
On the Summary window, click on the Finish button
If all is ok, you should get a result like the one in the screenshot below
Now, if you query your table, you can see that the data in the Request column are encrypted
If you have to synchronize data bi-directionally across multiple SQL sources (on-premises or in the cloud), it's a good idea to read this article
It's possible to automate pipeline creation with PowerShell. Below is the sequence of cmdlets to create an Azure Data Factory pipeline with PowerShell (the order matters):
- Optional: create a resource group
- Data Factory creation: Set-AzDataFactoryV2
- Linked Services creation: Set-AzDataFactoryV2LinkedService
- Datasets creation: Set-AzDataFactoryV2Dataset
- Pipeline creation: Set-AzDataFactoryV2Pipeline
- Run the pipeline: Invoke-AzDataFactoryV2Pipeline
- Get details about the run: Get-AzDataFactoryV2ActivityRun
More details are available in this article
Azure Data Factory relies on Integration Runtime, a compute infrastructure, to provide several data integration capabilities like:
- Data Flow: Execute a Data Flow in managed Azure compute environment.
- Data movement: Copy data across data stores in public network and data stores in private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
- Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.
- SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.
There are several Integration Runtime types:
IR Type | Public network | Private network |
---|---|---|
Azure | Data Flow, Data movement, Activity dispatch | |
Self-hosted | Data movement, Activity dispatch | Data movement, Activity dispatch |
Azure-SSIS | SSIS package execution | SSIS package execution |
More information can be found in this article
Previously in this article, we invoked the Azure Databricks notebook via Azure Data Factory, but you can also run the notebook on an interactive cluster.
From your Azure Databricks's workspace, on the left, click on "Clusters" then on "Create Cluster"
Fill in the cluster creation form
You have to choose a cluster mode:
- Standard clusters are the default and can be used with Python, R, Scala, and SQL. Standard clusters are configured to automatically terminate after 120 minutes.
- High-concurrency clusters are tuned to provide the efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users. High concurrency clusters support only SQL, Python, and R languages. High concurrency clusters are configured to not terminate automatically.
Select the worker type and driver type you need
- Worker nodes run the Spark executors and the other services required for the proper functioning of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on the workers.
- The driver node maintains state information of all notebooks attached to the cluster. The driver node is also responsible for maintaining the SparkContext and interpreting all the commands you run from a notebook or a library on the cluster. The driver node also runs the Apache Spark master that coordinates with the Spark executors. You can choose a larger driver node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze it in the notebook.
By default, the driver node uses the same instance type as the worker node.
Note: keep in mind that Azure Databricks bills you both for the virtual machines (VMs) provisioned in clusters and for Databricks Units (DBUs), based on the VM instance selected. A DBU is a unit of processing capability per hour. More details in this article
A few minutes after clicking "Create Cluster", you should have a new cluster up and running
On the left, click on the "Recents" button and select the notebook
You need to attach the notebook to a cluster. Click on "Detached" and select a cluster
You can run the whole notebook or cell by cell.
If you want to run the whole notebook, click on the "Run All" button
If you want to run the notebook cell by cell, click in a cell and hit "Shift" + "Enter", or click on the arrow at the top left of the cell
Before running the notebook, you have to enter the parameters. (If the widgets are not available, run cmd4 by hitting "Shift" + "Enter".)
WARNING: in our case, to test the notebook on an interactive cluster, make sure you have files in the demo_datasets folder of your data lake, set the parameters according to the files in that folder, and run the notebook cell by cell starting from cell cmd4
I know this is a very long article, but I think it will help you build your knowledge of the Azure data platform and maximize your chances on the DP-200 exam (and maybe DP-201 ;)). I also recommend having a look at the following topics, which I don't cover in this article:
- Azure Cosmos DB (pay special attention to how to create collections and how to deal with partitioning)
- HDInsight (Storm, MapReduce and Spark)
- File types (Avro, Parquet, ORC)
- Real-time data (Stream Analytics, Event Hubs, Edge)
- SQL Data Warehouse distributed tables