This long article will help you prepare for the DP-200 exam. The first part covers all the steps to build an analytics solution from scratch. As a bonus, you will also get a cool demo to showcase to your customers.
The second part of this article provides some tips and tricks to help you pass the exam. I highly recommend that you “follow the white rabbit”.
To build our solution, we will use several services from the Azure data platform:
- Azure Data Lake Storage Gen2, for storage
- Azure Data Factory, for orchestrating the production pipeline
- Azure Databricks, for experimentation, innovation and transformations
- Azure SQL Database, for the data presentation layer
- Power BI Desktop, for data exploration and reporting
Of course we will add some infrastructure components as well:
- Azure Active Directory
- Azure Key Vault
Before we begin, let's do a quick review of Azure Data Factory and Azure Databricks.
Azure Data Factory v2 is a fully managed cloud service that enables the orchestration of data movement and transformation processes. Fully integrated with Azure, it can invoke other services during data movement, such as Azure Databricks or Logic Apps, to transform and enrich your data. Azure Data Factory v2 also integrates with Azure DevOps for continuous integration and continuous delivery. In addition, the ability to export pipelines as ARM templates makes it very easy to deploy them to other environments.
Based on Spark, Azure Databricks is a workspace that fosters collaboration between the company's stakeholders to bring innovation through continuous experimentation. Being integrated with Azure DevOps, Azure Databricks fits neatly into a CI/CD process, as illustrated in Benjamin Leroux's post. Azure Databricks enables collaboration around notebooks, which support several languages such as Python, Scala and SQL, and allows these notebooks to be run both manually and automatically, via "jobs" orchestrated by Azure Data Factory.
The notebook used for this example will do the following:
- Read files deposited by Azure Data Factory into the data lake (Azure Data Lake Storage Gen2)
- Process files to extract information
- Create an archive file in the data lake
- Insert the results into a SQL Database table for visualization
The notebook used for this example is available here or here
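To give you an idea of what the notebook does before you open it, here is a minimal PySpark sketch of the same flow. This is not the original notebook: the storage account, container, SQL server, login, secret scope ("KeyVaultScope"), secret name ("sql-password") and column names below are placeholders or assumptions on my part, and the ADLS Gen2 access configuration shown later in this article must already be in place.

# Minimal sketch of the notebook flow (spark and dbutils are provided by the Databricks runtime).
from pyspark.sql.functions import split, col, lit

year, month, day = "2019", "03", "16"          # in the real notebook these come from the base parameters
base = "abfss://<container>@<storage-account>.dfs.core.windows.net"   # placeholder data lake path

# 1 - Read the raw pageview files deposited by Azure Data Factory (gzip is handled transparently)
raw = spark.read.text(f"{base}/demo_datasets/")

# 2 - Extract the information (each line is "project page_title view_count bytes")
parts = split(col("value"), " ")
logs = (raw.select(parts.getItem(1).alias("Article"),
                   parts.getItem(0).alias("Language"),
                   parts.getItem(2).cast("int").alias("NbHit"))
           .withColumn("Year", lit(year))
           .withColumn("Month", lit(month))
           .withColumn("Day", lit(day)))

# 3 - Create an archive of the results in the data lake
logs.write.mode("overwrite").csv(f"{base}/wikipedia_results/")

# 4 - Insert the results into a SQL Database table for visualization
jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;database=<your-database>"
(logs.write.format("jdbc")
     .option("url", jdbc_url)
     .option("dbtable", "WikipediaLogs")
     .option("user", "<sql-user>")
     .option("password", dbutils.secrets.get(scope="KeyVaultScope", key="sql-password"))  # assumed secret
     .mode("append")
     .save())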
In modern data platform architectures, it is important to separate the storage layer from the data processing engine, and that is exactly what we will do in this example. We will organize our data lake into several zones: one to preserve the original state of the data, one staging (temporary) area, and one to store the end result of the transformations, as shown in the illustration below.
By the end of this article we will have implemented the following architecture. Notice the separation of storage from the compute engine.
A resource group is a logical grouping of Azure services. For this article, we will group all the services used to build our solution in one resource group. Nothing prevents you from organizing your Azure services in a different way.
From the Azure portal, on the left, click on "Resource groups" then on the "Add" button.
Complete the creation form and click on "Review + Create". For this article, we will name the resource group "Paisley-Park".
In the previously created resource group, we will add a storage service. For this example we will choose the new service "Data Lake Store Gen 2".
From the Azure portal, click on "Create a resource", "Storage" and then "Storage account"
Then fill out the creation form. Here, I decide to create my data lake in the "Paisley-Park" resource group. For this article, I will name my data lake "thevaultgen2".
Then click on the "Advanced" tab, then in the "Data Lake Storage gen2" section, click on "Enabled". This will enable new features related to the new version of our data lake.
If necessary you can add tags by clicking on the tab "Tags". Tags are useful for administering and tracking the costs of your Azure services.
When done, click on the "Review + create" button
Then on the "Create" button
From the Azure portal, click on "Create a resource", "Database", then "SQL Database"
Fill out the database creation form and click on "Review + Create". For this article, I will call the database "Musicology", and I will keep the pricing tier at Basic. Nothing prevents us from changing this performance level later according to our needs (the principle of elasticity). This change can also be done programmatically with the SERVICE_OBJECTIVE argument, a point we will come back to later in the article.
On the "Review + create" page, click on the "Create" button
Once the database is created, consider configuring the server firewall if you want to connect with SQL Server Management Studio from a remote machine. Refer to the documentation.
From the portal, click on "Create a resource", "Integration" and then "Data Factory"
Fill out the creation form and click on the "Create" button
From the Azure portal, click "Create a resource", "Analytics" and "Azure Databricks"
Fill out the creation form. We will choose the Premium tier for the integration with Azure Key Vault. The Premium tier also supports role-based access control; refer to the pricing tiers.
As recommended in the Databricks documentation: "Your account must have the Azure Databricks Premium Plan for you to be able to select Creator. This is the recommended approach: grant MANAGE permission to the Creator when you create the secret scope, and then assign more granular access permissions after you have tested the scope. For an example workflow, see Secret Workflow Example."
For this article, I will name my Azure Databricks workspace "CrystalBall". Click on the "Create" button
From the Azure portal, click on "Create a resource", then in the search bar enter "Key Vault"
Choose "Key Vault" from the Marketplace and click on the "Create" button.
Fill out the creation form and click on the "Create" button
If you have placed all your resources in the same resource group, you will see a result similar to the screenshot below:
## Creation of an application and a service principal
In this section of the article, we will create the security elements that allow Azure Databricks to access our Data Lake gen2. The general procedure can be found here.
In our Azure Active Directory, we will create and register an application.
From the Azure portal, click on "Azure Active Directory" and then on "App registration".
Click on the "New registration" button
Fill out the registration form and click on the "register" button
The application is now created. You should see a screen similar to the one below.
**WARNING!** Note your "Application (client) ID" as well as your "Directory (tenant) ID" (which I have hidden in my screenshot). You will need them later.
Click on "Certificates & secrets" and then on "New client secret". This client secret is also referred to in this article as the "authentication-id".
Fill out the creation form and click on the "Add" button
Copy the previously generated secret by clicking on the copy icon
Return to your resource group and select your storage account
Then click on "Access Control (IAM)" (Actually, we are in the process to define access management for Cloud resources. To manage access control, we use what we called Role-based access control (RBAC). This article will give you a good overview of what RBAC is. )
In the "Add a role assignment" box, click on the "Add" button
In the "Add role assignment" window, select the following items:
- Role: Storage Blob Data Contributor
- Select: The name of the application created previously. Here, "FranmerTheVaultGen2".
Then click on the "Save" button
In order for Azure Databricks to have access to Data Lake Gen2, we will need the following information:
- storage account name
- Application-id
- authentication-id
- tenant-id
As data engineers and data scientists work in notebooks, the best practice is to avoid having secrets appear in clear text in the code. This is where Azure Key Vault comes into play to protect passwords and logins. For this example we will store only the different IDs.
Click on your Azure Key Vault:
Click on "Secrets" then on the "Generate / Import" button
We will create the secrets for our 3 IDs. Below is an example for "application-id".
Click on the "Create" button. Repeat these steps for the other 2 IDs.
After creating your 3 secrets, you should have a result like the screenshot below:
A little further down in this article, we'll add a secret for the integration between Azure Data Factory and Azure Databricks
We will now configure Azure Databricks so that it can use the secrets stored in Azure Key Vault. For information, the complete documentation is here.
Return to your resource group and click on your Azure Databricks service:
Click on the button "Launch Workspace"
Once in your workspace, note the URL of your Azure Databricks:
Then compose a new URL by adding "#secrets/createScope" (note the capital "S"!). In my case the URL will be: ("https://eastus2.azuredatabricks.net#secrets/createScope")
The following window opens; this is where we will create references to the secrets stored in our Azure Key Vault
To do this, we will need the following information from our Azure Key Vault:
- DNS Name
- Resource ID
This information can be found in the properties of Azure Key Vault (see screenshot below)
Once you have this information, create the "Secret Scope" as shown below. After entering all the required information, click on the "Create" button
If all goes well, you should see the following message. Click on the "OK" button to validate.
In this next section, we will download the notebook (DatabricksNotebook_Wikipedia_ADLSGen2_Generic.dbc) which is at the following address: https://1drv.ms/u/s!Am-C-ktMH9lgg9MCqqG5dS8XKFKxDA or https://github.com/franmer2/HelpForDP200/tree/master/resources.
Note that Databricks notebooks have the DBC file extension.
You will then have to modify the notebook to add your own information: the application ID, the data lake, and your SQL Database.
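For reference, the part of the notebook you will adapt is the ADLS Gen2 access configuration, which typically looks like the sketch below. The secret scope name ("KeyVaultScope") is an assumption; the secret names are the ones created earlier in this article.

# Sketch of the ADLS Gen2 access configuration with a service principal (OAuth).
application_id = dbutils.secrets.get(scope="KeyVaultScope", key="application-id")
authentication_id = dbutils.secrets.get(scope="KeyVaultScope", key="authentication-id")
tenant_id = dbutils.secrets.get(scope="KeyVaultScope", key="tenant-id")

storage_account = "thevaultgen2"   # the storage account created earlier

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", application_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", authentication_id)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")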
From your Azure Databricks workspace, navigate to where you want to import the notebook. Click on the down arrow to bring up the context menu and click on "Import".
Check the "File" box, enter the path to the notebook and click on the "Import" button
If all goes well, the notebook should be in your workspace as shown below
========================================
Later in this article, we will see how to use this notebook with an interactive cluster. It is a good idea to test the notebook this way before running it in production through Azure Data Factory, especially if you changed the names of the Azure resources, the data lake zones/layout, or the Key Vault secrets.
========================================
We will create the access tokens for Power BI and Azure Data Factory. At the top right of your workspace, click on the user icon and then on "User Settings"
Click on the "Generate New Token" button
Give a description to the Token, then indicate its lifetime. Click on the "Generate" button.
WARNING! Copy your token somewhere safe, because after pressing the "Done" button it will no longer be accessible.
Repeat the operation to create a token for Power BI. You should have a result similar to the screenshot below:
Note: the token for Azure Databricks needs to be added to Azure Key Vault. We will add it as the secret "Databricks"
=============================================================
Below is a screenshot after adding the "Databricks" secret in Azure Key Vault:
=============================================================
We will now organize our data lake, with an area for the raw data from Wikipedia and an area for the results of the analyses
The "demo_datasets" zone will be created manually. The "wikipedia_results" zone will be created automatically by Azure Databricks. To interact with Azure Data Lake Storage Gen2, you can use Azure Storage Explorer. From Azure Storage Explorer, sign in to your Azure account, find your data lake, right-click on it, then create a container named "wikipedia".
You should obtain a result similar to the screenshot below:
OPTIONAL (Azure Data Factory can create the folder automatically if it is not already present). In your "wikipedia" container, create the directory "demo_datasets".
Click on the "New Folder" button
You should obtain the result below:
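If you prefer to script this step rather than use Azure Storage Explorer, below is a minimal sketch using the azure-storage-file-datalake Python package (my choice of tool here; it is not used elsewhere in this article). Authenticating with the storage account key keeps the example short; in real life, keep the key in Azure Key Vault.

# Sketch: create the "wikipedia" container and the "demo_datasets" folder programmatically.
# pip install azure-storage-file-datalake
from azure.storage.filedatalake import DataLakeServiceClient

account_name = "thevaultgen2"
account_key = "<storage-account-key>"   # placeholder

service = DataLakeServiceClient(
    account_url=f"https://{account_name}.dfs.core.windows.net",
    credential=account_key,
)

file_system = service.create_file_system(file_system="wikipedia")
file_system.create_directory("demo_datasets")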
The pipeline is quite simple for this example and consists of the following sequence of activities, driven by input parameters (parameterization being, by the way, one of the elements of DataOps):
- Test the presence of the "demo_datasets" folder in the storage account
- Delete files from the raw area (demo_datasets) of the data lake
- Download log files (depending on user settings)
- Invoke an Azure Databricks notebook to transform the data and write the results to another area of the data lake and to the SQL Database.
In the end, we will get a pipeline like the one shown below:
From the Azure portal, connect to your Azure Data Factory service
Click on "Author & Monitor"
(If you are using an MSDN subscription and are facing an infinite white screen, try opening your browser in private mode.)
We will create 4 Linked Services for:
- Azure Data Lake storage Gen2
- The Wikipedia log site
- Azure Key Vault
- Azure Databricks
In this section of the article, we will create 2 Linked Services. The other 2 will be created during dataset creation, to illustrate another way to create them.
We will start by creating a Linked Service to the "Azure Key Vault" service.
From the authoring interface, click on the pencil on the left, then on "Connections"
Click on the "New" button
Search and select "Azure Key Vault". Click on the "Continue" button
Enter information about your Azure Key Vault service. Click on "Test connection" then on "Finish"
=============================================================
WARNING! You may have noticed the "Edit Key vault" message just below the "Azure key vault name" field. You need to add the Data Factory's managed identity (the "Service identity application" ID) to the "Access policies" section of the Azure Key Vault service.
In your Azure Key Vault service, click "Access policies" and then the "Add new" button to search for the service (using the Service identity application ID) and give it access rights. The "Get" right on "Secret permissions" is sufficient
=============================================================
After creating your linked service, you should get a result similar to the one shown below:
Click on the "Publish all" button
We will repeat the operation to create the Linked Service to Azure Databricks
Fill in the login information for Azure Databricks. Use the secret we created earlier in Azure Key Vault
To illustrate another way of creating linked services, we will create the other 2 at the same time as the datasets, in the next section.
We will now create 3 "datasets":
- 1 to retrieve Wikipedia logs (http), also with parameters
- 1 for the data lake
- 1 for the same data lake but with parameters
The data type for all these files is binary.
Click on the "+" sign, then on "Dataset"
In the "New Dataset" section, look for "http", then click on the "http" icon and then on the "Finish" button
In the "General" tab, define a name for your dataset.
Then click on "Parameters".Click on the "New" button each time to add a parameter
Enter the following parameters in "String" format:
- YearDS
- MonthDS
- DayDS
- HourDS
Click on the "Connection" tab
Here we will see how to create a Linked Service from the creation phase of the dataset. In the "Connection" section click on "New" to create a linked service to the Wikipedia logs site
Give a name to your linked service. In the field, base URL, enter the following URL: https://dumps.wikimedia.org/other/pageviews/.
Click on "Disable" and select "Anonymous" in the "Authentication type" fieldTest your connection and click on the "Finish" button
Once the linked service is created, you return to creating your dataset.
Click in the field "Relative URL" then on "Add dynamic content"
Copy the string below into the "Add Dynamic Content" field and click on "Finish" (if you get an error after the copy/paste, double-check the single quotes):
@concat(dataset().YearDS,'/',dataset().YearDS,'-',dataset().MonthDS,'/pageviews-',dataset().YearDS,dataset().MonthDS,dataset().DayDS,if(less(int(dataset().HourDS),10),'-0','-'),dataset().HourDS,'0000.gz')
Then check the "Binary copy" box
Click on the "Publish All" button
Now, let's create a second Dataset for our data lake. Repeat the creation procedure as previously seen, then select "Azure Data Lake Storage Gen2"
Give a name to this new dataset.
Click on "Connection" and then on the "New" button to create a linked service to our data lake.
In the "New Linked Service" window, select your storage account from your Azure subscription:
- The "Tenant" field should automatically be filled.
- The "Service Principal ID" is the one we created at the beginning of the article
- Select "Azure Key Vault"
- Select your "Azure Key Vault" service
- Indicate the name of the secret that protects your service principal's key (in my example: "authentication-id")
Click on the "Finish" button
Now that the linked service to our data lake has been created, we will continue configuring our dataset. In the "File path" field, we will enter the path "wikipedia/demo_datasets".
Check the "Binary copy" box
We will repeat the following procedure to create a dataset on the same data lake, but by adding parameters. We will use the same linked service. On the second dataset of the data lake add the following parameters:
- YearDS
- MonthDS
- DayDS
- HourDS
Then in the "Connection" tab, define the path as shown below by adding the following expression in the "File" field:
@concat('pageviews-',dataset().YearDS,dataset().MonthDS,dataset().DayDS,if(less(int(dataset().HourDS),10),'-0','-'),dataset().HourDS,'0000.gz')
At the top of the screen, click on the "Publish all" button.
Click on the "+" sign and then on "Pipeline"
In order to make our pipeline more agile, we will add parameters, which will allow us, for example, to process the logs according to the date and times that interest us.
Click in a free space in the Pipeline Editor, then click on "Parameters" then on "New" (click 4 times on "New" to add 4 parameters)
Add the following parameters:
Name | Type |
---|---|
Year | String |
Month | String |
Day | String |
Hours | Array |
In the example, I added some default values (notably for the Array parameter), but these default values are not required.
As the "Delete" activity did not yet exist in the list of available activities, we will create it in code. (Edit: the Delete activity is now available through the UI, but you can still do it via the code, just for practice.)
Click on the "Code" button to display the code editor
Copy and paste the script below into the editor. Remember to change the name of the dataset (in my example it is "AzureDataLakeStorageFile1", the data lake dataset without parameters).
Click on the "Finish" button
{
"name": "pipeline1",
"properties": {
"activities": [
{
"name": "Delete Activity",
"type": "Delete",
"policy": {
"timeout": "7.00:00:00",
"retry": 0,
"retryIntervalInSeconds": 30,
"secureOutput": false,
"secureInput": false
},
"typeProperties": {
"dataset": {
"referenceName": "<Your Dataset Name>",
"type": "DatasetReference"
},
"recursive": true,
"enableLogging": false,
"maxConcurrentConnections": 1
}
}
]
}
}
You should see a new activity in your pipeline:
The screenshot below shows how to create the "Get Metadata" activity.
Add the "Exists" argument; it will be tested by another activity.
Drag and drop the "If Condition" activity from the left menu to the pipeline. Add a link between the "Get Metadata" activity with the "If Condition" activity.
Click on the "If Condition" activity. In "settings" tab, clic on "Add dynamic content"
Add the following content:
@activity('Get Metadata1').output.Exists
Right click on the "Delete" activity and click on "Copy"
Click again on the "If Condition" activity and then on the "Activities" tab.Then click on the "Add if True Activity" button
Right click in the central part to copy the activity. Click on "Paste"
The "Delete" activity is copied to your "If condition" activity.
Click on the "Pipeline1" (or another name if you renamed your pipeline) link to return to the top level of your pipeline and remove the "Delete1" activity.
We will now create a loop to download the files corresponding to the number of hours that we want to recover. This number will be defined via the parameters.
From the list of activities, on the left, drag and drop the activity "ForEach" in the editor. Then link the "If condition" activity to the "ForEach" activity.
Click on the "ForEach" activity and then "Settings". Click in the "Items" field then click on "Add dynamic content"
Add the following content:
@pipeline().parameters.Hours
Then click on the "Finish" button
The "Hours" parameter is now available in the "Item" section of the "ForEach" activity. This loop will now be repeated according to the number of hours that will be defined in the "Hours" parameter.
Double-click on the "ForEach" activity to add activities inside the loop. Add the "Copy Data" activity inside the "ForEach" loop
Click on "Source" and select the Dataset "Logs_Wikipedia". Pass the settings as shown in the screenshot below.
- YearDS : @pipeline().parameters.Year
- MonthDS : @pipeline().parameters.Month
- DayDS : @pipeline().parameters.Day
- HourDS : @item()
(For the hours, the value actually comes from the "ForEach" activity's current item.)
Click on "Sink" tab. In the "Sink Dataset" field, choose the data lake dataset with the parameters.
Then fill in the parameters with the values as shown in the screenshot below.
Once the changes are complete, click on the "Publish All" button
Now we are going to add the Databricks Notebook activity, as shown below
Once the activity has been added, link it to the output of the "ForEach" activity
Click on the "Notebook" activity, then on "Azure Databricks". Choose the linked service created previously and click on "Test connection":
Then click on "Settings" and then on the "Browse" button
Then click on "Base Parameters" to create the parameters that will be passed to your Azure Databricks Notebook.
Use the following names exactly; if you want to change the names of the base parameters, you will have to edit the notebook accordingly (a sketch of how the notebook can read them follows the table):
Base parameter name |
---|
yearWiki |
monthWiki |
dayWiki |
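Inside the notebook, these base parameters are exposed as Databricks widgets. Below is a minimal sketch of how a notebook typically reads them; the default values are arbitrary and only the widget names matter (they must match the table above).

# Sketch: reading the base parameters passed by Azure Data Factory through widgets.
dbutils.widgets.text("yearWiki", "2019")
dbutils.widgets.text("monthWiki", "03")
dbutils.widgets.text("dayWiki", "16")

year = dbutils.widgets.get("yearWiki")
month = dbutils.widgets.get("monthWiki")
day = dbutils.widgets.get("dayWiki")

print(f"Processing Wikipedia pageviews for {year}-{month}-{day}")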
Once the settings are complete, click on "Publish All"
Now we will test our pipeline. Click on the "Debug" button:
The parameter page opens. Fill in the fields with consistent values and click on the "Finish" button:
Execution can be monitored in the "Output" part of the pipeline:
On the Azure Databricks side, the notebook is run via a job
Hopefully, on the Azure Data Factory side, your pipeline should show "Succeeded" in the "Status" column
We will now use Power BI Desktop to connect to our SQL Database. At the time of writing this article (17/03/2019), the Power BI connector for Azure Data Lake Storage Gen2 is not yet available. From Power BI Desktop, click the "Get Data" button and select "SQL Server"
Fill in the information from your database server, which can be found on the Azure portal as shown below:
Choose whether you want to import the data or use DirectQuery mode. Then click on the "Ok" button.
Choose how you want to connect to your database. In this example, I chose a connection with a "Database" account.
Click on the "Connect" button
Select the table "WikipediaLogs", then click on the "Load" button
The fields will appear on the right side of Power BI Desktop.
You now have all the tools to create a nice report on the results of the Wikipedia log analysis. Here is an example of a report:
It would also be good to have the report updated automatically, with a daily refresh of the Wikipedia data. We can achieve that in 2 steps:
- Schedule a daily Azure Data Factory pipeline execution
- Schedule a daily automatic refresh of the Power BI dataset
Wouldn't it be nice to have our pipeline run every day to feed our database and keep the report up to date? Azure Data Factory lets you define triggers that run pipelines according to rules and schedules you define.
Go back to Azure Data Factory and jump into your pipeline.
Click on "Add trigger" and "New/Edit"
A new pane appears on the right. Click on the little arrow to open the drop-down list and click on "+New"
Define the trigger parameters to fire once a day, as shown in the screenshot below. Then click on the "Next" button:
In the Edit Trigger pane, enter the parameters as shown below and click "Finish"
Parameter | Value |
---|---|
Year | @formatDateTime(utcnow(),'yyyy') |
Month | @formatDateTime(utcnow(),'MM') |
Day | @adddays(utcnow(),-1,'dd') |
Hours | [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23] |
(The pipeline may fail at the beginning of each month: Year and Month are taken from the current date while Day is taken from the previous day, so on the first day of a month the computed file name does not exist. One possible fix, not tested here, would be to derive Year and Month from adddays(utcnow(),-1) as well, for example @formatDateTime(adddays(utcnow(),-1),'yyyy'). There is also the delay between log availability on the Wikipedia site and UTC time to take into account.)
Now, your trigger is created and can be monitored in the Azure Data Factory "Monitor" section
If you need more analytics capability, you can also use the "Azure Data Factory Analytics" feature (still in preview at the time of writing, July 29, 2019)
Here is a sample of my pipeline being monitored with Log Analytics
If I click on the error indicator, I can see clearly when and where the error occurs
I highly recommend having a look at this article.
Concerning our Data Lake Gen2 storage, we will secure it by using the Advanced Threat Protection feature.
From the Azure portal, click on your Data Lake Gen2 resource and then click on "Advanced security". Then click on "Enable Advanced Threat Protection"
A new window appears and will give you recommendations and security alerts. You can also get all recommendations by clicking on the "View all recommendations in Security Center" button.
To secure your SQL database, you can use the features listed below:
- Always Encrypted (we will see this feature later in the article)
- Dynamic Data Masking
- Row Level Security
- Advanced Data Security
This article will give you a good overview of what Dynamic Data Masking is.
You can setup dynamic data masking through Azure portal. From your SQL Database resource, click on "Dynamic Data Masking"
Click on "Add mask" button
Choose a column and define a mask and click on the "Add" button
The new field is ready to be masked. To validate, click on the "Save" button.
You can also set up dynamic data masking with a T-SQL script. More info in this article
You can also set up dynamic data masking with PowerShell or the REST API. More details in this article
Whichever option you pick to implement masking, it is very important to understand the distinctions between the different masking functions, as explained in this article
Row-Level Security enables you to use group membership or execution context to control access to rows in a database table.
This article will give you more details on this feature
You can try that article's examples against the SQL database created earlier in this walkthrough.
Through the Azure portal, you can leverage Advanced Data Security capabilities, at the database level, for:
- Data Discovery and Classification
- Vulnerability Assessment
- Advanced Threat Protection
At the Azure SQL server level, you can also define periodic recurring scans and get the report by email
Below is a screenshot of an email I receive each week:
You can activate the auditing feature. You can choose the log destination between:
- Storage account
- Log Analytics
- Event Hub (for real time log analysis scenario, for instance)
If you choose Log Analytics, below is a sample of what you can get via Azure Monitor
The solution we created is based on a high-level pattern that can be adapted to the project requirements. For example, what if we had to process massive amounts of data in our SQL database? To help you with DP-200, let's change the architecture a little to introduce Azure SQL Data Warehouse with PolyBase. Pay attention to the sentences with the white rabbit; my advice would be "Follow the white rabbit" ;)
[Azure SQL Data Warehouse](https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-what-is) (our Massively Parallel Processing (MPP) enterprise data warehouse) has a cool concept called PolyBase. PolyBase allows you to create external tables that hold only the schema, while the data stays in its original storage. In our case, we will create an Azure SQL Data Warehouse service and create an external table with the Wikipedia data schema that points to our data lake (to the results zone).
From the Azure portal, click on the plus sign, then "Databases", then "SQL Data Warehouse"
Fill out the form to create the data warehouse. Choose the SQL server you created previously and set the performance level to the minimum; it will be enough for our test. Click on the "Review + create" button.
It should take a few minutes. When done, click on the "Go to resource" button.
Now, we will configure PolyBase to access the data stored in the data lake
You can connect to your data warehouse with several tools, such as:
- SQL Server Management Studio (SSMS)
- Visual Studio
- Azure Data Studio
In our case, we will use SSMS. The connection window is shown below. (If you can't connect to your server, check the Azure SQL server firewall settings.)
In the server name field, enter the name of the SQL server on which you deployed your data warehouse. In our case, the server is called "fallinlove2nite.database.windows.net". For authentication, use the credentials you created during the server creation step above.
===================================================================
You can get the server name from the portal, in the SQL data warehouse resource "Overview" screen
===================================================================
Below is the sequence of steps to configure PolyBase:
- Optional: Create a user for loading data
- Create a Master Key
- Create a database scoped credential
- Create an external data source
- Create External File Format
- Optional: Create schema for external table
- Create external table
- Optional: Create statistics
All the steps above are well detailed in this article
If not yet created in your SQL Server, create a new master key
CREATE MASTER KEY;
At the time of writing this article (July 2019), PolyBase can connect to Data Lake Storage Gen2 only with a storage account key.
You can use whatever you want for IDENTITY in the script below:
CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = 'user',
    SECRET = '<azure_storage_account_key>';
Now we need to create an external data source pointing to our Data Lake Gen2. Below is generic code to connect to ADLS Gen2:
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    -- Please note the abfss endpoint for when your account has secure transfer enabled
    LOCATION = 'abfs[s]://<container>@<AzureDataLake account_name>.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);
In our case, the T-SQL script looks like:
CREATE EXTERNAL DATA SOURCE AzureDataLakeStorage
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://wikipedia@thevaultgen2.dfs.core.windows.net',
    CREDENTIAL = ADLSCredential
);
To be able to query our data, we also need to specify the external file format:
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        STRING_DELIMITER = '',
        DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
        USE_TYPE_DEFAULT = FALSE
    )
);
In our case:
CREATE EXTERNAL FILE FORMAT TextFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',')
);
Now we will create the external table to query our data stored in the data lake. In our case this is the T-SQL script:
CREATE EXTERNAL TABLE [dbo].[Wikipedia_external] (
    [Article] [nvarchar](500) NULL,
    [Language] [nvarchar](10) NULL,
    [Nb Hit] [int] NULL,
    [Year] [nvarchar](10) NULL,
    [Month] [nvarchar](10) NULL,
    [Day] [nvarchar](10) NULL
)
WITH (
    LOCATION = '/wikipedia_results/',
    DATA_SOURCE = AzureDataLakeStorage,
    FILE_FORMAT = TextFileFormat,
    REJECT_TYPE = Percentage,
    REJECT_VALUE = 90,
    REJECT_SAMPLE_VALUE = 200
);
Now you can query your external table. Below is a very simple query that gets the results directly from our data lake:
SELECT * FROM [Wikipedia_external];
To round out your PolyBase skills, I highly recommend reading this article
It is a good idea to monitor your Azure SQL Data Warehouse, especially its cache usage. For example, you can track the "DWU used" metric to know whether your data warehouse needs more compute capacity.
This is an article I highly recommend reading.
And find room in your brain for the table in that article covering cache hit percentage and cache used percentage, and the recommended action for each combination.
There are several ways to back up SQL data, but 2 of them deserve our attention: BACPAC and DACPAC. Below is a quick description of each.
- A BACPAC is a Windows file with a .bacpac extension that encapsulates a database's schema and data. The primary use case for a BACPAC is to move a database from one server to another, to migrate a database from a local server to the cloud, or to archive an existing database in an open format.
- A DACPAC is focused on capturing and deploying schema, including upgrading an existing database. The primary use case for a DACPAC is to deploy a tightly defined schema to development, test, and then production environments, and also the reverse: capturing production's schema and applying it back to test and development environments. In other words, a DACPAC only contains the schema, not the data.
You may need to get information about your SQL database edition and service objective.
To get this information, you can use the following T-SQL script in SSMS:
SELECT DATABASEPROPERTYEX('Musicology', 'Edition') AS Edition,
       DATABASEPROPERTYEX('Musicology', 'ServiceObjective') AS ServiceObjective;
You may also need to modify the service objective via T-SQL; in that case, use the SERVICE_OBJECTIVE argument as in the script below:
ALTER DATABASE [Musicology] MODIFY(EDITION = 'standard', MAXSIZE = 100 MB, SERVICE_OBJECTIVE = 'S0');
(You may have to wait a few minutes for the change to take effect.)
If you run the first T-SQL script again, you will see the new edition and service objective
This article gives a good definition of what Always Encrypted is:
"Always Encrypted is a feature designed to protect sensitive data, such as credit card numbers or national identification numbers (for example, U.S. social security numbers), stored in Azure SQL Database or SQL Server databases. Always Encrypted allows clients to encrypt sensitive data inside client applications and never reveal the encryption keys to the Database Engine ( SQL Database or SQL Server). As a result, Always Encrypted provides a separation between those who own the data (and can view it) and those who manage the data (but should have no access). By ensuring on-premises database administrators, cloud database operators, or other high-privileged, but unauthorized users, cannot access the encrypted data, Always Encrypted enables customers to confidently store sensitive data outside of their direct control."
During the process of encrypting your data, you will have to choose between deterministic and randomized encryption.
- Deterministic encryption always generates the same encrypted value for any given plain text value. Using deterministic encryption allows point lookups, equality joins, grouping and indexing on encrypted columns. However, it may also allow unauthorized users to guess information about encrypted values by examining patterns in the encrypted column, especially if there is a small set of possible encrypted values, such as True/False, or North/South/East/West region. Deterministic encryption must use a column collation with a binary2 sort order for character columns.
- Randomized encryption uses a method that encrypts data in a less predictable manner. Randomized encryption is more secure, but prevents searching, grouping, indexing, and joining on encrypted columns.
- Use deterministic encryption for columns that will be used as search or grouping parameters, for example a government ID number.
- Use randomized encryption for data such as confidential investigation comments, which are not grouped with other records and are not used to join tables.
From SSMS, right click the WikipediaLogs table and select "Encrypt Columns..."
The introduction window appears; click on Next
Choose the columns you want to encrypt and the encryption type. Click on the Next button
Depending on your security policies, you can store your key either in the Windows certificate store or in Azure Key Vault. Click on the Next button
On the Run Settings window, click on the Next button
On the Summary window, click on the Finish button
If all is ok, you should get a result like the one in the screenshot below
Now, if you query your table, you can see that the data in the Request column are encrypted
If you have to synchronize data bi-directionally across multiple SQL sources (on-premises or in the cloud), it's a good idea to read this article
It's possible to automate pipeline creation with PowerShell. Below is the sequence of cmdlets to create an Azure Data Factory pipeline with PowerShell (the order matters):
- Optional: create a resource group
- Data Factory creation: Set-AzDataFactoryV2
- Linked Services creation: Set-AzDataFactoryV2LinkedService
- Datasets creation: Set-AzDataFactoryV2Dataset
- Pipeline creation: Set-AzDataFactoryV2Pipeline
- Run the pipeline: Invoke-AzDataFactoryV2Pipeline
- Get details about the run: Get-AzDataFactoryV2ActivityRun
More details are available in this article
Azure Data Factory relies on Integration Runtime, a compute infrastructure, to provide several data integration capabilities like:
- Data Flow: Execute a Data Flow in managed Azure compute environment.
- Data movement: Copy data across data stores in public network and data stores in private network (on-premises or virtual private network). It provides support for built-in connectors, format conversion, column mapping, and performant and scalable data transfer.
- Activity dispatch: Dispatch and monitor transformation activities running on a variety of compute services such as Azure Databricks, Azure HDInsight, Azure Machine Learning, Azure SQL Database, SQL Server, and more.
- SSIS package execution: Natively execute SQL Server Integration Services (SSIS) packages in a managed Azure compute environment.
There are several Integration Runtime types:
IR Type | Public network | Private network |
---|---|---|
Azure | Data Flow, Data movement, Activity dispatch | |
Self-hosted | Data movement, Activity dispatch | Data movement, Activity dispatch |
Azure-SSIS | SSIS package execution | SSIS package execution |
More information can be found in this article
Previously in this article, we invoked the Azure Databricks notebook via Azure Data Factory, but you can also run the notebook on an interactive cluster.
From your Azure Databricks's workspace, on the left, click on "Clusters" then on "Create Cluster"
Fill in the cluster creation form
You have to choose a cluster mode:
- Standard clusters are the default and can be used with Python, R, Scala, and SQL. Standard clusters are configured to automatically terminate after 120 minutes.
- High-concurrency clusters are tuned to provide the efficient resource utilization, isolation, security, and the best performance for sharing by multiple concurrently active users. High concurrency clusters support only SQL, Python, and R languages. High concurrency clusters are configured to not terminate automatically.
Select the worker type and driver type you need
- Worker nodes run the Spark executors and the other services required for the proper functioning of the clusters. When you distribute your workload with Spark, all of the distributed processing happens on the workers.
- The driver node maintains state information of all notebooks attached to the cluster. The driver node is also responsible for maintaining the SparkContext and interpreting all the commands you run from a notebook or a library on the cluster. The driver node also runs the Apache Spark master that coordinates with the Spark executors. You can choose a larger driver node type with more memory if you are planning to collect() a lot of data from Spark workers and analyze it in the notebook.
By default, the driver node uses the same instance type as the worker node.
Note: keep in mind that Azure Databricks bills you both for the virtual machines (VMs) provisioned in clusters and for Databricks Units (DBUs), based on the VM instance selected. A DBU is a unit of processing capability per hour. More details in this article
A few minutes after clicking "Create Cluster", you should have a new cluster up and running
On the left, click on the "Recents" button and select the notebook
You need to attach the notebook to a cluster. Click on "Detached" and select a cluster
You can run the whole notebook or cell by cell.
If you want to run the whole notebook, click on the "Run All" button
If you want to run the notebook cell by cell, click in a cell and hit "Shift" + "Enter", or click on the arrow at the top left of the cell
Before running the notebook, you have to enter the parameters. (If the widgets are not available, run cmd4 by hitting "Shift" + "Enter".)
WARNING: in our case, to test the notebook on an interactive cluster, make sure you have files in the demo_datasets folder of your data lake, set the parameters according to the files in that folder, and run the notebook cell by cell starting from cell cmd4
I know this is a very long article, but I think it will help you build your knowledge of the Azure data platform and maximize your chances on the DP-200 exam (and maybe DP-201 ;)). I also recommend having a look at the following topics, which I don't cover in this article:
- Azure Cosmos DB (pay special attention to how to create collections and how to deal with partitioning)
- HDInsight (Storm, MapReduce and Spark)
- File types (Avro, Parquet, ORC)
- Real-time data (Stream Analytics, Event Hubs, Edge)
- SQL Data Warehouse distributed tables