Introduction to Azure SQL Data Warehouse

Azure SQL Data Warehouse

Single Database
Usually SQL data warehouse contains more than one database to distinguish each layer such as staging, ods, audit/error and data mart. But in ADW it is a single database which support all layers separated by schema instead of DB. To get maximum performance gain, whole data warehouse objects should contains with in a single database even though it supports to create multiple data warehouse databases in the same instance.

Cost Effective
ADW instance can be scale up and down on demand. This will help user to manage the cost because costing will be calculated based on DWU and service running time (rate per hour). When data manipulation is started user can scale up the service to upper tier which has higher computation power and maximum parallelism to complete the ELT/ETL process as mush as faster. Then scale down to usual tier with lower compute power or pause it if it is not required to be up and running once analytics cube  is processed. All these operation can be automated with in the Azure.

Massively Parallel Processing
Azure SQL Data Warehouse (so called ADW) is Azure PaaS service which is specifically designed for data warehousing solutions. This is not just a SQL database as Azure SQL PaaS. It supports MPP stands for Massively Parallel Processing. This is like a Apache hadoop file system. Having a control node and multiple compute nodes to process user query in parallel with distributed data  in 60 distributions per database. The query processing and concurrency is handled by resource classes and concurrency slots in each resources classes. The number of concurrent queries and concurrency slots are vary by the Data Warehouse Unit selected in the service.

Distribution
How data of a table need to be distributed in the data warehouse is defined when creating the table. Any way there are 60 parallel distributions as explained above. Data can be distributed among those 60 based on defined distribution method such as Hash, Round Robin and Replicate.
  • Hash Distribution
Distribute table data based on given hash key which is a column of the table. This columns can be any but to find optimal hash key for hash distribution, that column should have at least 60 distinct values, should not be the key column of the table, should use in many joins and relationships. Probably fact tables can be created as Hash distribution. There is a recommendation to distribute table as hash if it contains more than 2GB data in the table.
  • Round Robin
This is evenly data distributing for a table without considering a distribution key. This recommend for staging tables as this perform data load in to data warehouse faster than other methods. But there are some slowness with data selects.
  • Replicate
Replicate is recommended for small tables such as lookup, configurations and small dimension data tables which are heavily used in queries. Also this method distribute the entire row set in each distribution. Simply it is a copy of the table in each distribution. Then there is no data movements when query is executing as all the data available in the all the distributions.

Data Warehouse Unit
Unit of measurements to measure ADW computation/processing power or also known as ADW performance capacity. This defines the memory availability, size of database, number of compute nodes, number of distributions per compute nodes, number of concurrent queries, number of concurrences for each resource classes. This can be adjusted in the portal and it will take few minutes to get reflected the changes into ADW instance. You can read more about DWU and relevant capacity in below Microsoft docs.
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/memory-and-concurrency-limits

Workload Management
In ADW, some query like data transformation needs more processing power where as some query needs more parallelism like data loads into ADW. In order to do this in a optimal way you need to know about ADW workload management. This is not a simple task as this need more hands on and optimize based on the experience and following best practices to get maximum query performance from ADW. You can read more on this in below article.
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/resource-classes-for-workload-management

Concurrency
Concurrency is nothing but the number of parallel query execution for resource class based on selected DWU. If the prioritized query is running on ADW with more concurrency slots, the rest of low prioritized queries will wait in the queue to be executed. Concurrency slot is the allocated memory slot use to execute query based on logged in user assigned resource class.

Resource Class
Resource Class (RC) is the pre-defined cloud resource allocation for query execution on ADW. This is like a ADW role in terms of resources such as I/O, memory and processor. Resource class defines the number of concurrency slot to be allocated to query execution. There are two types of RC as dynamic and static.

In dynamic RC there are four classes as small, medium, large and extra large. By it name itself giving the definition for the RC. Dynamic RC change the allocated resources when ADW scale in out. But Static does not change it unless few exceptions in staticrc 50,60,70,80 RCs. When you create a user login and db user, you have to assign a resource class either dynamic or static. By default it is dynamic smallrc. When you login by the user and execute some query on ADW, that query will utilize ADW resources that allocated to user assigned RC.

Read Concurrency Maximums get broader idea of RC resource allocation.


Comments