Data Warehouse Hands On

This blog will discuss about the essentials of Azure Data Warehouse database queries. First of all you have to create a ADW database in Azure which I already posted in Creating Azure Data Warehouse post. Then you can connect to the ADW database using the user credentials given when you create the DB. But you will end up with connectivity issue as given below.

Change Firewall settings
Connecting to ADW instance using SSMS

Create a user - first you have to create a login and then create a user for that login as given below

--Select master db to create a login

CREATE LOGIN <yourusername> WITH PASSWORD = 'Password';

--Select ADW db to create the user for the login

CREATE USER <yourusername> FOR LOGIN <yourusername>;

Change user RC - once you have user created you can change resource class that user assigned and the db role as well. Database role is something for authorization but RC is something classified how much resource such as memory, compute power is assigned for the user.

--To assign RC
EXEC sp_addrolemember 'xlargerc', '<yourusername>';

--To assign db role
EXEC sp_addrolemember 'db_owner', '<yourusername>';

--Check user
SELECT *
FROM sys.database_principals
WHERE name LIKE '%yourusername%'
--AND type_desc = 'db_owner';

Create table and insert sample data into the table

CREATE TABLE dbo.MyTable_8000264

(

Id INT IDENTITY (0,1)

,FirstName VARCHAR(100)

,LastName VARCHAR(100)

,DOB DATE

)

--Try below insert query

INSERT INTO dbo.MyTable_8000264

VALUES ('Jhon','Smith','1990-10-19')
,('Kane','Loyd','1991-02-14')
,('Julia','Robertson','1989-04-22')
,('Ane','Jakob','1995-03-04')
,('Ane','Peterson','1993-05-08')

you will see that you cannot records like that but, try below

--Try below insert query
INSERT INTO dbo.MyTable_8000264
VALUES ('Jhon','Smith','1990-10-19')
GO
INSERT INTO dbo.MyTable_8000264
VALUES ('Kane','Loyd','1991-02-14')
GO
INSERT INTO dbo.MyTable_8000264
VALUES ('Julia','Robertson','1989-04-22')
GO
INSERT INTO dbo.MyTable_8000264
VALUES ('Ane','Jakob','1995-03-04')
GO
INSERT INTO dbo.MyTable_8000264
VALUES ('Ane','Peterson','1993-05-08')
GO

Table distributes into 60 distributions - in ADW each table is distributed in to 60 distributions and data is spread among these. This is like hadoop file system. You can see the data distribution using below query.

-- Find data skew for a distributed table

DBCC PDW_SHOWSPACEUSED('dbo.MyTable_8000264');

Create table with different distributions and insert into distributed table

Round robin – this is the default if you don't mention it specifically. Data will be distributed randomly with in 60 distributions.

CREATE TABLE dbo.MyTable_8000264_rr

(

Id INT IDENTITY (0,1)

,FirstName VARCHAR(100)

,LastName VARCHAR(100)

,DOB DATE

)

--OR automatically it creates round robin as below

CREATE TABLE [dbo].[MyTable_8000264_rr]

(

[Id] [int] IDENTITY(0,1) NOT NULL,

[FirstName] [varchar](100) NULL,

[LastName] [varchar](100) NULL,

[DOB] [date] NULL

)

WITH

(

DISTRIBUTION = ROUND_ROBIN,

CLUSTERED COLUMNSTORE INDEX

)

-- Find data skew for a distributed table

DBCC PDW_SHOWSPACEUSED('dbo.MyTable_8000264_rr');

Hash - data will be distributed based on hash key given when creating the table. usually to select the hash key, the distinct data count should be more than 60. otherwise data will be distributed un-evenly.

CREATE TABLE dbo.MyTable_8000264_h

(

Id INT IDENTITY (0,1)

,FirstName VARCHAR(100)

,LastName VARCHAR(100)

,DOB DATE

)

WITH

(

DISTRIBUTION = HASH(FirstName),

HEAP

)

-- Find data skew for a distributed table

DBCC PDW_SHOWSPACEUSED('dbo.MyTable_8000264_h');

Replicate - this will duplicate the data in each distribution. This option is best for master data tables and small scale tables.

CREATE TABLE dbo.MyTable_8000264_r

(

Id INT IDENTITY (0,1)

,FirstName VARCHAR(100)

,LastName VARCHAR(100)

,DOB DATE

)

WITH

(

DISTRIBUTION = REPLICATE,

HEAP

)

-- Find data skew for a distributed table

DBCC PDW_SHOWSPACEUSED('dbo.MyTable_8000264_r');

But we cannot see that duplicate in the data using above. I also don't know why it is so.

Identity insert

SET IDENTITY_INSERT dbo.MyTable_8000264 ON
GO
INSERT INTO dbo.MyTable_8000264 (Id,FirstName,LastName,DOB)
VALUES (123,'Jhon','Smith','1990-10-19')
GO
SELECT * FROM dbo.MyTable_8000264

-- Do the same insert with set identity insert off for a hash distributed table
SET IDENTITY_INSERT dbo.MyTable_8000264 OFF
GO
INSERT INTO dbo.MyTable_8000264 (FirstName,LastName,DOB)
VALUES ('Jhon','Smith','1990-10-19')
GO
SELECT * FROM dbo.MyTable_8000264

NOTE : You will see the identity is not continues even though it is incremental. That is because of the identity is based on the distribution.

Concurrency and slot usage - this query will give you the statistics of the query execution with concurrency, and slots used by the execution and so on.

SELECT

SUM(CASE WHEN r.[status] = 'Running' THEN 1 ELSE 0 END) [running_queries],

SUM(CASE WHEN r.[status] = 'Running' THEN rw.concurrency_slots_used ELSE 0 END) [running_queries_slots],

SUM(CASE WHEN r.[status] = 'Suspended' THEN 1 ELSE 0 END) [queued_queries],

SUM(CASE WHEN rw.[state] = 'Queued' THEN rw.concurrency_slots_used ELSE 0 END) [queued_queries_slots]

FROM

sys.dm_pdw_exec_requests r

JOIN sys.dm_pdw_resource_waits rw ON rw.request_id = r.request_id

WHERE

( (r.[status] = 'Running' AND r.resource_class IS NOT NULL ) OR r.[status] ='Suspended' )

AND rw.[type] = 'UserConcurrencyResourceType';

SELECT *

FROM sys.dm_pdw_exec_requests AS r

WHERE STATUS != 'Completed'

AND STATUS != 'Failed'

AND STATUS != 'Cancelled';

ADW scaling - the following query can be used to scale ADW in and out with given DWU at query level which can be done in portal itself also.

ALTER DATABASE analyticsbidwdev MODIFY (SERVICE_OBJECTIVE = 'DW200c');

WHILE

(

SELECT TOP 1 state_desc

FROM sys.dm_operation_status

WHERE

1=1

AND resource_type_desc = 'Database'

AND major_resource_id = 'analyticsbidwdev'

AND operation = 'ALTER DATABASE'

ORDER BY

start_time DESC

) = 'IN_PROGRESS'

BEGIN

RAISERROR('Scale operation in progress',0,0) WITH NOWAIT;

WAITFOR DELAY '00:00:05';

END

PRINT 'Complete';

#Table scope - this will support you to see the difference of the # table scope between normal SQL db and ADW. Do the following query execution in both SQL db and ADW db.

CREATE PROC dbo.Proc_1

BEGIN

SELECT 123 AS Id

INTO #TEST

END

CREATE PROC dbo.Proc_2

BEGIN

SELECT 456 AS Id

INTO #TEST

END

CREATE PROC dbo.Proc_Master

BEGIN

EXEC dbo.Proc_1

EXEC dbo.Proc_2

END

--Run main proc

EXEC dbo.Proc_Master

--Drop procs

DROP PROCEDURE dbo.Proc_1

DROP PROCEDURE dbo.Proc_2

DROP PROCEDURE dbo.Proc_Master

Usually the # table is scoped in to the session in both SQL db and ADW db. But the scope of inner stored procedure is bit different. In SQL db, if you execute SP with in another SP, two SPs will execute in two different sessions. But in ADW, second SP will execute in the same session of the first SP. That's why when you have the same # table in second SP gets an error saying # table already exists.

CTAS Table - means Create Table As Select which will create a table using the output of the select statement and given index and distribution option.

CREATE TABLE dbo.MyTable_8000264_CTAS

WITH

(

CLUSTERED COLUMNSTORE INDEX,

DISTRIBUTION = ROUND_ROBIN

)

AS SELECT *

FROM dbo.MyTable_8000264;

That is about the querying ADW db for now and have more practice and find the optimal way to write queries in ADW with resource class and concurrency.

Azure for DWBI

Search This Blog

Data Warehouse Hands On

Comments

Post a Comment