Template: Dimension
Jan 6 2018

Loading data into a dimension is straightforward: prepare the data in a stage table and load it into the dimension. The interesting part is how to define and load the dimension.
What are the requirements?
Stage
Nothing special. Just make sure that the business key is unique.
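A quick sanity check, using the demo stage table and business key that appear later in this post (adjust the names to your model):

-- Any row returned here means the business key is not unique in stage
SELECT
shop_code
,sku
,COUNT(*) AS duplicates
FROM [TABLEOWNER].[stage_dim_demo]
GROUP BY
shop_code
,sku
HAVING COUNT(*) > 1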
Dimensions
It took me some time to figure out all the rules for building a dimension quickly and properly.
The primary key of a dimension is the surrogate key. It is usually an integer identity column that starts counting from 1 upwards.
But when working with it, I found a lot of exceptions, especially around zero keys. What is a zero key? It is the record used when no value is defined (i.e. NULL) in the fact table.
I got confused because the feature is called zero key, which implies that this record always has the value 0. But in Data Vault, we define missing values as -1. Should I also use -1 as the unknown value here? I decided to go with 0, because I may need to revert to WhereScape's standard functionality (SCD handling and so on), and WhereScape defines it as 0.
But then I hit dimensions where 0 couldn't be the zero key. The counter-argument is that we shouldn't give meaning to the surrogate key. But people do: the key in the date dimension, for example, usually has the form yyyymmdd. And then I read that the zero key of the date dimension should be a date in the future, because otherwise the zero key record would land in the same partition as the old data.
This leads to the following requirements:
- I want to control the generated id, so I need to be able to map a column from stage to the dimension table
- I want to be able to set the zero key to a user-defined value
- I want to be able to suppress the generated zero key
- I want to be able to suppress the insertion of missing business keys coming from a fact table
To make this work, I used the recently introduced extended properties feature. I have the following extended properties defined on the dimension:
- zero_key_id: Optional. Define another zero key value.
- insert_zero_key: Optional. Set it to false if no zero key should be added.
- insert_missing_bk: Optional. Set it to false if missing business keys from a fact table should not be added.
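For the date dimension, for instance, zero_key_id could be set to a future value such as 99991231 (an illustrative choice, not a prescribed one), so that the zero key record stays away from the partitions holding the old data.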
Stored Procedure
The following steps are defined:
- Insert Zero Key
  - SELECT Result
  - SUB-SELECT Where Not Exists
  - SUB-SELECT Zero Key Record
- Update existing records
- Insert new records
  - INSERT INTO Dimension
  - SELECT Result
  - SUB-SELECT Where Not Exists
  - SUB-SELECT Extract from Stage Table
Insert Zero Key
As already mentioned, I had some difficulties understanding how to set up the zero key properly.
By default, WhereScape creates the artificial key, the business key, dss_create_datetime and dss_update_datetime. Any other column for which you can define zero key values is not considered; those columns are just NULL.
But I wanted to add dss_load_datetime as a system column. This is not supported.
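To give an idea of what this step produces, here is a minimal sketch of the zero key insert (column names and default values are illustrative; it assumes the surrogate key is loaded explicitly rather than generated as an identity, in line with the first requirement above):

INSERT INTO [TABLEOWNER].[dim_demo]
(
dim_demo_key
,shop_code
,sku
,dss_create_datetime
,dss_update_datetime
)
SELECT
0 AS dim_demo_key -- or the value from the zero_key_id extended property
,'N/A' AS shop_code
,'N/A' AS sku
,@v_current_datetime AS dss_create_datetime
,@v_current_datetime AS dss_update_datetime
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[dim_demo] dim_demo__not_exist
WHERE dim_demo__not_exist.dim_demo_key = 0
)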
For our cube, I don't want to have NULL values. I came up with a view template which replaces NULL values in a dimension column with the zero key value specified for it. For normal SQL queries on the table, NULL will still be shown.
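As a sketch of what such a view could look like (the view name and the 'Unknown' defaults are made up; the real defaults come from the zero key values specified per column):

CREATE VIEW [TABLEOWNER].[dim_demo_cube] AS
SELECT
dim_demo_key
,shop_code
,sku
,ISNULL(product_group_l1, 'Unknown') AS product_group_l1
,ISNULL(product_group_l2, 'Unknown') AS product_group_l2
,ISNULL(product_family, 'Unknown') AS product_family
,dss_load_datetime
,dss_create_datetime
,dss_update_datetime
FROM [TABLEOWNER].[dim_demo]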
Update existing records
The existing solution was just too slow. Comparing all columns for a change takes ages once the dimension table grows into the millions of rows and has tons of columns.
We learnt already when loading data into a history table that comparing only one column is much faster. The option here is to define a dss_change_hash. This will speed up the update process compared to the default WhereScape behavior.
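For illustration, such a change hash could be computed in the stage table over all payload columns, roughly like this (the delimiter and column list are just an example; CONCAT conveniently treats NULL as an empty string):

SELECT
shop_code
,sku
,HASHBYTES('md5', CONCAT(
product_group_l1
,'~'
,product_group_l2
,'~'
,product_family
)) AS dss_change_hash
FROM [TABLEOWNER].[stage_dim_demo]

The update then only needs to compare the stored dss_change_hash against the one coming from stage.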
But there is something faster. Calculating dss_change_hash also takes some processing time. What can we replace it with?
Until now we have spent a lot of energy pumping dss_load_datetime throughout the system. The idea now is to update columns only if there is a change in dss_load_datetime. If there is a change, let's just update; there is no need to check whether something REALLY changed, as that would kill the performance again.
How can we make sure that there is always the most recent dss_load_datetime, especially if we are joining data from dozens of satellites? I have modified the stage template slightly to calculate the most recent datetime by adding a transformation on dss_load_datetime:
(
SELECT MAX(v) FROM (VALUES
(first.dss_load_datetime)
,(second.dss_load_datetime)
,(....dss_load_datetime)
) AS VALUE(v)
)
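In context, the transformation would sit in the stage SELECT roughly like this (the hub and satellite names are invented for the example):

SELECT
hub_product.sku
,sat_product.product_family
,(
SELECT MAX(v) FROM (VALUES
(sat_product.dss_load_datetime)
,(sat_price.dss_load_datetime)
) AS VALUE(v)
) AS dss_load_datetime
FROM [TABLEOWNER].[hub_product] hub_product
JOIN [TABLEOWNER].[sat_product] sat_product
ON sat_product.hub_product_key = hub_product.hub_product_key
JOIN [TABLEOWNER].[sat_price] sat_price
ON sat_price.hub_product_key = hub_product.hub_product_key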
Is that not brilliant? That way, we always get the most recent data. Of course, if somebody updates data on an existing record in place, this breaks. But then they are not doing Data Vault.
Here is the update query:
UPDATE [TABLEOWNER].[dim_demo] WITH ( TABLOCKX )
SET column1 = stage_dim_demo.column1
,column... = stage_dim_demo.column...
,dss_load_datetime = stage_dim_demo.dss_load_datetime
,dss_update_datetime = @v_current_datetime
FROM [TABLEOWNER].[stage_dim_demo] stage_dim_demo
WHERE dim_demo.shop_code = stage_dim_demo.shop_code
AND dim_demo.sku = stage_dim_demo.sku
AND dim_demo.dss_load_datetime <> stage_dim_demo.dss_load_datetime
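Note that dim_demo in the WHERE clause refers to the update target itself: SQL Server correlates the table name with the UPDATE target even though only the stage table appears in the FROM clause, so the statement effectively joins dimension and stage on the business key.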
Select Extract from Stage
Load the data from the stage table.
Where Not Exists
Load data only if the record doesn't exist in the dimension yet. Records that do exist have already been updated.
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[dim_demo] dim_demo__not_exist
WHERE dim_demo__not_exist.shop_code = dim_demo.shop_code
AND dim_demo__not_exist.sku = dim_demo.sku
)
Result
The resulting query looks like this:
INSERT INTO [TABLEOWNER].[dim_demo] WITH ( TABLOCKX )
(
sku
,product_group_l1
,product_group_l2
,product_family
,dss_load_datetime
,dss_create_datetime
,dss_update_datetime
)
SELECT
dim_demo.sku AS sku
,dim_demo.product_group_l1 AS product_group_l1
,dim_demo.product_group_l2 AS product_group_l2
,dim_demo.product_family AS product_family
,dim_demo.dss_load_datetime AS dss_load_datetime
,@v_current_datetime AS dss_create_datetime
,@v_current_datetime AS dss_update_datetime
FROM (
SELECT dim_demo.*
FROM (
SELECT
stage_dim_demo.sku AS sku
,stage_dim_demo.product_group_l1 AS product_group_l1
,stage_dim_demo.product_group_l2 AS product_group_l2
,stage_dim_demo.product_family AS product_family
,stage_dim_demo.dss_load_datetime AS dss_load_datetime
FROM [TABLEOWNER].[stage_dim_demo] stage_dim_demo
) AS dim_demo
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[dim_demo] dim_demo__not_exist
WHERE dim_demo__not_exist.shop_code = dim_demo.shop_code
AND dim_demo__not_exist.sku = dim_demo.sku
)
) AS dim_demo
Implementation
Just download the following templates for WhereScape, written in Pebble, from the template section:
- nu3_sqlserver_procedure_dim
- nu3_sqlserver_block_dim
Select them in the properties. That's it.