Template: Dimension

Loading data to a dimension is straight forward. Prepare the data in a stage table and load it to dimension. The interesting stuff is how to define and load the dimension.

What are the requirements:

Stage

Nothing special. Just make sure that the business key is unique.

Dimensions

This has taken me some time to figure out all rules how to build a dimension fast and properly.

The primary key of a dimension is the surrogate key. It is usually an integer identity column start counting from 1 upwards.

But when working with it, I found a lot of exceptions. Especially around zero keys. What is the zero key? It is the record used when no value is defined (aka NULL) in the fact table.

I got confused because the feature is called zero key. It implies that this record is always of the value 0. But in Data Vault, we define missing values as -1. Should I use here also -1 as unknown value? I decided to go with 0 because maybe I need to revert back to WhereScape standard functionality with SCD stuff and so on and WhereScape defines it as 0.

But then I hit dimensions where 0 couldn’t be the zero key. Then the argument came that we shouldn’t give meaning to the surrogate key. But people do. E.g. the key in the date dimension is usually in the form of yyyymmdd. And then I read that the zero key of the date dimension should be something in the future because otherwise the data would be in the same partition as old data.

This leads to the following requirements:

  • I want to control the id generated. I need to map a column from stage to the dimension table
  • I want to be able to set the zero key to a user-defined value
  • I want to suppress the zero key generated
  • I want to suppress that missing business keys are getting inserted by the fact table

For that to work I used the recently introduced feature of extended properties. I have the following extended properties defined on the dimension:

zero_key_idOptional: Define another zero key value
insert_zero_keyOptional: Set it to false if no zero key should be added.
insert_missing_bkOptional: Set it to false if no missing business keys from a fact table should be added

Stored Procedure

The following steps are defined:

  • Insert Zero Key
    • SELECT Result
    • SUB-SELECT Where Not Exists
    • SUB-SELECT Zero Key Record
  • Update existing records
  • Insert new records
    • INSERT INTO Dimension
    • SELECT Result
    • SUB-SELECT Where Not Exists
    • SUB-SELECT Extract from Stage Table

Insert Zero Key

As already mentioned, I had some difficulties to understand, how to set up the zero key properly.

By default, WhereScape creates the artificial key, business key and dss_create_datetime and dss_update_datetime. Any other column for which you can define zero key values, are not considered. They are just NULL.

But I wanted to add dss_load_datetime as a system column. This is not supported.

For our cube, I don’t want to have NULL values. I came up with a view template, which replaces any NULL values from a dimension column based on the zero key value specified. For normal SQL queries on the table, NULL will still be shown.

Update existing records

The existing solution was just too slow. Comparing all columns for a change takes ages if the dimension table goes into the millions and having tons of columns.

We learnt already when loading data to a history table, that comparing only one column is much faster. The option here is to define a dss_change_hash. This will speed up the update process compared to default WhereScape behavior.

But there is something faster. Calculating dss_change_hash takes up also some processing time. With what can we replace it?

Until now we have taken a lot of energy to pump dss_load_datetime throughout the system. The idea is now, to just update columns if there is a change in dss_load_datetime. If there is a change let’s just update it. No need to compare if there is REALLY a change. This would kill the performance again.

How can we make sure, that there is always the most recent dss_load_datetime? Especially if we are joining data from dozens of satellites? I have modified the stage template slightly to calculate the most recent datetime by adding a transformation on dss_load_datetime.

(
  SELECT MAX(v) FROM (VALUES 
    (first.dss_load_datetime)
   ,(second.dss_load_datetime)
   ,(....dss_load_datetime)
  ) AS VALUE(v)
)

Is that not brilliant? In that way, we get always the most recent data. If somebody updates data on an existing record it will fail of course. But then he doesn’t do Data Vault.

Here the update query:

UPDATE [TABLEOWNER].[dim_demo] WITH ( TABLOCKX )
SET     column1 = stage_dim_demo.column1
       ,column... = stage_dim_demo.column...
       ,dss_load_datetime = stage_dim_demo.dss_load_datetime
       ,dss_update_datetime = @v_current_datetime
FROM [TABLEOWNER].[stage_dim_demo] stage_dim_demo
WHERE  dim_demo.shop_code = stage_dim_demo.shop_code
       AND dim_demo.sku = stage_dim_demo.sku
       AND dim_demo.dss_load_datetime <> stage_dim_demo.dss_load_datetime

Select Extract from Stage

Load the data from the stage table.

Where Not Exists

Load Data only if the dimension doesn’t exist. Records that does exist have already been updated.

WHERE NOT EXISTS (
        SELECT 1
        FROM   [TABLEOWNER].[dim_demo] dim_demo__not_exist
        WHERE  dim_demo__not_exist.shop_code = dim_demo.shop_code
)

Result

The resulting query looks like this:

INSERT INTO [TABLEOWNER].[dim_demo] WITH ( TABLOCKX )
(
        sku
       ,product_group_l1
       ,product_group_l2
       ,product_family
       ,dss_load_datetime
       ,dss_create_datetime
       ,dss_update_datetime
)
SELECT 
       dim_demo.sku AS sku 
      ,dim_demo.product_group_l1 AS product_group_l1 
      ,dim_demo.product_group_l2 AS product_group_l2 
      ,dim_demo.product_family AS product_family 
      ,dim_demo.dss_load_datetime AS dss_load_datetime 
      ,@v_current_datetime AS dss_create_datetime 
      ,@v_current_datetime AS dss_update_datetime 
FROM   (
  SELECT dim_demo.*
  FROM   (
    SELECT 
           stage_dim_demo.sku AS sku 
          ,stage_dim_demo.product_group_l1 AS product_group_l1 
          ,stage_dim_demo.product_group_l2 AS product_group_l2 
          ,stage_dim_demo.product_family AS product_family 
          ,stage_dim_demo.dss_load_datetime AS dss_load_datetime 
    FROM [TABLEOWNER].[stage_dim_demo] stage_dim_demo
  ) AS dim_demo
  WHERE NOT EXISTS (
          SELECT 1
          FROM   [TABLEOWNER].[dim_demo] dim_demo__not_exist
          WHERE  dim_demo__not_exist.shop_code = dim_demo.shop_code
                 AND dim_demo__not_exist.sku = dim_demo.sku
  )
) AS dim_demo

Implementation

Just download from the template section the following templates for WhereScape written in Pebble:

  • nu3_sqlserver_procedure_dim
  • nu3_sqlserver_block_dim

Select it from the properties. That’s it.