Dez 17 2017
When writing about satellite loading, we need also to talk about hub and link loading. They happen at the same time.
Before Data Vault 2.0 a hub or link surrogate key was an integer. Nowadays modern data vaulters would use hash keys instead to allow parallel loading. Others are going back to the roots and are using natural business keys. I dropped that idea because there was a reason to define something else instead. In our case it is about the handling of strings. If I want to be sure that something is unique, I would have to specify a binary collation for the business key column. Too much work and I wanted to have a properly working data warehouse first before going into experimental work with all the obstacles involved.
The following pictures shows the data flow for all data vault objects:
From the stage table, we load data to satellite, hubs and links. As you can imagine, there might be multiple satellites for 1 hub. Unfortunately, WhereScape doesn’t allow multi-to-one mapping. What exists is a kind of merge stage table.
This leaves us with the following data pipeline:
From the stage table (which loads also data to the satellite) we create a stage view with columns like they consist in the hub. Then we would add a stage merge to combine multiple views into one. The hub WOULD then load data from the stage merge (but doesn’t).
This pipeline is for lineage only. We can track columns and tables forth and back without breaking the lineage.
What we do is to “recycle” the stored procedure which loads data from the merge stage table to the hub/link and replace the source table with the stage view.
In that way, we can load anything in parallel.
What are the requirements:
For the hub or link we need to create a hash key. The hash key for a hub can be calculated from multiple business keys. And the link hash key from multiple hub business keys. I add also business keys to the link which have no hubs. A peg-legged hub.
That’s it what we need from the stage table.
The primary key of the hub is the hash key. The business keys form a unique constrain.
The primary key of the link is the hash key. The hub hash keys and added peg-legged hub columns should form a unique constrain. Unfortunately, WhereScape doesn’t support that yet. This is caused by the key type. Usually the key type ‘A’ is the business key. Hub columns have the key type ‘h’, so we can’t assign ‘A’ to it. Maybe they come up with a solution some when.
Before going into the detail, I want to stress one major difference to other templates I have seen. Usually they insert new records where the business keys doesn’t exist. I changed it to insert new records where the hash key don’t exist. Why so? First, the hash key is the primary key and has usually an index on it. Much faster to search in. If I would scan for the business keys they can consist of multiple columns which is slower again.
The main objective against using business keys is, that SQL Server might not find a difference in the value due to the collation used. You would have to set the business key to a binary collation to capture every change. Without a binary collation, it would lead to satellite records not having a hub record and therefore a broken foreign-key relationship. Happened to us.
What about hash key collisions? A lot is talked about. To get these collisions, I check for the existence of both, hash key and business keys. The unique constrain on the hub or link (when it gets supported) will give us a crash, if I get duplicate records.
What are the steps:
Insert new records
INSERT INTO Hub
SUB-SELECT Row Condensing
SUB-SELECT Where Not Exists
SUB-SELECT Extract from Stage Table
Extract from Stage Table
Get the data from the stage table.
Where Not Exists
We want to load only new records.
SELECT * FROM (sub_query) AS hub WHERE NOT EXISTS ( SELECT 1 FROM hub AS hub__not_exist WHERE <hub/link_key> = <hub/link_key> AND <business key columns> = <business key columns>
If you want to get faster performance and remove collisions, delete the check for
<business key columns>.
Very easy to make sure that it crashes when you hit one. Tell me about it, when you have one (really in your work, not by googling for a collision). Maybe I’m dead until this happens.
It is possible, that we get multiple rows for a hub, but having many different
dss_load_datetime values. We want to get the first used.
We must make also sure that we don’t suppress hash collisions. Meaning we could have same hash key for different business keys.
ROW_NUMBER() OVER (PARTITION BY <hub/link_hash_key>, <business key columns> ORDER BY dss_load_datetime ASC AS dss_row_no
WHERE dss_row_no = 1
As we load data from stage views instead of the merge stage, we need to replace the merge stage table name with the stage view name.
The result looks like this:
SET @v_source_table = CASE WHEN @p_task_name = 'hub_demo' THEN 'hub_demo' ELSE @p_task_name END SET @v_sql = N' INSERT INTO [TABLEOWNER].[hub_demo] WITH ( TABLOCKX ) ( hub_product_key ,shop_code ,sku ,dss_load_datetime ,dss_record_source ,dss_create_datetime ) SELECT hub_demo.hub_product_key AS hub_product_key ,hub_demo.shop_code AS shop_code ,hub_demo.sku AS sku ,hub_demo.dss_load_datetime AS dss_load_datetime ,hub_demo.dss_record_source AS dss_record_source ,@v_current_datetime AS dss_create_datetime FROM ( SELECT * ,ROW_NUMBER() OVER (PARTITION BY hub_product_key , shop_code , sku ORDER BY dss_load_datetime ASC AS dss_row_no FROM ( SELECT * FROM ( SELECT stage_hub_demo.hub_product_key AS hub_product_key ,stage_hub_demo.shop_code AS shop_code ,stage_hub_demo.sku AS sku ,stage_hub_demo.dss_load_datetime AS dss_load_datetime ,stage_hub_demo.dss_record_source AS dss_record_source FROM [TABLEOWNER].[stage_hub_demo] stage_hub_demo ) AS hub_demo WHERE NOT EXISTS ( SELECT 1 FROM [TABLEOWNER].[hub_demo] hub_demo__not_exist WHERE hub_demo__not_exist.hub_product_key = hub_demo.hub_product_key ) ) AS hub_demo ) AS hub_demo WHERE dss_row_no = 1 ' SET @v_sql = REPLACE(@v_sql,'@v_current_datetime','CAST(''' + CAST(@v_current_datetime AS NVARCHAR) + ''' AS DATETIME2(7))') SET @v_sql = REPLACE(@v_sql,'stage_hub_demo',@v_source_table)
Download from the template section the following templates for WhereScape written in Pebble:
- nu3_sqlserver_ddl_view_stage (if you want to configure stage objects as stage views)
For the hub just select the template. This will generate the stored procedure to load data from the merge stage table.
Now comes the tricky (or dirty) part. Open the properties of the stage view and select this generated stored procedure in the “Custom Procedure” option field. With the replace function in place, we can call now on the stage view “Execute Custom Procedure” and it will load from the view instead of the default merge stage object.