Jan 7 2018
Loading data to a fact table is quite a difficult beast within WhereScape without templates.
First, do I set business keys or not? If I set business keys the whole loading procedure is based on a cursor which updates or inserts 1 record at a time. Not good if I’m going to load 100 mio. of records. This will take ages.
Nowadays more and more people like clustered columnstore indexes. Cursors don’t work with them.
Ok, no business keys. Then I’m able to tick the „Set Based Insert“ option. But with incremental loading I would need to delete or update updated records. So, I tick the „Include Delete Before Insert“. Now I’m able to select a column which matches stage and fact table. But only 1 column! This creates the following WHERE clause:
WHERE fact.column1 IN ( SELECT DISTINCT stage_fact.column1 FROM [TABLEOWNER].[stage_fact] stage_fact )
What about if I have multiple columns as a business key? By the way, we have already decided in the previous step that we don’t have business keys?!?
This was not satisfactory to work with.
What are the requirements:
In the stage table, we prepare the data set to be loaded. If we have dimensions with surrogate keys, we want to add a
dim_key column to the stage table. This will look up the business key in the dimension table and returns the surrogate key. For business keys which don’t exist we want to add the unknown key.
Maybe we have late arriving dimensions or we are loading dimensions and facts in parallel. In that case, we don’t want to have the unknown key, but adding missing business keys to the dimension table. But for some dimensions I don’t want to get missing keys created (e.g. date dimension or dimensions with a fixed set of values).
Replace or insert data as fast as possible. Either use the
dss_load_datetime or the
dss_change_hash for delta detection.
There is one thing I want to point out from the standard stage template. How can we add missing business keys to a dimension?
WhereScape by default has another stored procedure to look up dimension keys and update it in the stage table. Again, with a cursor. 1-by-1. Unfortunately, we can’t create templates for that stored procedure.
Some when I figured out, that I could add another section in the stage template by inserting missing business keys into the dimensions. A really brilliant and fast solution.
DISTINCT list of business keys for a dimension followed by an
UPDATE statement. Both are set based. No cursor. Incredibly fast!
INSERT INTO [TABLEOWNER].[dim_product] WITH ( TABLOCKX ) ( shop_code ,sku ,dss_load_datetime ,dss_create_datetime ,dss_update_datetime ) SELECT DISTINCT shop_code ,sku ,'0001-01-01' ,@v_current_datetime ,@v_current_datetime FROM [TABLEOWNER].[stage_demo_fact] WHERE dim_product_key = 0
UPDATE [TABLEOWNER].[stage_demo_fact] WITH ( TABLOCKX ) SET dim_product_key = dim_product.dim_product_key FROM [TABLEOWNER].[dim_product] WHERE stage_demo_fact.dim_product_key = 0 AND dim_product.shop_code = stage_demo_fact.shop_code AND dim_product.sku = stage_demo_fact.sku
The following steps are defined:
Delete existing records/Update changed records
Insert new records
INSERT INTO Fact
SUB-SELECT Where Not Exists
SUB-SELECT Extract from Stage Table
Delete existing records/Update changed records
If you have traditional tables, the template will create an
UPDATE statement. If you are working with Clustered Columnstore Indexed tables then
UPDATE has a really bad performance. Then
INSERT is much faster. With
DELETE it sets only a marker on the record like
is_deleted instead of replacing all the values. Please read the following blog entry for more background information.
Select Extract from Stage
Load the data from the stage table.
Where Not Exists
Load Data only if the fact doesn’t exist. Records that existed have already been updated or deleted.
WHERE NOT EXISTS ( SELECT 1 FROM [TABLEOWNER].[demo_fact] demo_fact__not_exist WHERE demo_fact__not_exist.sku = demo_fact.sku AND demo_fact__not_exist.date = demo_fact.date )
The resulting query looks like this:
INSERT INTO [TABLEOWNER].[demo_fact] WITH ( TABLOCKX ) ( dim_product_key ,dim_date_key ,sku ,date ,purchase_cost ,dss_load_datetime ,dss_create_datetime ,dss_update_datetime ) SELECT demo_fact.dim_product_key AS dim_product_key ,demo_fact.dim_date_key AS dim_date_key ,demo_fact.sku AS sku ,demo_fact.date AS date ,demo_fact.purchase_cost AS purchase_cost ,demo_fact.dss_load_datetime AS dss_load_datetime ,@v_current_datetime AS dss_create_datetime ,@v_current_datetime AS dss_update_datetime FROM ( SELECT demo_fact.* FROM ( SELECT stage_demo_fact.dim_product_key AS dim_product_key ,stage_demo_fact.dim_date_key AS dim_date_key ,stage_demo_fact.sku AS sku ,stage_demo_fact.date AS date ,stage_demo_fact.purchase_cost AS purchase_cost ,stage_demo_fact.dss_load_datetime AS dss_load_datetime FROM [TABLEOWNER].[stage_demo_fact] stage_demo_fact ) AS demo_fact WHERE NOT EXISTS ( SELECT 1 FROM [TABLEOWNER].[demo_fact] demo_fact__not_exist WHERE demo_fact__not_exist.sku = demo_fact.sku AND demo_fact__not_exist.date = demo_fact.date ) ) AS demo_fact
Download from the template section the following templates for WhereScape written in Pebble:
- nu3_sqlserver_procedure_stage (updated)
Select the template in the properties.
For any dimension, for which we don’t want to have missing business keys automatically created, modify the extended property on the dimension
insert_missing_bk = false (see blog post).