Loading data to a fact table is quite a difficult beast within WhereScape without templates.
First, do I set business keys or not? If I set business keys the whole loading procedure is based on a cursor which updates or inserts 1 record at a time. Not good if I’m going to load 100 mio. of records. This will take ages.
Nowadays more and more people like clustered columnstore indexes. Cursors don’t work with them.
Ok, no business keys. Then I’m able to tick the „Set Based Insert“ option. But with incremental loading I would need to delete or update updated records. So, I tick the „Include Delete Before Insert“. Now I’m able to select a column which matches stage and fact table. But only 1 column! This creates the following WHERE clause:
WHERE fact.column1 IN ( SELECT DISTINCT stage_fact.column1 FROM [TABLEOWNER].[stage_fact] stage_fact )
What about if I have multiple columns as a business key? By the way, we have already decided in the previous step that we don’t have business keys?!?
This was not satisfactory to work with.
What are the requirements:
Stage
In the stage table, we prepare the data set to be loaded. If we have dimensions with surrogate keys, we want to add a dim_key
column to the stage table. This will look up the business key in the dimension table and returns the surrogate key. For business keys which don’t exist we want to add the unknown key.
Maybe we have late arriving dimensions or we are loading dimensions and facts in parallel. In that case, we don’t want to have the unknown key, but adding missing business keys to the dimension table. But for some dimensions I don’t want to get missing keys created (e.g. date dimension or dimensions with a fixed set of values).
Fact
Replace or insert data as fast as possible. Either use the dss_load_datetime
or the dss_change_hash
for delta detection.
Stored Procedure
Stage
There is one thing I want to point out from the standard stage template. How can we add missing business keys to a dimension?
WhereScape by default has another stored procedure to look up dimension keys and update it in the stage table. Again, with a cursor. 1-by-1. Unfortunately, we can’t create templates for that stored procedure.
Some when I figured out, that I could add another section in the stage template by inserting missing business keys into the dimensions. A really brilliant and fast solution.
I just INSERT
a DISTINCT
list of business keys for a dimension followed by an UPDATE
statement. Both are set based. No cursor. Incredibly fast!
INSERT INTO [TABLEOWNER].[dim_product] WITH ( TABLOCKX )
(
shop_code
,sku
,dss_load_datetime
,dss_create_datetime
,dss_update_datetime
)
SELECT DISTINCT
shop_code
,sku
,'0001-01-01'
,@v_current_datetime
,@v_current_datetime
FROM [TABLEOWNER].[stage_demo_fact]
WHERE dim_product_key = 0
UPDATE [TABLEOWNER].[stage_demo_fact] WITH ( TABLOCKX )
SET dim_product_key = dim_product.dim_product_key
FROM [TABLEOWNER].[dim_product]
WHERE stage_demo_fact.dim_product_key = 0
AND dim_product.shop_code = stage_demo_fact.shop_code
AND dim_product.sku = stage_demo_fact.sku
Fact
The following steps are defined:
Delete existing records/Update changed records
Insert new records
INSERT INTO Fact
SELECT Result
SUB-SELECT Where Not Exists
SUB-SELECT Extract from Stage Table
Delete existing records/Update changed records
If you have traditional tables, the template will create an UPDATE
statement. If you are working with Clustered Columnstore Indexed tables then UPDATE
has a really bad performance. Then DELETE
and INSERT
is much faster. With DELETE
it sets only a marker on the record like is_deleted
instead of replacing all the values. Please read the following blog entry for more background information.
Select Extract from Stage
Load the data from the stage table.
Where Not Exists
Load Data only if the fact doesn’t exist. Records that existed have already been updated or deleted.
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[demo_fact] demo_fact__not_exist
WHERE demo_fact__not_exist.sku = demo_fact.sku
AND demo_fact__not_exist.date = demo_fact.date
)
Result
The resulting query looks like this:
INSERT INTO [TABLEOWNER].[demo_fact] WITH ( TABLOCKX )
(
dim_product_key
,dim_date_key
,sku
,date
,purchase_cost
,dss_load_datetime
,dss_create_datetime
,dss_update_datetime
)
SELECT
demo_fact.dim_product_key AS dim_product_key
,demo_fact.dim_date_key AS dim_date_key
,demo_fact.sku AS sku
,demo_fact.date AS date
,demo_fact.purchase_cost AS purchase_cost
,demo_fact.dss_load_datetime AS dss_load_datetime
,@v_current_datetime AS dss_create_datetime
,@v_current_datetime AS dss_update_datetime
FROM (
SELECT demo_fact.*
FROM (
SELECT
stage_demo_fact.dim_product_key AS dim_product_key
,stage_demo_fact.dim_date_key AS dim_date_key
,stage_demo_fact.sku AS sku
,stage_demo_fact.date AS date
,stage_demo_fact.purchase_cost AS purchase_cost
,stage_demo_fact.dss_load_datetime AS dss_load_datetime
FROM [TABLEOWNER].[stage_demo_fact] stage_demo_fact
) AS demo_fact
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[demo_fact] demo_fact__not_exist
WHERE demo_fact__not_exist.sku = demo_fact.sku
AND demo_fact__not_exist.date = demo_fact.date
)
) AS demo_fact
Implementation
Download from the template section the following templates for WhereScape written in Pebble:
- nu3_sqlserver_procedure_fact
- nu3_sqlserver_block_fact
- nu3_sqlserver_procedure_stage (updated)
Select the template in the properties.
For any dimension, for which we don’t want to have missing business keys automatically created, modify the extended property on the dimension insert_missing_bk = false
(see blog post).
That’s it.
Jan 7 2018
Template: Fact
Loading data to a fact table is quite a difficult beast within WhereScape without templates.
First, do I set business keys or not? If I set business keys the whole loading procedure is based on a cursor which updates or inserts 1 record at a time. Not good if I’m going to load 100 mio. of records. This will take ages.
Nowadays more and more people like clustered columnstore indexes. Cursors don’t work with them.
Ok, no business keys. Then I’m able to tick the „Set Based Insert“ option. But with incremental loading I would need to delete or update updated records. So, I tick the „Include Delete Before Insert“. Now I’m able to select a column which matches stage and fact table. But only 1 column! This creates the following WHERE clause:
WHERE fact.column1 IN ( SELECT DISTINCT stage_fact.column1 FROM [TABLEOWNER].[stage_fact] stage_fact )
What about if I have multiple columns as a business key? By the way, we have already decided in the previous step that we don’t have business keys?!?
This was not satisfactory to work with.
What are the requirements:
Stage
In the stage table, we prepare the data set to be loaded. If we have dimensions with surrogate keys, we want to add a
dim_key
column to the stage table. This will look up the business key in the dimension table and returns the surrogate key. For business keys which don’t exist we want to add the unknown key.Maybe we have late arriving dimensions or we are loading dimensions and facts in parallel. In that case, we don’t want to have the unknown key, but adding missing business keys to the dimension table. But for some dimensions I don’t want to get missing keys created (e.g. date dimension or dimensions with a fixed set of values).
Fact
Replace or insert data as fast as possible. Either use the
dss_load_datetime
or thedss_change_hash
for delta detection.Stored Procedure
Stage
There is one thing I want to point out from the standard stage template. How can we add missing business keys to a dimension?
WhereScape by default has another stored procedure to look up dimension keys and update it in the stage table. Again, with a cursor. 1-by-1. Unfortunately, we can’t create templates for that stored procedure.
Some when I figured out, that I could add another section in the stage template by inserting missing business keys into the dimensions. A really brilliant and fast solution.
I just
INSERT
aDISTINCT
list of business keys for a dimension followed by anUPDATE
statement. Both are set based. No cursor. Incredibly fast!Fact
The following steps are defined:
Delete existing records/Update changed records
Insert new records
INSERT INTO Fact
SELECT Result
SUB-SELECT Where Not Exists
SUB-SELECT Extract from Stage Table
Delete existing records/Update changed records
If you have traditional tables, the template will create an
UPDATE
statement. If you are working with Clustered Columnstore Indexed tables thenUPDATE
has a really bad performance. ThenDELETE
andINSERT
is much faster. WithDELETE
it sets only a marker on the record likeis_deleted
instead of replacing all the values. Please read the following blog entry for more background information.Select Extract from Stage
Load the data from the stage table.
Where Not Exists
Load Data only if the fact doesn’t exist. Records that existed have already been updated or deleted.
Result
The resulting query looks like this:
Implementation
Download from the template section the following templates for WhereScape written in Pebble:
Select the template in the properties.
For any dimension, for which we don’t want to have missing business keys automatically created, modify the extended property on the dimension
insert_missing_bk = false
(see blog post).That’s it.
By ecki • Business Intelligence 0