Template: Fact

Loading data to a fact table is quite a difficult beast within WhereScape without templates.

First, do I set business keys or not? If I set business keys the whole loading procedure is based on a cursor which updates or inserts 1 record at a time. Not good if I’m going to load 100 mio. of records. This will take ages.

Nowadays more and more people like clustered columnstore indexes. Cursors don’t work with them.

Ok, no business keys. Then I’m able to tick the „Set Based Insert“ option. But with incremental loading I would need to delete or update updated records. So, I tick the „Include Delete Before Insert“. Now I’m able to select a column which matches stage and fact table. But only 1 column! This creates the following WHERE clause:

WHERE fact.column1 IN ( SELECT DISTINCT stage_fact.column1 FROM [TABLEOWNER].[stage_fact] stage_fact )

What about if I have multiple columns as a business key? By the way, we have already decided in the previous step that we don’t have business keys?!?

This was not satisfactory to work with.

What are the requirements:

Stage

In the stage table, we prepare the data set to be loaded. If we have dimensions with surrogate keys, we want to add a dim_key column to the stage table. This will look up the business key in the dimension table and returns the surrogate key. For business keys which don’t exist we want to add the unknown key.

Maybe we have late arriving dimensions or we are loading dimensions and facts in parallel. In that case, we don’t want to have the unknown key, but adding missing business keys to the dimension table. But for some dimensions I don’t want to get missing keys created (e.g. date dimension or dimensions with a fixed set of values).

Fact

Replace or insert data as fast as possible. Either use the dss_load_datetime or the dss_change_hash for delta detection.

Stored Procedure

Stage

There is one thing I want to point out from the standard stage template. How can we add missing business keys to a dimension?

WhereScape by default has another stored procedure to look up dimension keys and update it in the stage table. Again, with a cursor. 1-by-1. Unfortunately, we can’t create templates for that stored procedure.

Some when I figured out, that I could add another section in the stage template by inserting missing business keys into the dimensions. A really brilliant and fast solution.

I just INSERT a DISTINCT list of business keys for a dimension followed by an UPDATE statement. Both are set based. No cursor. Incredibly fast!

INSERT INTO [TABLEOWNER].[dim_product] WITH ( TABLOCKX )
(
       shop_code
      ,sku
      ,dss_load_datetime
      ,dss_create_datetime
      ,dss_update_datetime
)
SELECT DISTINCT
       shop_code
      ,sku
      ,'0001-01-01'
      ,@v_current_datetime
      ,@v_current_datetime
FROM   [TABLEOWNER].[stage_demo_fact]
WHERE  dim_product_key = 0
UPDATE [TABLEOWNER].[stage_demo_fact] WITH ( TABLOCKX )
SET    dim_product_key = dim_product.dim_product_key
FROM   [TABLEOWNER].[dim_product]
WHERE  stage_demo_fact.dim_product_key = 0
       AND dim_product.shop_code = stage_demo_fact.shop_code
       AND dim_product.sku = stage_demo_fact.sku

Fact

The following steps are defined:

  • Delete existing records/Update changed records
  • Insert new records
    • INSERT INTO Fact
    • SELECT Result
    • SUB-SELECT Where Not Exists
    • SUB-SELECT Extract from Stage Table

Delete existing records/Update changed records

If you have traditional tables, the template will create an UPDATE statement. If you are working with Clustered Columnstore Indexed tables then UPDATE has a really bad performance. Then DELETE and INSERT is much faster. With DELETE it sets only a marker on the record like is_deleted instead of replacing all the values. Please read the following blog entry for more background information.

Select Extract from Stage

Load the data from the stage table.

Where Not Exists

Load Data only if the fact doesn’t exist. Records that existed have already been updated or deleted.

WHERE NOT EXISTS (
        SELECT 1
        FROM   [TABLEOWNER].[demo_fact] demo_fact__not_exist
        WHERE  demo_fact__not_exist.sku = demo_fact.sku
               AND demo_fact__not_exist.date = demo_fact.date
)

Result

The resulting query looks like this:

INSERT INTO [TABLEOWNER].[demo_fact] WITH ( TABLOCKX )
(
        dim_product_key
       ,dim_date_key
       ,sku
       ,date
       ,purchase_cost
       ,dss_load_datetime
       ,dss_create_datetime
       ,dss_update_datetime
)
SELECT 
       demo_fact.dim_product_key AS dim_product_key 
      ,demo_fact.dim_date_key AS dim_date_key 
      ,demo_fact.sku AS sku 
      ,demo_fact.date AS date 
      ,demo_fact.purchase_cost AS purchase_cost 
      ,demo_fact.dss_load_datetime AS dss_load_datetime 
      ,@v_current_datetime AS dss_create_datetime 
      ,@v_current_datetime AS dss_update_datetime 
FROM   (
  SELECT demo_fact.*
  FROM   (
    SELECT 
           stage_demo_fact.dim_product_key AS dim_product_key 
          ,stage_demo_fact.dim_date_key AS dim_date_key 
          ,stage_demo_fact.sku AS sku 
          ,stage_demo_fact.date AS date 
          ,stage_demo_fact.purchase_cost AS purchase_cost 
          ,stage_demo_fact.dss_load_datetime AS dss_load_datetime 
    FROM [TABLEOWNER].[stage_demo_fact] stage_demo_fact
  ) AS demo_fact
  WHERE NOT EXISTS (
          SELECT 1
          FROM   [TABLEOWNER].[demo_fact] demo_fact__not_exist
          WHERE  demo_fact__not_exist.sku = demo_fact.sku
                 AND demo_fact__not_exist.date = demo_fact.date
  )
) AS demo_fact

Implementation

Download from the template section the following templates for WhereScape written in Pebble:

  • nu3_sqlserver_procedure_fact
  • nu3_sqlserver_block_fact
  • nu3_sqlserver_procedure_stage (updated)

Select the template in the properties.

For any dimension, for which we don’t want to have missing business keys automatically created, modify the extended property on the dimension insert_missing_bk = false (see blog post).

That’s it.