Template: Stage

Usually, one starts with an easy template. The stage table is the most difficult one. The first running version was done in a few days, but perfecting it and adding more features has taken us months!

The stage table is used everywhere, so I'm starting with it. Where do we use a stage table?

Pretty much everywhere. What can we do with a stage table?

  • Load data from any source table
  • Transform data the way we want it
  • Prepare data for the target table

Loading & transforming is done easily in RED: drag & drop a column from the source table into the stage table, add a transformation if needed. Done!

The real effort goes into preparing the data for the target table. What does each target type need?

| Target Table Type | Requirements |
| --- | --- |
| History | Add a Change Hash for a performance boost; remove duplicates (optional) |
| Hub & Link | Clean the Business Key; add a Hub Hash Key; add a Link Hash Key |
| Satellite | Add a Change Hash |
| Data Mart | Nothing special |
| Fact | Look up the Dimension Key |

We could create a separate stage template for each of these, or combine them into one, as they don't differ much and an unused feature doesn't affect performance.

Let's discuss those requirements in depth.

Requirements

History

The goal in loading data into the history table is to persist only the changes over time. How do we accomplish that? One way is to compare every column of the stage table to the history table. This results in a huge WHERE clause in the loading statement, and performance-wise it is really bad: SQL Server has to compare each column, and that costs time. How do we make it faster? Calculate a hash value over all values concatenated; then SQL Server only has to compare a single column. The disadvantage is that the hash calculation takes time, but much less than comparing every column against the target.
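To illustrate the difference, here is a minimal sketch of the two approaches; the table and column names (hist_customer, stage_customer, dss_current_flag) are invented for the example, not part of the actual template:

```sql
-- Variant 1: compare every column (huge WHERE clause, slow)
INSERT INTO hist_customer (customer_id, name, city, dss_load_datetime)
SELECT s.customer_id, s.name, s.city, s.dss_load_datetime
FROM   stage_customer AS s
LEFT   JOIN hist_customer AS h
       ON  h.customer_id = s.customer_id
       AND h.dss_current_flag = 'Y'
WHERE  h.customer_id IS NULL
   OR  h.name <> s.name       -- one predicate per column,
   OR  h.city <> s.city;      -- NULL handling omitted for brevity

-- Variant 2: compare a single precomputed change hash (fast)
INSERT INTO hist_customer (customer_id, name, city, dss_change_hash, dss_load_datetime)
SELECT s.customer_id, s.name, s.city, s.dss_change_hash, s.dss_load_datetime
FROM   stage_customer AS s
LEFT   JOIN hist_customer AS h
       ON  h.customer_id = s.customer_id
       AND h.dss_current_flag = 'Y'
WHERE  h.customer_id IS NULL
   OR  h.dss_change_hash <> s.dss_change_hash;
```

With the hash, the comparison cost no longer grows with the number of payload columns.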

Hub & Link

Hub & Link form only an abstraction layer over the data; we can group satellites around a business entity. What do they need? A clean business key. If it is a number, easy. But what about text? There might be spaces around it, mixed casing, and even characters which SQL Server won't recognize as different from one another unless the column uses a binary collation (e.g. „ß“ and „ss“). And NULL values are no good for a business key either. Can you give me more information about a customer who doesn't exist?

We need to clean the business key so it is understood the same way from all perspectives and doesn't lead to duplicates (e.g. a country code with a trailing space and one without).

After the cleaning, we add a hash value computed from the key columns. In Data Vault 1.0 this was a surrogate key, but with a hash we can load data in parallel into any destination.

Satellite

Same as for history: we gain performance when loading data into a satellite. And it's part of the Data Vault standard.

Dimension

Nothing special for the stage table. Loading data into dimensions is covered later.

Fact

If the join from a fact table to a dimension table is based on the business key, nothing needs to be done. But if we use a surrogate key in the dimension table, as many people do, we need to look up this value and add it to the stage table.

Pipeline

How to order all those requirements?

I must admit that the initial design originates from Dan Linstedt. At first I had more steps in between and a different order. I thought I was cleverer… but (sigh) over time I arrived at the same result.

At the beginning of my journey, I did the cleaning of business keys in the transformations. But sometimes I forgot to clean them. Only recently did I find a way to clean them through the templates.

Stored Procedure

These are the steps I designed to get all of that into one fast data flow:

  • TRUNCATE TABLE stage_table
  • Temporary Table (only for tables which need hashing)
    • INSERT INTO #temp_table
    • SELECT + TRANSFORM from source_table
  • Target Table
    • INSERT INTO stage_table
    • SELECT Result
    • SUB-SELECT Dimension Lookup
    • SUB-SELECT Remove true duplicates
    • SUB-SELECT Calculate Hash Keys and Change Hash
    • SUB-SELECT Clean Business Key for Hash Keys
    • SUB-SELECT + Transform from source_table or #temp_table
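The step list above can be sketched as a small, concrete skeleton of the generated procedure. All names here (stage_customer, customer_code, hist_customer) are invented placeholders; the real template fills in the full column lists and transformations:

```sql
-- Step 0: empty the stage table
TRUNCATE TABLE stage_customer;

-- Step 1: extract & transform only the incremental rows into a temp table
SELECT src.customer_code,
       src.city,
       src.dss_load_datetime
INTO   #stage_customer
FROM   source_customer AS src
WHERE  src.dss_load_datetime > (SELECT ISNULL(MAX(t.dss_load_datetime),'0001-01-01')
                                FROM   hist_customer AS t);

-- Step 2: layer the remaining steps as sub-selects on top of the temp table
INSERT INTO stage_customer (hub_customer, customer_code, city, dss_load_datetime)
SELECT hub_customer, customer_code, city, dss_load_datetime
FROM (
  -- calculate hash keys over the already cleaned business key
  SELECT *,
         HASHBYTES('md5', CAST(customer_code AS NVARCHAR(MAX))) AS hub_customer
  FROM (
    -- clean the business key
    SELECT UPPER(LTRIM(RTRIM(ISNULL(NULLIF(customer_code,''),'-1')))) AS customer_code,
           city,
           dss_load_datetime
    FROM   #stage_customer
  ) AS cleaned
) AS hashed;
```

Dedup and dimension lookup would be further sub-select layers in the same nesting, as shown in the full result query below.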

Do you wonder why I create a temporary table? As I load a lot of incremental data, I came across an odd behavior of SQL Server. Sometimes it would calculate the hash for every row of the whole source table and only then filter down to the incremental data, e.g. calculating hashes for 100 million rows and keeping only the last 100,000. The optimizer believes hash calculation is free. Even if I reorder the sub-queries, it still comes up with the same execution plan. I need the #temp_table as an invisible stage table to persist the incremental data first, so that only those rows get hashed.

By using sub-selects, I can focus on each part and make it return a perfect result; I don't need to mix concerns. Initially we had those sub-selects as views. A lot of objects! With templates, it is now a no-brainer.

Extract & Transform

The template takes the source column or the transformation on it.

Clean Business Key for Hash Keys (other Business Keys are not touched)

There are many options out there. I decided to go with the following:

  • CHAR data types: UPPER(LTRIM(RTRIM(ISNULL(NULLIF(source_column,''),'-1'))))
  • INT data types: ISNULL(source_column,-1)

If a CHAR column is empty, I set it to NULL; every NULL is then replaced by '-1'. After that, surrounding spaces are trimmed and the value is upper-cased.

If an INT column is NULL, it is set to -1. Zero is a valid value, so there is no need to remap it.
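A few worked examples of the cleaning rules (input values invented):

```sql
-- CHAR rule: trim and upper-case; empty string becomes NULL, NULL becomes '-1'
SELECT UPPER(LTRIM(RTRIM(ISNULL(NULLIF('  nu3 ',''),'-1'))));   -- 'NU3'
SELECT UPPER(LTRIM(RTRIM(ISNULL(NULLIF('',''),'-1'))));         -- '-1'

-- INT rule: only NULL is remapped, zero stays zero
SELECT ISNULL(CAST(NULL AS INT),-1);   -- -1
SELECT ISNULL(0,-1);                   -- 0
```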

On SQL Server, use a BINARY data type to store the hash. It uses half the space, it's the default output of HASHBYTES, and SQL Server doesn't need to cast it to CHAR. Much better performance. Hashing is already slow; no need to make it slower. If you don't know how to query for a binary value, just add the value without quotes, like an integer value.

SELECT * FROM hub_product WHERE hub_product = 0x02D...

The template can handle both. If you have it in CHAR it will cast it, otherwise not.
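If you do need the hash as a readable string, for example for debugging, SQL Server's CONVERT with style 2 turns a binary value into a hex string and back. A small sketch (hub_product as in the example above):

```sql
DECLARE @h BINARY(16) = HASHBYTES('md5', N'demo');   -- MD5 returns 16 bytes
SELECT @h;                                           -- raw binary, displayed as 0x...
SELECT CONVERT(CHAR(32), @h, 2);                     -- hex string without the 0x prefix
SELECT * FROM hub_product WHERE hub_product = @h;    -- compare directly, no quotes needed
```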

Compute Hash Keys and Change Hash

Hash Key:

HASHBYTES('md5',
    CAST(column_1 AS NVARCHAR(MAX)) +'||'+
    CAST(column_x AS NVARCHAR(MAX))
)

Change Hash:

HASHBYTES('md5',
    COALESCE(CAST(column_1 AS NVARCHAR(MAX)),'') +'||'+
    COALESCE(CAST(column_x AS NVARCHAR(MAX)),'')
)

Casting everything to NVARCHAR(MAX) is the easiest approach. NVARCHAR has the best implicit conversion from any data type to string, so the results look clean. Only DATETIME and one other type (which I can't remember) produce ugly strings, but as I only use DATETIME2 I never run into that. My first version assigned each data type its own string format; this is much easier to understand and use. Don't ever cast to plain VARCHAR(MAX). I lost data with that. As long as there is no true UTF-8 character set (with variable byte length) for VARCHAR, stick with the UTF-16 of NVARCHAR.
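A quick demonstration of why the VARCHAR cast is dangerous: characters outside the column's code page are silently replaced. The exact result depends on your collation; the example assumes a Latin1 default:

```sql
SELECT CAST(N'Köln' AS VARCHAR(MAX));    -- fine on a Latin1 collation
SELECT CAST(N'東京' AS VARCHAR(MAX));     -- silently degraded to '??' on a Latin1 collation
SELECT CAST(N'東京' AS NVARCHAR(MAX));    -- stays intact
```

Two rows that differ only in such characters would then produce the same hash, and the change would be lost.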

For the hash keys, everything is already cleaned in the previous step. When the cleaning still lived in the first select & transform step, I occasionally forgot to add it and saved bad business keys. Now they are cleaned. Always! I love not having to think about such obstacles.

For the change hash, we just have to make sure we don't get NULLs: a single NULL turns the whole concatenated string into NULL. Some people also apply RTRIM, but then I would lose a change coming from the source system, which I don't like. I analyzed the impact and found it negligible.
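This is the NULL effect the COALESCE guards against; one NULL operand nullifies the entire concatenation:

```sql
SELECT 'a' + '||' + NULL + '||' + 'c';   -- NULL: the whole hash input is gone

SELECT COALESCE('a','') + '||' + COALESCE(NULL,'') + '||' + COALESCE('c','');
-- 'a||||c': the NULL column contributes an empty string, the rest survives
```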

Ah, before I forget: I sort the column names alphabetically for hashing. That way I'm free to reorder columns (which we shouldn't do, but sometimes do), and I can reliably rehash things at any later time.

Remove true duplicates (NEVER DO THAT!)

It is included, but beware. REALLY! Wait until you actually hit a duplicate, and talk to the owner of your source system first! Only implement it if they can't fix it.

I use it only for stage tables that load into history. For satellites, I'll give you another option later.

ROW_NUMBER() OVER (PARTITION BY <business key columns> ORDER BY dss_load_datetime DESC, dss_sequence_no DESC)

With that I get the latest version of each business key. dss_sequence_no is a column I use for history loading; if you don't have it, it simply won't show up in the generated code.

Performance-wise, a ROW_NUMBER() calculation is a really bad thing to do: it slows the whole execution plan and kills any parallelism. But it is the easiest option. I tried an alternative that doesn't block until the internal sort of the ROW_NUMBER() is finished, but it takes a little more CPU and can't be modeled in this sub-query context.
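In context, the dedup expression sits in its own sub-select layer and filters on row number 1. A sketch with invented names (customer_code, #stage_customer):

```sql
-- Keep only the most recent row per business key
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_code
                            ORDER BY dss_load_datetime DESC) AS dss_row_no
  FROM   #stage_customer
) AS deduped
WHERE  dss_row_no = 1;
```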

Dimension Lookup

Just looking up the key.

COALESCE(<dimension table>.<artificial key>,0)

If you need to add missing dimension keys to the dimension table, I'll blog about a solution later. From within templates we can't call that get_dimension_key procedure anyway, and it isn't needed anymore.
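The lookup itself is a plain LEFT JOIN on the business key, with 0 as the fallback for unmatched rows. All names here are invented for the sketch:

```sql
SELECT s.*,
       COALESCE(d.dim_customer_key, 0) AS dim_customer_key   -- 0 = unknown member
FROM   #stage_sales AS s
LEFT   JOIN dim_customer AS d
       ON d.customer_code = s.customer_code;
```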

Result

After all this preparation, what does the resulting query look like?

INSERT INTO [TABLEOWNER].[stage_sat_demo] WITH ( TABLOCKX )
(
       hub_request
      ,shop_code
      ,session_id
      ,request_id
      ,times
      ,referrer
      ,url
      ,search_engine
      ,dss_load_datetime
      ,dss_record_source
      ,dss_change_hash
      ,dss_create_datetime
      ,dss_update_datetime
)
SELECT 
       stage_sat_demo.hub_request AS hub_request 
      ,stage_sat_demo.shop_code AS shop_code 
      ,stage_sat_demo.session_id AS session_id 
      ,stage_sat_demo.request_id AS request_id 
      ,stage_sat_demo.times AS times 
      ,stage_sat_demo.referrer AS referrer 
      ,stage_sat_demo.url AS url 
      ,stage_sat_demo.search_engine AS search_engine 
      ,stage_sat_demo.dss_load_datetime AS dss_load_datetime 
      ,stage_sat_demo.dss_record_source AS dss_record_source 
      ,stage_sat_demo.dss_change_hash AS dss_change_hash 
      ,@v_current_datetime AS dss_create_datetime 
      ,@v_current_datetime AS dss_update_datetime 
FROM   (
  SELECT stage_sat_demo.*
  FROM   (
    SELECT *
          ,ROW_NUMBER() OVER (PARTITION BY hub_request , shop_code , session_id , request_id ORDER BY dss_load_datetime DESC) AS dss_row_no
    FROM   (
      SELECT *
            ,HASHBYTES('md5',
                CAST(shop_code AS NVARCHAR(MAX)) +'||'+
                CAST(session_id AS NVARCHAR(MAX)) +'||'+
                CAST(request_id AS NVARCHAR(MAX))
                ) AS hub_request
            ,HASHBYTES('md5',
                COALESCE(CAST(referrer AS NVARCHAR(MAX)),'') +'||'+
                COALESCE(CAST(search_engine AS NVARCHAR(MAX)),'') +'||'+
                COALESCE(CAST(times AS NVARCHAR(MAX)),'') +'||'+
                COALESCE(CAST(url AS NVARCHAR(MAX)),'')
                ) AS dss_change_hash
      FROM   (
        SELECT 
               UPPER(LTRIM(RTRIM(COALESCE(NULLIF(shop_code,''),'-1')))) AS shop_code
              ,COALESCE(session_id,-1) AS session_id
              ,COALESCE(request_id,-1) AS request_id
              ,times AS times
              ,referrer AS referrer
              ,url AS url
              ,search_engine AS search_engine
              ,dss_load_datetime AS dss_load_datetime
              ,dss_record_source AS dss_record_source
        FROM   (
          SELECT 
                 source_sat_demo.shop_code AS shop_code 
                ,source_sat_demo.sid AS session_id 
                ,source_sat_demo.request_id AS request_id 
                ,source_sat_demo.times AS times 
                ,source_sat_demo.referrer AS referrer 
                ,source_sat_demo.url AS url 
                ,source_sat_demo.search_engine AS search_engine 
                ,DATEADD(NS,100*dss_sequence_no,dss_load_datetime) AS dss_load_datetime 
                ,source_sat_demo.dss_record_source AS dss_record_source 
          INTO   #stage_sat_demo
          FROM   [TABLEOWNER].[source_sat_demo]
          WHERE  dss_load_datetime > ( 
            SELECT ISNULL(MAX(t.dss_load_datetime) , '0001-01-01')
            FROM   [TABLEOWNER].[sat_demo] AS t
          )
        ) AS stage_sat_demo
      ) AS stage_sat_demo
    ) AS stage_sat_demo
  ) AS stage_sat_demo
  WHERE stage_sat_demo.dss_row_no = 1
) AS stage_sat_demo

Implementation

Now we have seen what can be done with it. How do you set it up? Add a stage table. In the properties, there is no need to distinguish between table types (Stage, Data Vault Stage, Permanent Stage, Work Table); Stage is just fine for every data load.

Set Rebuild with template.

Select the template nu3_sqlserver_procedure_stage and the utility templates, which you can grab from my GitHub repository at https://github.com/EckisWelt/WhereScape.

For removing duplicates, I decided the best place to control it is the Business Key Columns setting. NEVER define them, unless you actually want to get rid of duplicates! By default, RED will also create an index to enforce the key.

Define the Source JOIN and WHERE clause as you need.

That's it! It's a sleek, simple template without all the bells and whistles available in WhereScape, but it serves our purpose: pumping data easily into a stage table without having to think about any special treatment.

Please leave a comment if you like it, are going to use it and if you find bugs.