Dec 2, 2017
Usually, one starts with an easy template. The stage table is the most difficult one. The first running version was done in a few days, but perfecting it and adding more features has taken us months!
The stage table is used everywhere, so I'm starting with it. Where do we use a stage table?
Pretty much everywhere. What can we do with a stage table?
- Load data from any source table
- Transform data the way we want it
- Prepare data for the target table
Loading & transforming are done easily in RED: drag & drop a column from the source table into the stage table, add a transformation if needed. Done!
The real effort goes into preparing the data for the target table. What do the targets need?
|Target Table Type|Requirements|
|---|---|
|History|Adding a change hash for a performance boost; removing duplicates (optional)|
|Hub & Link|Cleaning of the business key; adding a hub hash key; adding a link hash key|
|Satellite|Adding a change hash|
|Data Mart|Nothing special|
|Fact|Lookup of the dimension key|
We have the option to create a stage template for each of them. Or we combine them into one, since they don't differ much and an unused feature doesn't affect performance.
Let's discuss those requirements in depth.
History
The goal in loading data into the history table is to persist only changes over time. How to accomplish that? You could compare each of your columns from the stage table to the history table. This results in a huge WHERE clause in the loading statement. Performance-wise this is really bad: SQL Server has to compare every single column, and that costs time. How to make it faster? Calculate a hash value over all values concatenated. Then SQL Server has to compare only one column. The disadvantage is that the hash calculation takes time, but much less than comparing each of the columns to the target.
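As a minimal sketch of the idea, assuming a hypothetical history table with a current-row flag (all table and column names here are invented for illustration), the load condition shrinks from one predicate per column to a single hash comparison:

```sql
-- Insert only new or changed rows into the history table.
INSERT INTO hist_customer (customer_code, name, city, dss_change_hash, dss_load_datetime)
SELECT s.customer_code, s.name, s.city, s.dss_change_hash, s.dss_load_datetime
FROM stage_customer AS s
LEFT JOIN hist_customer AS h
       ON h.customer_code = s.customer_code
      AND h.dss_current_flag = 'Y'            -- hypothetical current-row marker
WHERE h.customer_code IS NULL                 -- new business key
   OR h.dss_change_hash <> s.dss_change_hash; -- content changed: one comparison
-- Without the hash, the last predicate becomes one line per tracked column,
-- each needing NULL handling:
--    OR h.name <> s.name OR h.city <> s.city OR ...
```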
Hub & Link
Hub & link form only an abstraction layer over the data; we can group data in satellites around a business entity. What does it need? A clean business key. If it is a number: easy. But what about text? There might be spaces around it, mixed cases, and even characters which SQL Server won't recognize as different from one another unless the column uses a binary collation (e.g. „ß“ and „ss“). And
NULL values are no good for a business key either. Can you give me more information about a customer who doesn't exist?
We need to clean the business key so it is understood from all perspectives and doesn't lead to duplicates (e.g. a country name with and without a trailing space).
After the cleaning, we add a hash value over the key columns. In Data Vault 1.0 this was a surrogate key, but with a hash we can load data in parallel into any destination.
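As a sketch, the hub hash key is just HASHBYTES over the cleaned business key columns concatenated with a delimiter. This mirrors the hub_request expression in the generated query shown later in this post:

```sql
SELECT HASHBYTES('md5',
           CAST(shop_code  AS NVARCHAR(MAX)) + '||' +
           CAST(session_id AS NVARCHAR(MAX)) + '||' +
           CAST(request_id AS NVARCHAR(MAX))
       ) AS hub_request
FROM stage_sat_demo;
```

Because the hash is deterministic, every load process that sees the same cleaned business key produces the same key value, with no shared sequence to coordinate.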
Satellite
As above in history: gain performance when loading data into a satellite. And it belongs to the Data Vault standard.
Data Mart
Nothing special for the stage table. Loading data into dimensions is covered later.
Fact
If the join from a fact table to a dimension table is based on the business key, nothing needs to be done. But if we use a surrogate key in the dimension table, as many people do, we need to look up this value and add it to the stage table.
How to order all those requirements?
I must admit that the initial design originates from Dan Linstedt. At first I had more steps in between and a different order. I thought I was cleverer… But (sigh) over time I came to the same result.
At the beginning of my journey, I had the cleaning of business keys in the transformations. But sometimes I forgot to clean them. Just recently I found a way to clean them through the templates.
Which steps did I design to get all that into a fast working data flow?
- TRUNCATE TABLE stage_table
- Temporary table (only for tables which need hashing)
  - INSERT INTO #temp_table
  - SELECT + transform from source_table
- INSERT INTO stage_table
  - SUB-SELECT dimension lookup
  - SUB-SELECT remove true duplicates
  - SUB-SELECT calculate hash keys and change hash
  - SUB-SELECT clean business keys for hash keys
  - SUB-SELECT + transform from source_table or #temp_table
Do you wonder why I create a temporary table? As I load a lot of incremental data, I came across an odd behavior of SQL Server. Sometimes it would calculate the hash for every row of the whole source table and only then filter out the incremental data, e.g. calculating hashes for 100 million rows and keeping the last 100,000. The optimizer believes that calculating hashes takes no time. Even if I re-sort the sub-queries, it still comes up with the same execution plan. I need this #temp_table as an invisible stage table to persist the data, so only the incremental rows get hashed.
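A sketch of the pattern, simplified from the generated query shown later (the column list is shortened and the change-hash expression trimmed for readability):

```sql
-- Step 1: materialize only the incremental rows. No HASHBYTES here, so the
-- optimizer has no chance to hash the full source table before filtering.
SELECT shop_code, session_id, request_id, referrer, url,
       dss_load_datetime, dss_record_source
INTO   #stage_sat_demo
FROM   [TABLEOWNER].[source_sat_demo]
WHERE  dss_load_datetime >
       ( SELECT ISNULL(MAX(t.dss_load_datetime), '0001-01-01')
         FROM   [TABLEOWNER].[sat_demo] AS t );

-- Step 2: hash only the small temp table instead of the 100-million-row source.
SELECT *,
       HASHBYTES('md5',
           COALESCE(CAST(referrer AS NVARCHAR(MAX)), '') + '||' +
           COALESCE(CAST(url      AS NVARCHAR(MAX)), '')
       ) AS dss_change_hash
FROM   #stage_sat_demo;
```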
By using sub-selects, I can focus on each part to give a perfect result back; I don't need to mix things up. Initially we had those sub-selects as views. A lot of objects! With templates, it is now a no-brainer.
Extract & Transform
The template takes the source column or the transformation on it.
Clean Business Key for Hash Keys (other Business Keys are not touched)
There are many options out there. I decided to go with the following:
|Data Type|Cleaning Rule|
|---|---|
|CHAR|An empty CHAR column is set to NULL, and every NULL column is set to '-1'. Then remove any surrounding spaces and set it to upper case.|
|INT|An empty INT column is set to -1. 0 (zero) is a good value; no need to NULL it.|
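Put into T-SQL, this is the same cleaning pattern the generated query below uses for shop_code and session_id:

```sql
SELECT
    -- CHAR business key: empty string becomes NULL, NULL becomes '-1',
    -- then trim surrounding spaces and force upper case
    UPPER(LTRIM(RTRIM(COALESCE(NULLIF(shop_code, ''), '-1')))) AS shop_code,
    -- INT business key: NULL becomes -1; zero is a legitimate value and stays
    COALESCE(session_id, -1) AS session_id
FROM source_sat_demo;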
On SQL Server, use a
BINARY data type to store the hash. It uses half the space, it is the default output of HASHBYTES, and SQL Server doesn't need to cast it to
CHAR. Much better performance. Hashing is already slow; no need to make it slower. If you don't know how to query for a binary value, just write the value without quotes, like an integer value.
SELECT * FROM hub_product WHERE hub_product = 0x02D...
The template can handle both: if you store the hash in
CHAR it will cast it, otherwise not.
Compute Hash Keys and Change Hash
Casting everything to
NVARCHAR(MAX) is the easiest thing.
NVARCHAR has the best implicit conversion from any data type to string; the results look beautiful. Only
DATETIME and one other type (which I can't remember) look ugly. But as I only use
DATETIME2, I don't come across that. My first version gave each data type its own string conversion, but this is much easier to understand and use. Don't ever cast to
VARCHAR(MAX) only. I lost data with that. As long as there is no true UTF-8 character set (with variable byte length) for
VARCHAR, stick with UTF-16 in NVARCHAR.
For hash keys, everything is already cleaned in the previous step. When I had the cleaning in the first step at select & transform, I occasionally forgot to add it and saved bad business keys because they were not cleaned. But now they are. Always! I love not having to think about obstacles.
For the change hash, we just have to make sure that we don't concatenate
NULLs, because a single NULL empties the whole string. Some people also apply
RTRIM, but then I would lose a change from the source system, which I don't like. I analyzed the impact of skipping it and found it negligible.
Ah, before I forget: I sort the column names alphabetically for hashing. That way I'm free to reorder any column (which we shouldn't do, but sometimes do), and I can rehash everything at any later time.
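Putting the change-hash rules together: NULL-safe casts, a delimiter between values, and columns in alphabetical order (referrer, search_engine, times, url). This is exactly the expression in the generated query below:

```sql
SELECT HASHBYTES('md5',
           COALESCE(CAST(referrer      AS NVARCHAR(MAX)), '') + '||' +
           COALESCE(CAST(search_engine AS NVARCHAR(MAX)), '') + '||' +
           COALESCE(CAST(times         AS NVARCHAR(MAX)), '') + '||' +
           COALESCE(CAST(url           AS NVARCHAR(MAX)), '')
       ) AS dss_change_hash
FROM source_sat_demo;
```

The '||' delimiter keeps adjacent values from running together, so ('ab', 'c') and ('a', 'bc') hash differently.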
Remove true duplicates (NEVER DO THAT!)
It is included, but be aware. REALLY! Wait until you hit a duplicate and talk to the owner of your source system. Implement it only if he can't fix it.
I use it only for a stage table that loads data into a history table. For satellites, I'll give you another option later.
ROW_NUMBER() OVER (PARTITION BY <business key columns> ORDER BY dss_load_datetime DESC, dss_sequence_no DESC)
With that I'm getting the latest version of a business key.
dss_sequence_no is a column I use for history loading; if you don't have it, it won't show up in the generated code.
Performance-wise, a
ROW_NUMBER() calculation is a really bad thing to do: it slows the whole execution plan and kills any parallelism. But it is the easiest option. I tried another one which doesn't wait until the internal sort of the
ROW_NUMBER() is finished, but it takes a little more CPU and can't be modeled in this sub-query context.
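A sketch of the dedupe wrapper, using the business key columns from the demo query later in this post (dss_sequence_no omitted; the inner table name is invented):

```sql
SELECT *
FROM ( SELECT *,
              ROW_NUMBER() OVER (PARTITION BY shop_code, session_id, request_id
                                 ORDER BY dss_load_datetime DESC) AS dss_row_no
       FROM   stage_sat_demo_input ) AS dedup
WHERE  dedup.dss_row_no = 1;  -- keep only the latest version per business key
```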
Dimension Lookup
Just looking up the key.
COALESCE(<dimension table>.<artificial key>,0)
If you need to add missing dimension keys to the dimension table, I'm going to blog about a solution later. In templates, we can't call the
get_dimension_key procedure. Anyway, it isn't needed anymore.
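A minimal sketch of the lookup sub-select (all table and column names are invented for illustration):

```sql
SELECT s.product_code,
       s.quantity,
       -- 0 stands in for a missing dimension member, as in the COALESCE above
       COALESCE(d.dim_product_key, 0) AS dim_product_key
FROM   stage_fact_sales AS s
LEFT JOIN dim_product   AS d
       ON d.product_code = s.product_code;
```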
After all those preparations, what does the resulting query look like?
INSERT INTO [TABLEOWNER].[stage_sat_demo] WITH ( TABLOCKX )
( hub_request
, shop_code
, session_id
, request_id
, times
, referrer
, url
, search_engine
, dss_load_datetime
, dss_record_source
, dss_change_hash
, dss_create_datetime
, dss_update_datetime
)
SELECT stage_sat_demo.hub_request AS hub_request
, stage_sat_demo.shop_code AS shop_code
, stage_sat_demo.session_id AS session_id
, stage_sat_demo.request_id AS request_id
, stage_sat_demo.times AS times
, stage_sat_demo.referrer AS referrer
, stage_sat_demo.url AS url
, stage_sat_demo.search_engine AS search_engine
, stage_sat_demo.dss_load_datetime AS dss_load_datetime
, stage_sat_demo.dss_record_source AS dss_record_source
, stage_sat_demo.dss_change_hash AS dss_change_hash
, @v_current_datetime AS dss_create_datetime
, @v_current_datetime AS dss_update_datetime
FROM (
  SELECT stage_sat_demo.*
  FROM (
    SELECT *
    , ROW_NUMBER() OVER (PARTITION BY hub_request, shop_code, session_id, request_id
                         ORDER BY dss_load_datetime DESC) AS dss_row_no
    FROM (
      SELECT *
      , HASHBYTES('md5',
            CAST(shop_code AS NVARCHAR(MAX))
            + '||' + CAST(session_id AS NVARCHAR(MAX))
            + '||' + CAST(request_id AS NVARCHAR(MAX))
        ) AS hub_request
      , HASHBYTES('md5',
            COALESCE(CAST(referrer AS NVARCHAR(MAX)), '')
            + '||' + COALESCE(CAST(search_engine AS NVARCHAR(MAX)), '')
            + '||' + COALESCE(CAST(times AS NVARCHAR(MAX)), '')
            + '||' + COALESCE(CAST(url AS NVARCHAR(MAX)), '')
        ) AS dss_change_hash
      FROM (
        SELECT UPPER(LTRIM(RTRIM(COALESCE(NULLIF(shop_code,''),'-1')))) AS shop_code
        , COALESCE(session_id,-1) AS session_id
        , COALESCE(request_id,-1) AS request_id
        , times AS times
        , referrer AS referrer
        , url AS url
        , search_engine AS search_engine
        , dss_load_datetime AS dss_load_datetime
        , dss_record_source AS dss_record_source
        FROM (
          SELECT source_sat_demo.shop_code AS shop_code
          , source_sat_demo.sid AS session_id
          , source_sat_demo.request_id AS request_id
          , source_sat_demo.times AS times
          , source_sat_demo.referrer AS referrer
          , source_sat_demo.url AS url
          , source_sat_demo.search_engine AS search_engine
          , DATEADD(NS, 100 * dss_sequence_no, dss_load_datetime) AS dss_load_datetime
          , source_sat_demo.dss_record_source AS dss_record_source
          INTO #stage_sat_demo
          FROM [TABLEOWNER].[source_sat_demo]
          WHERE dss_load_datetime > (
            SELECT ISNULL(MAX(t.dss_load_datetime), '0001-01-01')
            FROM [TABLEOWNER].[sat_demo] AS t
          )
        ) AS stage_sat_demo
      ) AS stage_sat_demo
    ) AS stage_sat_demo
  ) AS stage_sat_demo
  WHERE stage_sat_demo.dss_row_no = 1
) AS stage_sat_demo
Now we have seen what can be done with it. How to set it up? Add a stage table. In the preferences, there is no need to distinguish between table types (Stage, Data Vault Stage, Permanent Stage, Work Table). Stage is just fine for every data load.
Set Rebuild with template.
To remove duplicates, I thought the best place would be the Business Key Columns. NEVER define them! Define them only if you want to get rid of duplicates; by default, an index will be created to enforce them.
Define the Source JOIN and WHERE clause as you need.
That's it! It's a sleek, simple template without all the bells and whistles available from WhereScape. But it serves our purpose of pumping data into a stage table easily, without having to think about any special treatments.
Please leave a comment if you like it, are going to use it and if you find bugs.