Dec 2 2017
Template: Stage
Usually one starts with an easy template. The stage table, however, is the most difficult one. The first running version was done in a few days, but perfecting it and adding more features has taken us months!
The stage table is used everywhere, so I'm starting with it. Where do we use a stage table? Pretty much everywhere. What can we do with a stage table?
- Load data from any source table
- Transform data the way we want it
- Prepare data for the target table
Loading & transforming is done easily in RED: drag & drop a column from the source table into the stage table, add a transformation if needed. Done!
The real effort goes into preparing the data for the target table. What does each target table type need?
Target Table Type | Requirements |
---|---|
History | Adding a Change Hash for a performance boost; removing duplicates (optional) |
Hub & Link | Cleaning of the Business Key; adding a Hub Hash Key; adding a Link Hash Key |
Satellite | Adding a Change Hash |
Data Mart | Nothing special |
Fact | Lookup of the Dimension Key |
We have the option to create a separate stage template for each of them, or to combine them into one, since they don't differ much and unused features don't affect performance.
Let's discuss those requirements in depth.
Requirements
History
The goal in loading data into the history table is to persist only the changes over time. How do we accomplish that? One option is to compare each column of the stage table with the history table. This results in a huge WHERE clause in the loading statement, and performance-wise it is really bad: SQL Server has to compare every single column, and that costs time. How can we make it faster? Calculate a hash value over all values concatenated; then SQL Server only has to compare one column. The disadvantage is that the hash calculation itself takes time, but much less than comparing each column against the target.
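To make the difference concrete, here is a minimal, hedged sketch (hist_customer, stage_customer and their columns are hypothetical names; dss_current_flag marks the currently valid history row):

-- Sketch only: table and column names are hypothetical.
-- With a change hash, a single comparison is enough:
INSERT INTO hist_customer (customer_code, name, city, dss_change_hash, dss_load_datetime)
SELECT s.customer_code, s.name, s.city, s.dss_change_hash, s.dss_load_datetime
FROM stage_customer AS s
LEFT JOIN hist_customer AS h
       ON h.customer_code = s.customer_code
      AND h.dss_current_flag = 'Y'
WHERE h.customer_code IS NULL                   -- new business key
   OR h.dss_change_hash <> s.dss_change_hash;   -- something changed
-- Without the hash, the WHERE clause would need one
-- ISNULL(h.column,'') <> ISNULL(s.column,'') predicate per column instead.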
Hub & Link
Hubs and links only form an abstraction layer over the data; we can group data in satellites around a business entity. What does that layer need? A clean business key. If the key is a number, easy. But what about text? There might be spaces around it, mixed cases and even characters which SQL Server won't recognize as different from one another if the column is not set to a binary collation (e.g. "ß" and "ss"). And NULL values are no good for a business key either: can you give me more information about a customer who doesn't exist?
We need to clean the business key so that it is understood from all perspectives and doesn't lead to duplicates (e.g. a country name with a trailing space and one without).
After the cleaning, we add a hash value over the key. In Data Vault 1.0 this was a surrogate key, but with a hash we can load data in parallel into any destination.
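To make this concrete, here is a minimal sketch of what such a hub hash key could look like (hub_customer, source_customer, country_code and customer_no are hypothetical names; in the real template the cleaning and the hashing happen in separate sub-selects, as shown further below):

-- Sketch only: names are hypothetical; cleaning is inlined here for brevity
SELECT
    HASHBYTES('md5',
        CAST(ISNULL(NULLIF(UPPER(LTRIM(RTRIM(country_code))),''),'-1') AS NVARCHAR(MAX)) +'||'+
        CAST(ISNULL(customer_no,-1) AS NVARCHAR(MAX))
    ) AS hub_customer
FROM source_customer;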
Satellite
Same as for the history table: the change hash speeds up loading data into the satellite. And it belongs to the Data Vault standard.
Dimension
Nothing special for the stage table. Loading data into a dimension will be covered later.
Fact
If the join from the fact table to the dimension table is based on the business key, nothing needs to be done. But if we use a surrogate key in the dimension table, as many people do, we need to look up this value and add it to the stage table.
Pipeline
How to order all those requirements?
I must admit that the initial design originates from Dan Linstedt. At first I had more steps in between and a different order; I thought I was cleverer… But (sigh) over time I came to the same result.
At the beginning of my journey, I had the cleaning of business keys in the transformations. But sometimes I forgot to clean them. Just recently I found a way to clean them through the templates.
Stored Procedure
These are the steps I designed to get all of that into one fast data flow:
- TRUNCATE TABLE stage_table
- Temporary Table (only for tables which need hashing)
  - INSERT INTO #temp_table
  - SELECT + TRANSFORM from source_table
- Target Table
  - INSERT INTO stage_table
  - SELECT Result
    - SUB-SELECT Dimension Lookup
    - SUB-SELECT Remove true duplicates
    - SUB-SELECT Calculate Hash Keys and Change Hash
    - SUB-SELECT Clean Business Key for Hash Keys
    - SUB-SELECT + Transform from source_table or #temp_table
Do you wonder why I create a temporary table? As I load a lot of incremental data, I came across an odd behavior of SQL Server: sometimes it would calculate the hash for every row of the whole source table and only then filter out the incremental data, e.g. calculating hashes for 100 million rows and keeping the last 100,000. The optimizer believes that calculating hashes takes no time at all. Even if I re-order the sub-queries, it still comes up with the same execution plan. I need this #temp_table as an invisible stage table to persist the incremental slice, so that only those rows get hashed.
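To make the flow concrete, here is a simplified sketch of how those statements could follow one another inside the generated procedure (names follow the demo below; the incremental filter is shortened, and this is not the literal template output):

-- Simplified sketch, not the literal template output
TRUNCATE TABLE [TABLEOWNER].[stage_sat_demo];

-- 1) Persist the incremental slice first, so only these rows get hashed
SELECT shop_code, sid AS session_id, request_id, times, referrer, url, search_engine,
       dss_load_datetime, dss_record_source
INTO   #stage_sat_demo
FROM   [TABLEOWNER].[source_sat_demo]
WHERE  dss_load_datetime > (SELECT ISNULL(MAX(dss_load_datetime),'0001-01-01')
                            FROM [TABLEOWNER].[sat_demo]);

-- 2) Clean, hash, de-duplicate and insert from the temp table
INSERT INTO [TABLEOWNER].[stage_sat_demo] WITH (TABLOCKX) ( /* column list */ )
SELECT /* stacked sub-selects, see the Result section below */
FROM   #stage_sat_demo;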
By using sub-selects, I can focus on each part returning a perfect result; I don't need to mix things up. Initially we had those sub-selects as views. A lot of objects! With templates, it is now a no-brainer.
Extract & Transform
The template takes either the source column or the transformation defined on it.
Clean Business Key for Hash Keys (other Business Keys are not touched)
There are many options out there. I decided to go with the following:
Data Type | Cleaning |
---|---|
CHAR Data Types | ISNULL(NULLIF(UPPER(LTRIM(RTRIM(source_column))),''),'-1') |
INT Data Types | ISNULL(source_column,-1) |
DATE Data Types | ISNULL(source_column,'0001-01-01') |
For a CHAR column, the value is trimmed and upper-cased; an empty string is then turned into NULL, and every NULL is replaced with '-1'. If an INT column is NULL, it is set to -1; 0 (zero) is a legitimate value, so there is no need to NULL it first.
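A quick illustration of what these cleaning rules do (the input values are made up):

-- Illustration only; input values are made up
SELECT ISNULL(NULLIF(UPPER(LTRIM(RTRIM(' Germany '))),''),'-1');  -- 'GERMANY'
SELECT ISNULL(NULLIF(UPPER(LTRIM(RTRIM('germany'))),''),'-1');    -- 'GERMANY' (same key, no duplicate)
SELECT ISNULL(NULLIF(UPPER(LTRIM(RTRIM(''))),''),'-1');           -- '-1' (empty becomes the NULL replacement)
SELECT ISNULL(CAST(NULL AS INT), -1);                             -- -1
SELECT ISNULL(CAST(NULL AS DATE), '0001-01-01');                  -- 0001-01-01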
On SQL Server, use a BINARY data type to store the hash. It uses half the space, it is the default output of HASHBYTES, and SQL Server doesn't need to cast it to CHAR. Much better performance. Hashing is already slow; no need to make it slower. If you don't know how to query for a binary value, just write the value without quotes, like an integer value:
SELECT * FROM hub_product WHERE hub_product = 0x02D...
The template can handle both: if you store the hash in a CHAR column it will cast it, otherwise it won't.
Compute Hash Keys and Change Hash
Hash | Calculation |
---|---|
Hash Key | HASHBYTES('md5', <cleaned Business Key columns, concatenated with a separator>) |
Change Hash | HASHBYTES('md5', <NULL-safe descriptive columns, concatenated with a separator>) |
Casting everything to NVARCHAR(MAX) is the easiest approach. NVARCHAR has the best implicit conversion from any data type to a string, and the results look clean. Only DATETIME and one other type (which I can't remember) look ugly, but as I only use DATETIME2 I don't run into that. My first version generated a separate conversion string for each data type, but this is much easier to understand and use. Don't ever cast to VARCHAR(MAX) only; I lost data with that. As long as there is no true UTF-8 character set (with variable byte length) for VARCHAR, stick with UTF-16 in NVARCHAR.
For the hash keys, everything is already cleaned in the previous step. When I had the cleaning in the first step at select & transform, I occasionally forgot to add it and saved bad business keys because they were not cleaned. But now they always are. I love not having to think about such obstacles.
For the change hash, we just have to make sure that we don't get NULLs; with a NULL in the concatenation, the whole string becomes NULL. Some people also apply RTRIM, but then I would lose a change coming from the source system, which I don't like. I analyzed the impact of leaving RTRIM out and found it negligible.
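A tiny demonstration of why the ISNULL() wrappers matter (assuming SQL Server's default CONCAT_NULL_YIELDS_NULL behavior):

-- One NULL wipes out the whole concatenation, and HASHBYTES over NULL returns NULL:
SELECT HASHBYTES('md5', CAST('a' AS NVARCHAR(MAX)) +'||'+ CAST(NULL AS NVARCHAR(MAX)));             -- NULL
-- Wrapped in ISNULL, the remaining values still produce a valid change hash:
SELECT HASHBYTES('md5', CAST('a' AS NVARCHAR(MAX)) +'||'+ ISNULL(CAST(NULL AS NVARCHAR(MAX)),''));  -- 0x...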
Ah, before I forget: I sort the column names alphabetically for hashing. That way I am free to reorder any column (which we shouldn't do, but sometimes do), and I can reproduce the hashes if I ever need to rehash data later.
Remove true duplicates (NEVER DO THAT!)
It is included, but be careful. REALLY! Wait until you actually hit a duplicate and talk to the owner of your source system; only implement it if they can't fix it.
I use it only for stage tables that load data into a history table. For satellites, I will give you another option later.
ROW_NUMBER() OVER (PARTITION BY <business key columns> ORDER BY dss_load_datetime DESC, dss_sequence_no DESC)
With that I get the latest version of each business key. The dss_sequence_no is a column I use for history loading; if you don't have it, it won't show up in the generated code.
Performance-wise, a ROW_NUMBER() calculation is a really bad thing to do: it slows down the whole execution plan and kills any parallelism. But it is the easiest option. I tried an alternative that doesn't have to wait until the internal sort of ROW_NUMBER() is finished, but it takes a little more CPU and can't be modeled in this sub-query context.
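Isolated from the generated query shown below, the de-duplication sub-select looks roughly like this (columns follow the demo; this is a sketch, not the literal template output):

-- Sketch: keep only the latest row per business key
SELECT ranked.*
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY shop_code, session_id, request_id
                              ORDER BY dss_load_datetime DESC) AS dss_row_no
    FROM #stage_sat_demo
) AS ranked
WHERE ranked.dss_row_no = 1;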
Dimension Lookup
Just looking up the key.
COALESCE(<dimension table>.<artificial key>,0)
If you need to add missing dimension keys to the dimension table, I am going to blog about a solution later. In templates, we can't call this get_dimension_key procedure. Anyway, it is not needed anymore.
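As a hedged sketch of the lookup itself (dim_customer, customer_code and dim_customer_key are hypothetical names):

-- Sketch only: look up the dimension's surrogate key, defaulting to 0 if not found
SELECT
    COALESCE(dim_customer.dim_customer_key, 0) AS dim_customer_key,
    stage_fact_sales.*
FROM stage_fact_sales
LEFT JOIN dim_customer
       ON dim_customer.customer_code = stage_fact_sales.customer_code;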
Result
After all this preparation, what does the resulting query look like?
INSERT INTO [TABLEOWNER].[stage_sat_demo] WITH ( TABLOCKX )
(
hub_request
,shop_code
,session_id
,request_id
,times
,referrer
,url
,search_engine
,dss_load_datetime
,dss_record_source
,dss_change_hash
,dss_create_datetime
,dss_update_datetime
)
SELECT
stage_sat_demo.hub_request AS hub_request
,stage_sat_demo.shop_code AS shop_code
,stage_sat_demo.session_id AS session_id
,stage_sat_demo.request_id AS request_id
,stage_sat_demo.times AS times
,stage_sat_demo.referrer AS referrer
,stage_sat_demo.url AS url
,stage_sat_demo.search_engine AS search_engine
,stage_sat_demo.dss_load_datetime AS dss_load_datetime
,stage_sat_demo.dss_record_source AS dss_record_source
,stage_sat_demo.dss_change_hash AS dss_change_hash
,@v_current_datetime AS dss_create_datetime
,@v_current_datetime AS dss_update_datetime
FROM (
SELECT stage_sat_demo.*
FROM (
SELECT *
,ROW_NUMBER() OVER (PARTITION BY hub_request , shop_code , session_id , request_id ORDER BY dss_load_datetime DESC) AS dss_row_no
FROM (
SELECT *
,HASHBYTES('md5',
CAST(shop_code AS NVARCHAR(MAX)) +'||'+
CAST(session_id AS NVARCHAR(MAX)) +'||'+
CAST(request_id AS NVARCHAR(MAX))
) AS hub_request
,HASHBYTES('md5',
ISNULL(CAST(referrer AS NVARCHAR(MAX)),'') +'||'+
ISNULL(CAST(search_engine AS NVARCHAR(MAX)),'') +'||'+
ISNULL(CAST(times AS NVARCHAR(MAX)),'') +'||'+
ISNULL(CAST(url AS NVARCHAR(MAX)),'')
) AS dss_change_hash
FROM (
SELECT
ISNULL(NULLIF(UPPER(LTRIM(RTRIM(shop_code))),''),'-1') AS shop_code
,ISNULL(session_id,-1) AS session_id
,ISNULL(request_id,-1) AS request_id
,times AS times
,referrer AS referrer
,url AS url
,search_engine AS search_engine
,dss_load_datetime AS dss_load_datetime
,dss_record_source AS dss_record_source
FROM (
SELECT
source_sat_demo.shop_code AS shop_code
,source_sat_demo.sid AS session_id
,source_sat_demo.request_id AS request_id
,source_sat_demo.times AS times
,source_sat_demo.referrer AS referrer
,source_sat_demo.url AS url
,source_sat_demo.search_engine AS search_engine
,DATEADD(NS,100*dss_sequence_no,dss_load_datetime) AS dss_load_datetime
,source_sat_demo.dss_record_source AS dss_record_source
INTO #stage_sat_demo
FROM [TABLEOWNER].[source_sat_demo]
WHERE dss_load_datetime > (
SELECT ISNULL(MAX(t.dss_load_datetime) , '0001-01-01')
FROM [TABLEOWNER].[sat_demo] AS t
)
) AS stage_sat_demo
) AS stage_sat_demo
) AS stage_sat_demo
) AS stage_sat_demo
WHERE stage_sat_demo.dss_row_no = 1
) AS stage_sat_demo
Implementation
Now we have seen what can be done with it. How do we set it up? Add a stage table. In the preferences, there is no need to distinguish between table types (Stage, Data Vault Stage, Permanent Stage, Work Table); Stage is just fine for every data load.
Set Rebuild with template.
Select the template nu3_sqlserver_procedure_stage and the utility templates, which you can grab from my GitHub repository at https://github.com/EckisWelt/WhereScape.
For removing duplicates, I decided the best place is the Business Key Columns. NEVER define them! Only do so if you want to get rid of duplicates; by default, an index will be created to enforce them.
Define the Source JOIN and WHERE clause as you need.
That's it! It's a sleek, simple template without all the bells and whistles available in WhereScape. But it serves our purpose of pumping data into a stage table without having to think about any special treatments.
Please leave a comment if you like it, if you are going to use it, or if you find bugs.
Dec 3 2017
Template: Stage (View)
What is the fastest way to ingest data? To persist or to stream? A view comes close to streaming data. Persisting is always kind of stop-and-go in my view.
I wanted to wait until the end of my little template story, but I couldn't resist talking about it now. I'm too excited about it!
When I started, I thought I could design any object either as a table or as a view. But all but one of those object types are fixed to TABLE, and only the view type is VIEW. I created a ticket to allow any object to be created as either TABLE or VIEW. Maybe it is still on the development list.
What's the story behind this? I had a couple of stage tables for an export and was building up a network of data snippets, each of them persisted. What happens when I need to create a job? I have to chain all those little steps together and check their dependencies. Others would maybe build a custom stored procedure, but that kills parallelism. Why not create little views instead?
Another reason: why do I need to persist a staging table at all when it could be "in transit" on its way into a satellite? No need to wait for persistence.
Recently, while exploring the DDL function to create special-purpose views (e.g. a "current" view for satellites), it literally hit me in the face: would it be possible to use the DDL function on any object to create views instead of tables?
I couldn't wait. I copied an existing view DDL and added some code from my stage template. Done.
So easy and so obvious!
Now I can create stage tables and stage views with the same functionality.
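Just to illustrate the idea, here is a minimal sketch of what such a generated stage view might look like (names follow the stage demo from the previous post; the real template produces the full sub-select stack, so this is not its literal output, and [TABLEOWNER] is a WhereScape placeholder resolved at generation time):

-- Minimal sketch, not the literal template output
CREATE VIEW [TABLEOWNER].[stage_sat_demo] AS
SELECT
    HASHBYTES('md5',
        CAST(ISNULL(NULLIF(UPPER(LTRIM(RTRIM(shop_code))),''),'-1') AS NVARCHAR(MAX)) +'||'+
        CAST(ISNULL(sid,-1)        AS NVARCHAR(MAX)) +'||'+
        CAST(ISNULL(request_id,-1) AS NVARCHAR(MAX))
    ) AS hub_request,
    shop_code,
    sid AS session_id,
    request_id,
    times,
    referrer,
    url,
    search_engine,
    dss_load_datetime,
    dss_record_source
FROM [TABLEOWNER].[source_sat_demo];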
Unfortunately, WhereScape has some safety checks around it (e.g. a view can't be dropped when the object type expects a table). Hopefully they will fix that at some point.
Download the template nu3_sqlserver_ddl_view_stage.peb and the utility templates from my GitHub repository at https://github.com/EckisWelt/WhereScape.
By ecki • Business Intelligence