Apr 7 2018
While working on the templates I published recently, I noticed similar-looking patterns across object types. E.g. the persisted history of a satellite looks the same as in a persistent staging area (except for the primary key, of course). Or loading a dimension is like loading a hub (just that hubs have no change hash), so a dimension loading pattern would nearly work for hubs.
I concluded that instead of having 7 templates (one per object type), 2 would be enough: one for stage and one for any target, since they all share similar patterns.
While compiling all the patterns, I realized that even 1 template would be enough: a universal loading pattern, usable for any target type!
As we are loading data with stored procedures generated by WhereScape, read the following sections as sub-queries: each layer adds value to the result set. This also makes maintaining the template code much easier, and SQL Server will find the best way to execute them.
I separated them into stage and target loading for easier reading. But in reality, it is just one big data flow.
Stage Loading Process
What would the loading pattern for a stage object look like?
Only little has changed since my first draft.
Extract & Transform
The only user-generated logic is defined here: a simple SELECT statement plus some metadata about what to compute in addition, e.g. hash keys, change hashes or dimension keys. Everything afterwards is automated data preparation.
A stage table can optionally have a link to its target. With that in place, delta loading is possible and will add the following:
WHERE dss_load_datetime > (SELECT COALESCE(MAX(dss_load_datetime),'0001-01-01') FROM [Target])
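As a minimal sketch, this is how the delta predicate wraps the user-defined extract. The source and target names (`source_customer`, `dim_customer`) and columns are hypothetical; only `dss_load_datetime` comes from the text above.

```sql
-- Hypothetical delta load of a stage table linked to dim_customer.
-- The inner sub-query is the user-defined extract & transform;
-- the outer WHERE is the automatically appended delta predicate.
SELECT src.customer_code,
       src.customer_name,
       src.dss_load_datetime
FROM (
    SELECT customer_code, customer_name, dss_load_datetime
    FROM source_customer
) AS src
WHERE src.dss_load_datetime > (
    SELECT COALESCE(MAX(dss_load_datetime), '0001-01-01')
    FROM dim_customer  -- the linked target
);
```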
CAST to Target Data Type
Hashing is a beast of its own. It looks trivial but can still be tricky. Just recently we figured out that we need another step in between: casting to the target data type. E.g. the source column is of data type
datetime while the target column has
date. Of course, nobody does that ;-)! Hashing without casting would mean hashing e.g. ‘2018-03-08 12:13:14’ instead of just the date ‘2018-03-08’, and any rehashing later will fail. Of course, this should have been taken care of in the transformation, but other examples can be found. This step merely turns the invisible implicit conversion into a visible explicit transformation before the hashing step.
Another way to model it is to create a temporary table with the same data types as the stage table and persist the result before going into the hashing function. This is faster in delta loading anyway, because SQL Server usually comes up with odd execution plans otherwise.
Clean Business Key for Hash Keys (other Business Keys are not touched)
There are many options out there. I decided to go with the following:
| Data Type | Cleaning Rule |
|---|---|
| CHAR | If a CHAR column is empty, I set it to NULL, and every NULL column is set to -1. Then remove any spaces and set it to upper case. |
| INT | If an INT column is empty, set it to -1. 0 (zero) is a valid value; no need to NULL it. |
| DATE | |
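The CHAR and INT rules above could be expressed like this; the table and column names are hypothetical.

```sql
-- Clean business keys before hash key computation (illustrative names).
SELECT
    -- CHAR key: empty -> NULL, NULL -> '-1', strip spaces, upper-case
    UPPER(REPLACE(COALESCE(NULLIF(customer_code, ''), '-1'), ' ', ''))
        AS clean_customer_code,
    -- INT key: NULL -> -1; zero is a valid value and stays as-is
    COALESCE(account_no, -1) AS clean_account_no
FROM stage_customer;
```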
Compute Hash Keys and Change Hash
Casting everything to
NVARCHAR(MAX) is the easiest approach.
NVARCHAR has the best implicit conversion from any data type to string. Don’t ever cast to
VARCHAR(MAX) only: as long as there is no true UTF-8 character set (with variable byte length) for
VARCHAR, stick with UTF-16 in
NVARCHAR. The results look good. Only
DATETIME and one other type (which I can’t remember) look ugly, but as I only use
DATETIME2 I don’t run into it.
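A sketch of the hash computation with HASHBYTES, casting every input to NVARCHAR(MAX) first. All names are illustrative; the delimiter guards against concatenation collisions between adjacent values.

```sql
-- Hash key (from the cleaned business key) and change hash (from the
-- descriptive attributes). Note: before SQL Server 2016, HASHBYTES
-- input was limited to 8000 bytes; 2016+ lifts that limit.
SELECT
    HASHBYTES('MD5',
        COALESCE(CAST(customer_code AS NVARCHAR(MAX)), N'')
    ) AS hub_customer_hash_key,
    HASHBYTES('MD5',
        COALESCE(CAST(customer_name AS NVARCHAR(MAX)), N'') + N'||' +
        COALESCE(CAST(birth_date    AS NVARCHAR(MAX)), N'')
    ) AS dss_change_hash
FROM stage_customer;
```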
This step just looks up the surrogate key.
Removing true duplicates. Be aware. REALLY! Wait until you actually hit a duplicate, then talk to the owner of your source system, and implement this step only if they can’t fix it.
Preparing everything from the previous steps and sorting the columns in the right order for the INSERT statement.
Post-Process Update Dimension Key
If a business key in the stage table for a fact object doesn’t exist in the dimension, we can add it if that is a desired outcome. After
INSERTing the missing business key into the dimension, we need to update the stage table with the new dimension key.
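A sketch of this two-step post-process, with hypothetical table and column names:

```sql
-- 1) Insert business keys that are missing from the dimension
--    as placeholder rows.
INSERT INTO dim_customer (customer_code, customer_name)
SELECT DISTINCT s.customer_code, 'Unknown'
FROM stage_sales AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dim_customer AS d
                  WHERE d.customer_code = s.customer_code);

-- 2) Update the stage table with the newly generated dimension keys.
UPDATE s
SET s.dim_customer_key = d.dim_customer_key
FROM stage_sales   AS s
JOIN dim_customer  AS d ON d.customer_code = s.customer_code
WHERE s.dim_customer_key IS NULL;
```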
Target Loading Process
The target loading can be as simple as a short INSERT statement.
These are the differences between the target object types:
- Persist timeline calculation? (e.g. having EndDate in a table or calculate it through views)
- Update existing or insert as new row? (e.g. Dimension vs Satellite)
- Is it allowed to insert “in-between” data? (e.g. Satellite)
This is where the real magic happens and where every target type differs. Now is the right time to figure out whether we need to persist all the rows.
| Target Type | Condensing Rule |
|---|---|
| Stage | We need only the last record per Business Key (if one is specified) |
| Data Store | We need only changes in rows, sorted by LoadDate |
| Hub | We need only the first record per Hub Hash Key |
| Link | We need only the first record per Link Hash Key |
| Satellite | We need only changes in rows, sorted by LoadDate and by EventDate |
| Dimension | Dimensions without SCD should already be unique in stage. No condensing. |
| Fact | Should already be unique by Business Key in stage. No condensing. |
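Two of the condensing rules above, sketched with window functions; all object and column names are illustrative.

```sql
-- Hub/Link: keep only the first record per hash key.
SELECT *
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY hub_customer_hash_key
                              ORDER BY dss_load_datetime ASC) AS rn
    FROM stage_customer AS s
) AS x
WHERE x.rn = 1;

-- Satellite/Data Store: keep only rows whose change hash differs
-- from the previous row in the timeline.
SELECT *
FROM (
    SELECT s.*,
           LAG(dss_change_hash) OVER (PARTITION BY hub_customer_hash_key
                                      ORDER BY dss_load_datetime) AS prev_hash
    FROM stage_customer AS s
) AS x
WHERE x.prev_hash IS NULL
   OR x.prev_hash <> x.dss_change_hash;
```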
Compare last saved row with first new row
An additional step is to compare the last saved row with the first new row. This happens only in tables with a timeline column, usually satellites or history tables.
The result after this step is that the data set is now cleaned, useful, and ready to be inserted into the target, at least from a data point of view.
The history columns can be deferred to a view. Then we have an
INSERT-only system, which can bring a huge performance gain. The question is whether we can wait for the timeline calculation in the view to finish. If not, we should persist the history columns.
Column names for the first timeline: the calculation is based on the LoadDate.
Column names for the second timeline: the calculation is based on the EventDate.
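A sketch of deferring the LoadDate timeline to a view, so the table itself stays INSERT-only. View, table, and column names are assumptions; the EventDate timeline works the same way with its own column.

```sql
-- End date of each satellite row = start (LoadDate) of the next row,
-- or a high date for the current row.
CREATE VIEW v_sat_customer_timeline AS
SELECT s.*,
       s.dss_load_datetime AS dss_start_datetime,
       LEAD(s.dss_load_datetime, 1, '9999-12-31')
           OVER (PARTITION BY s.hub_customer_hash_key
                 ORDER BY s.dss_load_datetime) AS dss_end_datetime
FROM sat_customer AS s;
```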
For templates, not all columns are needed. Just add those which you need.
I know this process fails when we are talking about bi-temporal objects. I first need to learn how to build and use them.
This step brings all the data from the previous steps together and adds the columns, which were missing in the extract layer.
Where (Not) Exists
This step was the most difficult one. Not because of detecting what exists and what doesn’t, but because of where to place it.
If I put it right after the extract layer, a full load would be impossible: all existing records are removed from the data set, and the history calculation is therefore rubbish. Delta load has problems as well. What about loading data in between existing LoadDates? The whole row-condensing step would produce wrong results.
What about placing it after the row-condensing layer? That seems to me the best place. I could save time calculating the history layer, as I’m not going to update an existing row anyway. But I would need to
UPDATE the timeline after the
INSERT, and any single
UPDATE costs me about 10 times more than a proper INSERT.
And where do we have history calculations anyway? Only with satellites and the data store. Therefore, it can be placed at the last place possible.
What is new and what is existing? These are the rules for detecting it:
| Target Type | Existence Check |
|---|---|
| Data Store | Business Key, LoadDate and Sequence No |
| Hub | Hub Hash Key and Business Key (the unique constraint will tell if there is a hash collision) |
| Link | Link Hash Keys, Hub Hash Keys and peg-legged Business Keys (the unique constraint will tell if there is a hash collision) |
| Satellite | Hub Hash Key or Link Hash Key, plus LoadDate |
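The satellite rule from the table above, as a minimal sketch with hypothetical names:

```sql
-- A stage row is new if its hash key + LoadDate combination
-- is not yet stored in the satellite.
SELECT s.*
FROM stage_customer AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_customer AS t
    WHERE t.hub_customer_hash_key = s.hub_customer_hash_key
      AND t.dss_load_datetime     = s.dss_load_datetime
);
```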
Which objects could need an
UPDATE statement, maybe not only in delta but also in full load?
| Target Type | UPDATE? |
|---|---|
| Data Store | No, there can't be any changes in the data. Otherwise the LoadDate is used wrongly. |
| Hub | No. If it is of interest when a business key became available, we can update the LoadDate if an earlier one arrives. |
| Link | No. If it is of interest when a business key became available, we can update the LoadDate if an earlier one arrives. |
| Satellite | No. If by accident there are different Change Hashes for a record which already exists, the whole calculation in stage must have changed. Then the whole satellite should be reloaded instead. |
DELETEing the record could be a faster option than UPDATEing it.
Everything else is going to be INSERTed.
Post-Process Update Timeline
In that step we are going to update the timelines of our target. By joining to the stage table, we only need to update the changed timelines per Business Key/Driving Key.
At first we also had a Pre-Process Update step in mind, which would outdate old data before the insert. This works if we run the
INSERT batch by batch. But as we don’t know whether there will be any inserts at all after the “Row Condensing” layer, we must omit that pre-processing step.
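The post-process step could look like this sketch, which recalculates the persisted end dates only for the keys touched by the current load. All names are assumptions.

```sql
-- Recalculate the end date (LEAD of the next LoadDate) for every
-- hash key that appears in the stage table, and update only rows
-- whose end date actually changed.
UPDATE t
SET t.dss_end_datetime = x.new_end_datetime
FROM sat_customer AS t
JOIN (
    SELECT hub_customer_hash_key,
           dss_load_datetime,
           LEAD(dss_load_datetime, 1, '9999-12-31')
               OVER (PARTITION BY hub_customer_hash_key
                     ORDER BY dss_load_datetime) AS new_end_datetime
    FROM sat_customer
    WHERE hub_customer_hash_key IN
          (SELECT DISTINCT hub_customer_hash_key FROM stage_customer)
) AS x
  ON x.hub_customer_hash_key = t.hub_customer_hash_key
 AND x.dss_load_datetime     = t.dss_load_datetime
WHERE t.dss_end_datetime <> x.new_end_datetime;
```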
Above you see 2 images. In my actual data flow they are not separated; it is just one huge SELECT statement incorporating every rule. Of course, empty layers are not shown.
Now, I can just throw any data for any target type at it. By designing it in layers, I’m flexible in adding new rules when I discover them. I’m also able to take out each layer in a separate temporary table to speed up processing time, if I figure out that this way would be faster.
Writing down my thinking gave clarity to the desired outcome. If you want to discuss this model or share your experience with me, please get in touch.