Universal Loading Pattern

While working on the templates I published recently, I came across similar-looking patterns between object types. E.g. the persisted history of a satellite looks the same as in a persistent staging area (except for the primary key, of course). Or loading a dimension is like loading a hub (just that hubs have no change hash). So, a dimension loading pattern would nearly work for hubs.

I concluded that instead of having 7 templates (one for each object type), 2 would be enough: one for stage and one for any target. They all share similar patterns.

While compiling all the patterns, I came to the conclusion that even 1 template would be enough. A universal loading pattern, usable for any target type!

As we are loading data with stored procedures generated with WhereScape, read the following captions as sub-queries. In that way each layer adds value to the result set. This also makes maintaining the template code much easier. SQL Server will find the best way to execute them.

I separated them into stage and target loading for easier reading. But in reality, it is just one big data flow.

Stage Loading Process

What would the loading pattern for a stage object look like?

Only a little has changed from my first draft.

Extract & Transform

The only user-generated logic is defined here: a simple SELECT statement plus some metadata about what to save in addition, e.g. hash keys, change hashes or dimension keys. Everything afterwards is automated and serves data preparation.
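
As a minimal sketch, an extract sub-query could look like the following. The source table and column names are made up; the metadata columns are added by the later layers.

SELECT
    c.customer_code,            -- business key
    c.customer_name,            -- descriptive attribute
    c.last_updated_datetime     -- could serve as EventDate later on
FROM source_crm.dbo.customer c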

Delta Loading

[themify_box icon="highlight"]Target Object Name is defined[/themify_box]

A stage table can optionally have a link to its target. With that in place, delta loading is possible and adds the following WHERE clause:

WHERE dss_load_datetime > (SELECT COALESCE(MAX(dss_load_datetime),'0001-01-01') FROM [Target])

CAST to Target Data Type

[themify_box icon="highlight"]Has Hash Columns[/themify_box]

Hashing is a beast of its own. It looks trivial, but can still be tricky. Just recently we figured out that we would need to add another step in between: casting to the target data type. E.g. the source column is of data type datetime and the target column is date. Of course, nobody does that ;-)! Hashing without casting would mean that it hashes e.g. '2018-03-08 12:13:14' instead of just the date '2018-03-08'. Any rehashing later will fail. Of course, it should have been taken care of in the transformation, but other examples can be found. This step just turns the invisible implicit conversion into a visible explicit conversion before the hashing step.

Another way to model it is to create a temporary table with the same data types as the stage table and persist the result before going into the hashing function. In delta loading this is faster anyway, because SQL Server usually comes up with odd execution plans otherwise.
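
A minimal sketch of the explicit cast, with made-up column names: the cast mirrors the data type of the target column, so the hash is computed on the value that will actually be stored.

SELECT
    e.customer_code,
    CAST(e.last_updated_datetime AS date) AS last_updated_date   -- target column is date, source is datetime
FROM extract_layer e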

Clean Business Key for Hash Keys (other Business Keys are not touched)

[themify_box icon="highlight"]Has Hash Columns[/themify_box]

There are many options out there. I decided to go with the following:

CHAR Data Types   ISNULL(NULLIF(UPPER(LTRIM(RTRIM(source_column))),''),'-1')
INT Data Types    ISNULL(source_column,-1)
DATE Data Types   ISNULL(source_column,'0001-01-01')

For a CHAR column, any leading and trailing spaces are removed and the value is set to upper case. If it is then empty, I set it to NULL, and every NULL value is set to '-1'.

If an INT column is NULL, set it to -1. 0 (zero) is a good value; no need to NULL it.

Compute Hash Keys and Change Hash

[themify_box icon="highlight"]Has Hash Columns[/themify_box]

Hash Key:
HASHBYTES('md5',
    CAST(column_1 AS NVARCHAR(MAX)) +'||'+
    CAST(column_x AS NVARCHAR(MAX))
)

Change Hash:
HASHBYTES('md5',
    ISNULL(CAST(column_1 AS NVARCHAR(MAX)),'') +'||'+
    ISNULL(CAST(column_x AS NVARCHAR(MAX)),'')
)

Casting everything to NVARCHAR(MAX) is the easiest approach. NVARCHAR has the best implicit conversion from any data type to string. Don't ever cast to VARCHAR(MAX) only: as long as there is no true UTF-8 character set (with variable byte length) for VARCHAR, stick with UTF-16 in NVARCHAR. The results look clean. Only DATETIME and one other data type (which I can't remember) look ugly after the conversion. But as I use only DATETIME2, I don't come across that.

Dimension Lookup

[themify_box icon="highlight"]Has Dimension Columns[/themify_box]

Just looking up the surrogate key.
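
As a sketch (the dimension and key names are made up), the lookup is an outer join that falls back to a zero key when no member is found.

SELECT
    s.*,
    ISNULL(d.dim_customer_key, 0) AS dim_customer_key   -- 0 = unknown member
FROM previous_layer s
LEFT JOIN dim_customer d
    ON d.customer_code = s.customer_code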

Row Condensing

[themify_box icon="highlight"]Business Key is defined[/themify_box]

Removing true duplicates. Be aware. REALLY! Wait until you hit a duplicate and talk to the owner of your source system! Only implement it then, if they can't fix it.
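
If it really has to be implemented, a sketch could look like this (the business key and ordering columns are made up); it keeps only the most recent duplicate per key.

SELECT *
FROM (
    SELECT
        s.*,
        ROW_NUMBER() OVER (
            PARTITION BY s.customer_code            -- business key
            ORDER BY s.last_updated_datetime DESC   -- keep the most recent duplicate
        ) AS dss_row_no
    FROM previous_layer s
) x
WHERE dss_row_no = 1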

Load

Prepare everything from the previous steps and sort the columns into the right order for the INSERT statement.

Post-Process Update Dimension Key

[themify_box icon="highlight"]Has Dimension Columns[/themify_box]

If a business key in the stage table for a fact object doesn't exist in the dimension, we can add it there, if this is a desired outcome.

After INSERTing the missing business key into the dimension, we need to update the stage table.
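
A rough sketch with made-up names: first insert the missing members as late-arriving rows, then back-fill the freshly created surrogate keys into the stage table.

-- insert missing business keys as late-arriving dimension members
INSERT INTO dim_customer (customer_code, customer_name)
SELECT DISTINCT s.customer_code, 'Late arriving'
FROM stage_sales s
WHERE NOT EXISTS (
    SELECT 1 FROM dim_customer d
    WHERE d.customer_code = s.customer_code
);

-- back-fill the new surrogate keys into the stage table
UPDATE s
SET s.dim_customer_key = d.dim_customer_key
FROM stage_sales s
JOIN dim_customer d
    ON d.customer_code = s.customer_code
WHERE s.dim_customer_key = 0;   -- 0 = unknown member from the lookup step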

Target Loading Process

The target loading can be as simple as having a short INSERT statement.

These are the differences between the target object types:

  • Persist the timeline calculation? (e.g. having an EndDate in the table or calculating it through views)
  • Update existing rows or insert new rows? (e.g. Dimension vs. Satellite)
  • Is it allowed to insert "in-between" data? (e.g. Satellite)

Row Condensing

Here the real magic happens, and every target type is different. Now is the right time to figure out if we need to persist all the rows.

Target Type   Rule
Stage         We need only the last record by Business Key (if it is specified)
Data Store    We need only changes in rows sorted by LoadDate
Hub           We need only the first record by Hub Hash Key
Link          We need only the first record by Link Hash Key
Satellite     We need only changes in rows sorted by LoadDate and sorted by EventDate
Dimension     Dimensions without SCD should already be unique in stage. No condensing.
Fact          It should already be unique by Business Key in stage. No condensing.
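
For example, a sketch of the Satellite rule (keep only changes, ordered by LoadDate), with made-up names:

SELECT *
FROM (
    SELECT
        s.*,
        LAG(s.dss_change_hash) OVER (
            PARTITION BY s.hub_customer_hash_key
            ORDER BY s.dss_load_datetime        -- add the EventDate here for the second timeline
        ) AS previous_change_hash
    FROM stage_customer s
) x
WHERE previous_change_hash IS NULL               -- first row per key
   OR previous_change_hash <> dss_change_hash    -- or an actual change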


Compare last saved row with first new row

[themify_box icon="highlight"]Has StartDate Column[/themify_box]

An additional step is to compare the last saved row with the first new row. This happens only in tables with a timeline column, usually Satellites or History tables.
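
A rough sketch of the idea, with made-up names (new_rows stands for the result of the previous layers, sat_customer for the target): drop the first new row per key if its change hash equals the change hash of the most recent row already saved.

SELECT n.*
FROM new_rows n
OUTER APPLY (
    SELECT TOP 1 t.dss_change_hash
    FROM sat_customer t
    WHERE t.hub_customer_hash_key = n.hub_customer_hash_key
    ORDER BY t.dss_load_datetime DESC
) last_saved
WHERE last_saved.dss_change_hash IS NULL                        -- key not saved yet: keep everything
   OR n.dss_change_hash <> last_saved.dss_change_hash           -- data changed: keep
   OR n.dss_load_datetime > (SELECT MIN(n2.dss_load_datetime)   -- not the first new row for this key: keep
                             FROM new_rows n2
                             WHERE n2.hub_customer_hash_key = n.hub_customer_hash_key)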

The result after this step is that the data set is now cleaned, useful and, from a data point of view, ready to be inserted into the target.

History Columns

[themify_box icon="highlight"]Has History Columns[/themify_box]

The history columns can be deferred to a view. Then we have an INSERT-only system, which can be a huge performance gain. The question is whether we can wait for the calculation of the timeline in the view to finish. If not, we should persist the history columns.

Column names for the first timeline; the calculation is based on the LoadDate:

  • dss_start_datetime
  • dss_end_datetime
  • dss_version
  • dss_current_flag

Column names for the second timeline; the calculation is based on the EventDate:

  • dss_effectivity_datetime
  • dss_expiry_datetime

For the templates, not all columns are needed. Just add those you need.
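
As a sketch of the deferred (view) variant for the first timeline, assuming a satellite with a made-up hash key column:

SELECT
    s.*,
    s.dss_load_datetime AS dss_start_datetime,
    ISNULL(LEAD(s.dss_load_datetime) OVER (PARTITION BY s.hub_customer_hash_key
                                           ORDER BY s.dss_load_datetime),
           '9999-12-31') AS dss_end_datetime,
    ROW_NUMBER() OVER (PARTITION BY s.hub_customer_hash_key
                       ORDER BY s.dss_load_datetime) AS dss_version,
    CASE WHEN LEAD(s.dss_load_datetime) OVER (PARTITION BY s.hub_customer_hash_key
                                              ORDER BY s.dss_load_datetime) IS NULL
         THEN 'Y' ELSE 'N' END AS dss_current_flag
FROM sat_customer s
-- the second timeline works the same way, ordered by the EventDate instead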

I know this process fails if we are talking about bi-temporal objects. I first need to learn how to build and use them.

Load

This step brings all the data from the previous steps together and adds the columns which were missing in the extract layer.

Where (Not) Exists

This step was the most difficult one. Not because of detecting what exists and what doesn't, but because of where to place it.

If I put it after the extract layer, it would be impossible to do a full load: all existing records would be removed from the data set and the history calculation would therefore be rubbish. There are problems in delta load too. What about loading data in between LoadDates? The whole row condensing step would produce wrong results.

What about placing it after the row condensing layer? That seems to me the best place. I could save time calculating the history layer, as I'm not going to update an existing row anyway. But I would need to UPDATE the timeline after the INSERT, and any single UPDATE costs me 10 times more than a proper INSERT.

And where do we have history calculations anyway? Only with satellites and the data store. Therefore, it can be placed as late as possible.

What is new and what is existing? These are the rules for detecting it:

Target Type   Definition
Data Store    Business Key, Load Date and Sequence No
Hub           Hub Hash Key and Business Key (the unique constraint will tell if there is a hash collision)
Link          Link Hash Key, Hub Hash Keys and peg-legged Business Keys (the unique constraint will tell if there is a hash collision)
Satellite     Hub Hash Key or Link Hash Key, and LoadDate
Dimension     Business Key
Fact          Business Key
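
As a minimal sketch for a Satellite (the names are made up), the check could look like this:

SELECT n.*
FROM new_rows n
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_customer t
    WHERE t.hub_customer_hash_key = n.hub_customer_hash_key
      AND t.dss_load_datetime     = n.dss_load_datetime
)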

Which objects could have an UPDATE statement, maybe not only in delta load but also in full load?

Target Type   UPDATE allowed?
Data Store    No. There can't be any changes in the data; otherwise the LoadDate is used wrongly.
Hub           No. If it is of interest when a business key became available, we can update the LoadDate if there is an earlier one.
Link          No. If it is of interest when a business key became available, we can update the LoadDate if there is an earlier one.
Satellite     No. If by accident there are different Change Hashes for a record which already exists, the whole calculation in stage must have changed. Then the whole satellite should be reloaded instead.
Dimension     Yes!
Fact          Yes!

For Facts, DELETEing the record could be a faster option than UPDATEing it.

Everything else is going to be INSERTed.
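
For a Dimension, a sketch of the UPDATE branch could look like this (table and column names are made up); rows whose business key is not found are INSERTed instead.

-- update existing dimension members whose attributes changed
UPDATE d
SET d.customer_name       = s.customer_name,
    d.dss_change_hash     = s.dss_change_hash,
    d.dss_update_datetime = s.dss_load_datetime
FROM dim_customer d
JOIN stage_customer s
    ON s.customer_code = d.customer_code
WHERE d.dss_change_hash <> s.dss_change_hash;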

Post-Process Update Timeline

[themify_box icon="highlight"]Has History Columns for a timeline (StartDate only is not a timeline)[/themify_box]

In this step we update the timelines of our target. By joining to the stage table, we only need to update the changed timelines per Business Key/Driving Key.

At first we also had a Pre-Process Update step in mind, where we would outdate old data. This works if we run the INSERT batch by batch. But as we don't know if there will be any inserts at all after the "Row Condensing" layer, we must omit that pre-processing step.
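
A rough sketch of such a post-process update for a satellite timeline, with made-up names; only keys that appear in the stage table are touched.

UPDATE t
SET t.dss_end_datetime = ISNULL(nxt.dss_load_datetime, '9999-12-31'),
    t.dss_current_flag = CASE WHEN nxt.dss_load_datetime IS NULL THEN 'Y' ELSE 'N' END
FROM sat_customer t
OUTER APPLY (
    SELECT TOP 1 t2.dss_load_datetime         -- the next version of this key, if any
    FROM sat_customer t2
    WHERE t2.hub_customer_hash_key = t.hub_customer_hash_key
      AND t2.dss_load_datetime > t.dss_load_datetime
    ORDER BY t2.dss_load_datetime
) nxt
WHERE EXISTS (SELECT 1 FROM stage_customer s
              WHERE s.hub_customer_hash_key = t.hub_customer_hash_key);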

Conclusion

Above you see the two processes. Actually, in my data flow they are not separated: it is just one huge SELECT statement incorporating every rule. Of course, empty layers are not shown.

Now, I can just throw data for any target type at it. By designing it in layers, I'm flexible in adding new rules when I discover them. I'm also able to take each layer out into a separate temporary table to speed up processing time, if I figure out that this would be faster.

Writing down my thinking gave clarity to the desired outcome. If you want to discuss this model or want to share your experience with me, please get in touch.