Aug 30 2019
There was quite some noise, when I published last year my blog post about Universal Loading Pattern, a common view how to process data to any target. Be it a Data Vault Hub or a Fact table.
Since then I have worked with this pattern happily without any incidents.
Some time ago I thought, well, are these patterns still true? Did I forget a pattern? Or has my view towards these patterns changed? This falls fully in line being smart in accordance to Jeff Bezos. Before I read it, I fixed stuff „stealthily“. I didn’t want to appear not knowing it beforehand.
I restarted at the drawing board and have been pondering about my patterns. And then I waited… until some inspiring thoughts appeared. On the way to work, in a lunch break, in the night in the bed (where I should rather sleep than thinking about issues) or while feeding my newly born baby.
And I came up with some changes and surprises. And I believe that the following pattern are so easy to understand, that it is a shame that I didn’t came up with earlier. But that’s usual. First digging into it, exploring the edges, finding common ground to work with and then summarizing up.
In my last post I split up the pipeline into 2 parts for easier consumption. In one of my meetups I learnt that to view it that way is not common. It was caused by the way how WhereScape works.
Here I present the enhanced vendor neutral Universal Loading Pattern. Before we start, do you think Data Vault, Dimensional Modeling or Anchor Modelling are so different? They are just flavors to work with data! I believe I can load them all with this pattern.
Source Query Automation
Here are the bits and pieces, to help me with writing a query and taking care about all those nifty little details. But for that to work, we need a query.
Extract & Transform
This is the place where my input to the processing of my data starts. A simple
SELECT statement with
WHERE clauses. If I’m happy with the content and the calculation, I use it.
Now the interesting stuff starts.
For any of my targets I try to load and process only delta data. But I’m too tired to write that in my origin query. And I have an option implemented, to do a full load if I want to. I need to be flexible.
If I structure all the layers as sub queries, I can specify on the next layer just the following statement. SQL Server will still make the best of it and will apply the filter already to the origin query.
WHERE dss_load_datetime > (SELECT COALESCE(MAX(dss_load_datetime),'0001-01-01') FROM [Target])
This works best for views. If I have a stored procedure, I would load the result of the above statement into a variable
@last_dss_load_date and use it instead. A lot faster.
Source to Target Data Type mapping
This layer is only important if you care about data type consistency. Like in programming when you define variables with a specific type.
The mapped columns between source and target doesn’t necessarily have to be of the same data type. And if I have transformations in my source query, it is hardly known which data type I just created. Except you specify that in the source query. But I’m lazy about that. The system should fix that.
Any platform tries to avoid producing errors while inserting data into a table. They silently ignore the fact, that the data type don’t match. This is called implicit conversion. Have a look at the rules, what SQL Server does.
In this layer I create an explicit data type conversion. If I have transformations, I
CAST it to the target data type. If the source and the target data type don’t match, I use some of the rules from the implicit conversion where I believe they are safe. Anything else throws an error while creating a view or stored procedure.
Last time I was thinking, yeah, a business key is just a business key. Yeah, there are some rules. I know that they are important. But I didn’t know that they are THAT important.
This time I want to spend more time elaborating about business keys.
Business Keys are really fundamental! A table without business keys is really hard to understand and work with. Until now I have been cleaning business keys only for hubs as suggested by Dan Linstedt. If we load data later to dimensions and facts, the business keys for them are already cleaned.
Sometimes I don’t build a data vault model but going straight from my Data Store (=Persistent Staging Area) to Dimensions & Facts. Having a Data Store gives me the freedom to do that and later adding a Data Vault when the business rules are getting too complex and loading time is slow.
I changed my view that not only data vault objects should have their business keys cleaned, but all my tables: Data Stores, Hubs, Links, Satellite, Dimensions and Facts.
But wait, are we not going to change the data which it consists? Shouldn’t be the Data Store equal than the source system? If the source systems
CHAR column has mixed cases, shouldn’t we preserve it?
I make strong distinctions of business key columns and data columns. All columns of a source table are by definition data columns. I would derive my business key column from a data column. Business Key columns are here to ensure uniqueness in my target objects. But if the business key column is of type
INTEGER, there is no need to duplicate it. If you want to preserve a mixed-case
CHAR column, duplicate and preserve it.
Just a side note: If the mixed-case business key column should be preserved, we would have anyway a problem. Usually database joins don’t make a distinction of mixed cases. Then I would need to specify a collation with „ignore cases“ being turned off or using a binary collation.
So, how are business keys cleaned? The goal is, to get the value as close as possible to
NULL. So far I only clean
CHAR columns. Remove any spaces. If the data in the column is entered manually by a user and the application doesn’t check for clean data, I must even search for tabs, returns and other strange characters. The goal is to get the essence of the column value. Only human readable characters. A
CHAR column should also avoid having mixed cases. I set them all upper case.
My statement is:
The benefit of trying to
NULLify everything is, that the business key column has a constraint of
NOT NULL. Therefore, if I hit a
NULL it will crash and I will be safe of bad data.
Usually people see the business key cleaning and the following zero key replacement as a minor distinction. In this step I focus on the cleanliness of business keys. If the lineage from the source system via a data store and now to the target is kept, I clean it only once!
Zero Key Replacement
Here I „complete“ unknown data to ensure the working of my target. Data Stores, Hubs, Links, Satellites, Dimensions, Facts have different requirements. And all have Primary Keys. Primary Key values with
NULL don’t work. And asking somebody to bring me product „
NULL“ usually confuses people.
In the above step we might hit now a business key having a value of
NULL. What do we do with it when the target tells me that it shouldn’t be
The book says for Data Vault to replace it with „-1“. „-1“ is just the value to represent „unknown“.
Let’s take a step back. Is Data Vault the only model which has
NULL values? Fact tables referencing Dimensions could have
NULL values too. Can we formulate a rule for that?
I used to use „-1“ everywhere in my templates. But I got into problems because of different loading patterns. The right thing to do is, to lookup the value on the linked objects business key meta data. Hubs and Dimensions are the objects, for which we can define zero keys. Any other object references to them to find the right zero key. In WhereScape I can define a zero key for any dimensional column. I will lookup that value and use it as a zero key. For hubs it doesn’t exist, but I would do that too, if it would exist.
Here are my self-defined „unknown“ replacement values, if the linked objects business key has no zero key defined:
Unknown LoadDates are getting
'0001-01-01' so they will update when a record appears.
In the Date Dimension I use
'9999-12-31' for the zero key, so it is shown next to the newest data.
As you can see, for
TIME the unknown value falls into normal values. It is not possible to find values outside the scope.
DECIMALs are used for money values, quantities etc. Having a „-1“ is very dangerous.
Roelant Vos has written a blog post about more flavors of unknown. Maybe this is something more for your liking.
Having 2 distinct rule sets for cleaning business keys, we could still hit a
NULL value. How could that be? My fact tables are having also business keys defined. But not all business key columns have a related dimension. E.g. a line number of a sales order.
The result is now, that we have nice-looking business keys. Great!
In my previous model I had a layer to calculate hash keys and a layer to lookup dimension keys.
While pondering over the layers, a thought appeared: What do hash keys and dimension keys have in common? And how does a natural key fit into this picture?
Actually, they are all surrogate keys. A surrogate key is a key to represent a business key for further usage. To make it easier and faster.
But what type of surrogate keys do exist? Wikipedia lists the following characteristics for a surrogate key:
- the value is system generated
- the value is not manipulable by the user or application
- the value contains no semantic meaning
- the value is not visible to the user or application
Surrogate keys in practice just lists sequence keys with different data types (
GUID, etc.) as valid options.
I would say, Dan Linstedt invented a new surrogate key in the form of a string.
When searching the web, I didn’t find more types of surrogate keys. So, what are the options:
System-generated Surrogate Key (e.g. Sequence Key)
If we have a Fact object as a target, we want to join Dimensions to it. By default, we can join on business keys saved in both objects. If Dimensions and Facts are Views on top of a Data Vault, this will be definitely that case as it is impossible, to include a system-generated sequence key into a view.
If we have tables, we can gain some performance when we add an auto-generated sequence key on the dimension and use that value in the fact object. Instead of a multi-column
CHAR join, we could have now a four-byte
INTEGER really fast join option.
Theoretically you can also use
GUID or whatever data type is auto generated in your system. But
INTEGER is the fastest option.
This was also the idea and option in Data Vault 1.0. Being really fast on joins. There the surrogate key of hubs and links is also a sequence key. The systems used at the beginning of this centenary were not that fast as today’s systems. It made sense to have something like that.
User-generated Surrogate Key (e.g. Natural Key for Hubs and Links)
Instead of looking up the sequence key, Dan Linstedt improved the model to load data in parallel to the target. Everybody is speaking about the variation which consists of Hash Keys and so did I.
But in the last year I learnt more about the patterns behind it. There are platforms for which a hash key is not suggested. (e.g. Teradata has its own mechanism to treat business keys)
Therefore, it makes sense to focus more on the part which is shared on all platforms and what Dan really meant about the improvement.
Instead of having a sequence key we would need a concatenated string of all business keys separated by a separator. Just a string without hashing.
It took me some time to understand what the implications for that is. How it is saved on a hub, how it moves to the link to form the natural link key which then is used in satellites. All as strings.
This confirmed my resolution to see this as the real surrogate key for hubs and links. A hashed version of it is just for saving space, having a consistent data type (e.g.
BINARY(16) for MD5) and faster joins.
Although in Dimensions a sequence key is usually used, you could use also natural keys. Inserting data into Dimensions and Facts in parallel as in Data Vault. Do you see the intriguing similar pattern? Dan Linstedt has improved the dimensional model into the new model Data Vault. The patterns are the same. And I believe when I will dig into Anchor Modeling, I will discover the same loading pattern as shown here. Nothing new. To improve your fact and dimension views on top of a Data Vault, generate a natural key as a dimension key. Should give a better join performance.
This is truly a Universal Loading Pattern. Give me your flavor of modelling and I will load it.
To summarize, surrogate keys are just representing the business key and are used to improve performance and to make it easier to work with multi-column business keys.
There is nothing we can or should do with data columns. All manipulations are in the origin source query in the first section of the pipeline.
Calculate Change Diff
The only thing what we can do is, to improve the loading speed to the target. While inserting the data we need to know if a row is a change to the existing row or not.
We can compare data in a huge
WHERE clause, where we compare each column with the target.
Or we concatenate all column values into one string. Then the
WHERE clause is just a one-liner.
This pattern can be used for all objects auch as Data Store, Data Vault and Data Mart.
Of course, saving a huge string is a waste of space. The next section will help with that.
I thought hashing is the most important part in Data Vault. Actually, we need it only because of technical deficiencies of the platform.
In the pipeline we have now data columns, cleaned business key columns and surrogate keys.
Now comes the part to improve the performance (or decrease it, as hashing takes its toll on speed too). Hashing is only there to improve things. Not because we need it! If things get bad, don’t hash.
Instead of using natural keys as long strings, we could calculate and use a „representation“ value of that. A representation! Hashing would be an ideal way to accomplish that, as it transforms data of arbitrary size into a fixed size.
We can hash only the natural keys and change diffs.
That’s it! Nothing more to write. Stupid boring hashing.
By the way, have you noticed the multiple use of „representation“? A business key is represented by a surrogate key which is further represented by a hash value. Just to make it easy and fast to work with data.
So far, we talked only about one data set, one source query. What about having multiple data sets which are loaded to the same destination?
Here is the layer to merge them all into one data set before continuing to the next steps.
All steps before were source oriented. Now we change view to the target. What do we need to do, to load data to the target?
Which data is necessary to preserve and what can we omit?
Depending on the target object type, we can filter out now unneeded data. The target object tells us what it needs.
|We need only changes in rows partitioned by Business Key and sorted by LoadDate and SequenceNo
|We need only the first row partitioned by Hub Hash Key and sorted by LoadDate
|We need only the first row partitioned by Link Hash Key and sorted by LoadDate
|We need only changes in rows partitioned by Surrogate Key and sorted by LoadDate (and sorted by EventDate, if we don't need to persist every event)
|Dimensions should be already unique. No condensing.
|Facts should be already unique. No condensing.
Ok, people are using
DISTINCT for hub loading. But this works only for a single batch, a single LoadDate. If you load data in a full data set as I do, we need somehow to get 1 row out. Sorting it is the easiest form.
If you want to learn more about row condensing and more advanced forms, check Roelant’s blog.
Compare last saved row with first new row
When we load data incrementally to Data Stores or Satellites, it could happen, that a particular row with the same change hash is already present in the target.
Therefore, we can filter out all already existing rows in the target.
We have now a clean and relevant data set, ready to get inserted.
Data Vault evolved in the meantime into an insert-only model. All timeline calculations are deferred to views generated on top of a satellite.
As you know, I try to make my work as easy as possible. For creating views, I use this very same pipeline we have been discussing so far. If any of my targets has any of the specific timeline columns, it kicks in. No matter if the object is a table or a view.
Therefore, let’s discuss, how a timeline is built.
I maintain 2 timelines. One is based on the LoadDate, the other on EventDate.
For LoadDate I have the following columns:
For EventDate I have the following columns:
Currently I have no template to build bi-temporal objects. Hopefully… some when…
In a view we just use the calculation. In a table if one of the following columns exists, we need to
UPDATE outdated rows after
INSERTing new data to a table:
WHERE (NOT) EXISTS
Now comes the last test. If the data to get inserted already exists, I don’t want to cause any primary key violation.
What are the rules to detect existing rows:
|Unique Constraint Columns
|Business Key, Load Date and Sequence No
|Business Key or Natural Key
|Surrogate Keys from Hubs
|Surrogate Key (from Hub or Link) and LoadDate
The unique constrain can be created with the table and this is the true check for existing data.
But… Database systems might not behave as expected. In a
WHERE clause, a database system might try to match characters with each other like SQL Server does. It will not find a difference in „ss“ and „ß“. Set a binary collation in the
WHERE clause or create the Business Key
CHAR columns with a binary collation.
Don’t use hash keys for comparison. An accidental hash collision will be skipped and you are bereft of telling a story that you found one. The primary key constraint violation on the hash key will tell you that you found one. I believe, winning in a lottery is easier than finding a hash collision.
If you load Hubs and Links separately from Satellites (like here), this step can happen already after the Business Key cleaning. No need to go into the slow hashing function.
INSERT, UPDATE, DELETE
When we know now, if the data already exists or not, we can fire now the commands to change the data.
|Skipped. There can't be any changes in data.
|Skipped. A business vault satellite should rather be truncated and reloaded, if a business rule changes.
UPDATE (depending on the way you work with facts and what is faster)
In the diagram above you see some points in the pipeline. The whole pipeline could work without interruption if your database system can handle all the rules in a timely manner.
I discovered to persist data into a temp table before hashing, increases performance. SQL Server has sometimes odd behaviors. Calculating 100m hashes and then taking the newest 100.000 in a delta load. Very bad.
The next point is the turning point from source to target. Having a temporary object helps with loading several targets such as Hubs, Links and Satellites in parallel.
It can help also with the next step: To Update timelines.
Post-Process Update Timeline
This step is only needed, if we have persisted timeline end dates. The end date of an existing record changes from ‚9999-12-31‘ to the start date of the new record.
If we have a temporary object from above, we only need to update the changed timelines for a given Business Key. All others can be skipped.
Quite a change from my last blog post about Universal Loading Pattern. It has a lot more context now.
For me the biggest break-through was to emphasize on Business Keys and Surrogate Keys for any given object. Everything else will follow. And using hashes is just a technical necessity. Nothing else.
I hope it will help somebody designing loading procedures for any Data Warehouse Automation solution. Or at least being a checklist, if you have thought about all transformation rules to load data to a target.