Okt 21 2018
The best invention in my Business Intelligence career is the adoption of the concept of having a LoadDate.
This should be a post to give credit to Dan Linstedt and thank him for inventing Data Vault and introducing LoadDate.
When I encountered the hub-and-spoke concept of Data Vault I thought, why on earth do I need a LoadDate? Combining in a Satellite the HubHashKey with the LoadDate to form the Primary Key makes sense. But why would I need it in a Hub or a Link? Just to know when a Hub entry was first introduced to the data warehouse? The usage is universal and far reaching!
What is the LoadDate? The definition is:
„The LoadDate is the date and time of insert for the record“
It is the moment when a record is first inserted into a Data Warehouse. It is not the extract date from the source or the current datetime in the source system. It is really the
INSERT datetime in my Data Warehouse! The first time a record exists.
How to ingest it:
- Save the value in a variable and insert it with the data set into the load table
- Set a default constrain like
SYSDATETIME()on the LoadDate column
What about when the insert takes a long time with a lot of data? Do I get a LoadDate for every second it loads? This would be very precise. In my world one LoadDate is one batch. One batch could be the load of one file. The next file has a new LoadDate. When loading data in a loop of something, each loop has a new LoadDate.
When the record has been inserted into the load table, the LoadDate stays with the record. Where ever it goes! Like a tracking marker.
The LoadDate is used in:
- Data Store
I use it everywhere!
In a Data Store (aka Persistent Staging Area) I use it to order the data and batch-load the data to the target. Together with the SequenceNo it forms also the Primary Key.
In Link and Hub I know when the first appearance of a business key happened.
In Raw Vault Satellites I use it to load data incrementally from the Data Store.
In Business Vault Satellites where multiple source tables are joined together, which LoadDate succeeds? If history is important, I use it to create mini-Point-in-Time tables. If history is not important, I use the most recent record and calculate the
MAX value of all LoadDates.
SELECT * ,( SELECT MAX(v) FROM (VALUES (table_1.dss_load_date) ,(table_2.dss_load_date) ,... ) AS VALUE(v) ) AS dss_load_datetime ...
In that way I can use delta loading and it fits perfectly to my Universal Loading Pattern.
How can I use it in Dimensions & Facts? The main question in those objects is, when do I need to
UPDATE columns with new data? When I was researching the best way, I used first to calculate a ChangeHash. Very time consuming! After detecting the formula above, I thought, why not use the LoadDate for change detection? If for a given Business Key the LoadDate changes, chances are high that the record needs an
With this approach I’m much faster now. And I’m also faster in preparing the data in a stage table as I can use the LoadDate for the delta loading layer too.
Are there any other similar concepts? In SQL Server one could use rowversion (altered from timestamp). It is a system-wide incremental number saved as a binary. But copying data into the next table will create a new rowversion except you extract the value and save it as a normal value. In my view not easy to handle and not readable. How do you know, when record 1234 was inserted?
Dan Linstedt: Thanks for introducing LoadDate to the Business Intelligence Community and forcing users to use it correctly. It is of tremendous help! I use it in batch loading, delta loading and change detection (only in Dimensions & Facts). I couldn’t do without it anymore. Thanks a lot!