Dec 16 2017
Finally, we reach our goal of talking about Data Vault objects. This is a new area I’m working in and learning about, and you might be interested in it as well. Data Vault consists of mainly 3 objects: hubs, links, and satellites.
I’m not going to discuss what Data Vault is. There are lots of resources on the Internet from people who are much better at teaching what Data Vault is than I am.
What does a loading procedure look like in WhereScape?
In this post, I’m going to talk about loading to a satellite. The next one will be about hubs and links.
First, a satellite is a system of record, like a history table. The difference is that the data is collected around the HashKey of a hub or link (which represents the business keys of a business concept) instead of the business key from the source table.
What are the requirements to load data into a satellite?
The creation of the stage table is discussed in depth in another blog post. The main goal is to prepare the data to be loaded into a satellite.
Data from a history table is loaded first into a Raw Vault satellite. There should be no transformations except hard business rules like data type changes.
If we need to clean the data, do some calculations, or derive values from it, we load data from existing satellites into a Business Vault satellite. Do whatever you want in the stage table.
After the data set is the way we want it to be, we need to define the hub/link_keys and the dss_change_hash. The cleaning of the business keys for the hub or link hash key is taken care of by the template.
Up to the history level we had the usual system columns. For satellite loading we drop the dss_sequence_no. For the first load from a history table we also need a little transformation using the dss_sequence_no, to make sure we reduce the chance of hitting a duplicate row. No need to remove any rows from the source. Take them all!
DATEADD( NS , 100 * dss_sequence_no , dss_load_datetime )
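Applied as a transformation on the stage table, this could look roughly as follows (a sketch with assumed column selection; note that DATETIME2(7) has 100-nanosecond precision, which is why the offset is a multiple of 100 ns):

```sql
-- Sketch: spread rows that share a load datetime by
-- 100 ns per dss_sequence_no so no two rows collide.
SELECT hub_product_key
      ,dss_change_hash
      ,DATEADD( NS , 100 * dss_sequence_no , dss_load_datetime ) AS dss_load_datetime
FROM [TABLEOWNER].[stage_sat_demo]
```

Two rows with the same dss_load_datetime but sequence numbers 1 and 2 now end up 100 ns apart, so the duplicate check later in the load never has to throw data away.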
A satellite is a table of records for a given HashKey. The primary key of a satellite is the hub/link_key combined with the dss_start_datetime.
For a multi-active satellite, add another business key to the table. Refrain from doing that as much as possible. As WhereScape doesn’t support multi-active satellites yet, I was forced to do it differently, and this accidentally made my life easier.
dss_start_datetime is loaded from the dss_load_datetime. As I have an insert-only system, we move the calculation of the dss_end_datetime further down into a current or timeline view. All our satellites are clustered columnstore indexed tables, which gives great performance.
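For reference, a satellite like the sat_demo used in this post could be created roughly like this (a sketch: only the column names come from the load statement, the data types and index name are assumptions):

```sql
-- Sketch of a satellite table (types and index name are assumptions):
CREATE TABLE [sat_demo]
(
     hub_product_key       BINARY(16)    NOT NULL  -- hash key of the hub
    ,displayable_quantity  INT           NULL      -- descriptive attribute
    ,dss_event_datetime    DATETIME2(7)  NULL
    ,dss_load_datetime     DATETIME2(7)  NOT NULL
    ,dss_record_source     NVARCHAR(256) NULL
    ,dss_change_hash       BINARY(16)    NOT NULL
    ,dss_start_datetime    DATETIME2(7)  NOT NULL
    ,dss_version           INT           NOT NULL
    ,dss_create_datetime   DATETIME2(7)  NOT NULL
);

-- All our satellites are clustered columnstore indexed:
CREATE CLUSTERED COLUMNSTORE INDEX ccsi_sat_demo ON [sat_demo];
```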
To make the satellite more readable, and to avoid writing too many WHERE clauses in other objects, we need to create 2 views:
The current view retrieves the last valid record for a hub/link_key, while the timeline view gives back the whole timeline. Currently I’m not sure what the first datetime of the timeline should be: '0001-01-01' or the first dss_start_datetime? The templates support both options.
These are the steps I take to load data from a stage table into a satellite:
Insert new records
INSERT INTO Satellite
SUB-SELECT History Columns
SUB-SELECT Row Condensing
SUB-SELECT Where Not Exists
SUB-SELECT Extract from Stage Table
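The sub-selects nest inside each other. Stripped of the column lists, the shape is roughly this (a skeleton only; table and flag names are the ones used in this post):

```sql
-- Skeleton of the load statement (details omitted):
INSERT INTO [sat_demo]
SELECT <history columns>                        -- dss_start_datetime, dss_version, ...
FROM (
    SELECT <row condensing flags>               -- compare each row to its predecessor
    FROM (
        SELECT sat_demo.*
        FROM ( SELECT <columns> FROM [stage_sat_demo] ) AS sat_demo
        WHERE NOT EXISTS ( <already loaded?> )  -- where not exists
    ) AS sat_demo
) AS sat_demo
WHERE is_hash_change = 1 AND is_load_datetime_change = 1
```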
Extract from Stage Table
All the cleaning and transforming is done in stage. Now it’s time to load new data.
Where Not Exists
We now have 2 options. If you have a dss_version column, we must make sure that the records are inserted in the correct order: no old records are allowed. If you delete this column, like me, loading old and new data is possible.
SELECT *
FROM (sub_query) AS sat_demo
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_demo AS sat_demo__not_exist
    WHERE sat_demo__not_exist.<business key columns> = sat_demo.<business key columns>
      AND sat_demo__not_exist.dss_load_datetime = sat_demo.dss_load_datetime
)
We could save all records in a satellite, but that would be a waste of space. So we partition the data by the hub/link_keys and order it by dss_load_datetime. What can we check the data for?
- Is there a difference between the dss_change_hash from one row to the next?
- Is the dss_load_datetime different from one row to the next? This is a hard rule. We circumvent a primary key constraint warning by eliminating duplicates, which is dangerous. BUT… we have already taken care of that: in the stage table loading process we added a little time variance. The chance of hitting a duplicate now is nearly zero.
Here are the snippets to calculate both:
CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') )
          OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash
     THEN 0 ELSE 1 END AS is_hash_change
CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' )
          OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime
     THEN 0 ELSE 1 END AS is_load_datetime_change
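Put together, both flags are computed in one sub-select and the condensing itself is just a filter on them (a sketch using the sat_demo names from this post):

```sql
-- Sketch: keep only rows whose hash AND load datetime
-- differ from the previous row of the same key.
SELECT sat_demo.*
FROM (
    SELECT sat_demo.*
          ,CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') )
                     OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash
                THEN 0 ELSE 1 END AS is_hash_change
          ,CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' )
                     OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime
                THEN 0 ELSE 1 END AS is_load_datetime_change
    FROM [stage_sat_demo] AS sat_demo
) AS sat_demo
WHERE is_hash_change = 1 AND is_load_datetime_change = 1
```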
The template currently offers the following history columns:
dss_start_datetime is taken from the dss_load_datetime. Whether the first dss_start_datetime should be '0001-01-01' or not is decided in the timeline view.
Here is the result, including the history columns:
SELECT sat_demo.hub_product_key AS hub_product_key
      ,sat_demo.displayable_quantity AS displayable_quantity
      ,sat_demo.dss_event_datetime AS dss_event_datetime
      ,sat_demo.dss_load_datetime AS dss_load_datetime
      ,sat_demo.dss_record_source AS dss_record_source
      ,sat_demo.dss_change_hash AS dss_change_hash
      ,sat_demo.dss_start_datetime AS dss_start_datetime
      ,sat_demo.dss_version AS dss_version
      ,@v_current_datetime AS dss_create_datetime
FROM (
    SELECT sat_demo.*
          ,dss_load_datetime AS dss_start_datetime
          ,CASE WHEN current_rows.hub_product_key IS NULL
                THEN 1
                ELSE current_rows.dss_version + dss_row_no END AS dss_version
    FROM (
        SELECT sat_demo.*
              ,ROW_NUMBER() OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) AS dss_row_no
        FROM (
            SELECT sat_demo.*
                  ,CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') )
                             OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash
                        THEN 0 ELSE 1 END AS is_hash_change
                  ,CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' )
                             OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime
                        THEN 0 ELSE 1 END AS is_load_datetime_change
            FROM (
                SELECT sat_demo.*
                FROM (
                    SELECT stage_sat_demo.hub_product_key AS hub_product_key
                          ,stage_sat_demo.displayable_quantity AS displayable_quantity
                          ,stage_sat_demo.dss_event_datetime AS dss_event_datetime
                          ,stage_sat_demo.dss_load_datetime AS dss_load_datetime
                          ,stage_sat_demo.dss_record_source AS dss_record_source
                          ,stage_sat_demo.dss_change_hash AS dss_change_hash
                    FROM [TABLEOWNER].[stage_sat_demo] stage_sat_demo
                ) AS sat_demo
                WHERE NOT EXISTS (
                    SELECT 1
                    FROM [TABLEOWNER].[sat_demo] sat_demo__not_exist
                    WHERE sat_demo__not_exist.hub_product_key = sat_demo.hub_product_key
                      AND sat_demo__not_exist.dss_load_datetime < sat_demo.dss_load_datetime
                )
            ) AS sat_demo
        ) AS sat_demo
        WHERE is_hash_change = 1 AND is_load_datetime_change = 1
    ) AS sat_demo
    LEFT OUTER JOIN (
        SELECT sat_demo__current.hub_product_key AS hub_product_key
              ,MAX(sat_demo__current.dss_start_datetime) AS dss_start_datetime
              ,MAX(sat_demo__current.dss_version) AS dss_version
        FROM [TABLEOWNER].[sat_demo] sat_demo__current
        GROUP BY sat_demo__current.hub_product_key
    ) AS current_rows ON current_rows.hub_product_key = sat_demo.hub_product_key
) AS sat_demo
See the SQL example in full here.
To identify the most current record, we must find the most recent dss_start_datetime for a hub/link_key:
SELECT *
FROM [sat_demo] AS sat
INNER JOIN (
    SELECT hub_product_key AS hub_product_key
          ,MAX(dss_start_datetime) AS dss_start_datetime
    FROM [sat_demo] AS sat_demo
    GROUP BY hub_product_key
) AS current_rows
    ON  current_rows.hub_product_key = sat.hub_product_key
    AND current_rows.dss_start_datetime = sat.dss_start_datetime
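Since this query is exactly what the current view should return, it can be wrapped directly in a view (the view name here is an assumption):

```sql
-- Sketch: current view returning the latest record per key.
CREATE VIEW [sat_demo_current] AS
SELECT sat.*
FROM [sat_demo] AS sat
INNER JOIN (
    SELECT hub_product_key
          ,MAX(dss_start_datetime) AS dss_start_datetime
    FROM [sat_demo]
    GROUP BY hub_product_key
) AS current_rows
    ON  current_rows.hub_product_key = sat.hub_product_key
    AND current_rows.dss_start_datetime = sat.dss_start_datetime;
```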
Currently 4 columns are supported. I’m not sure yet what the first dss_start_datetime will be, so I added an extended property to define it. The template calculates the first dss_start_datetime with the following query:
CASE WHEN LAG( dss_start_datetime , 1 , NULL )
          OVER ( PARTITION BY hub/link_key ORDER BY dss_start_datetime ) IS NULL
     THEN '0001-01-01' ELSE dss_start_datetime END
dss_end_datetime is a column without a source column definition. The template will calculate it with the following query:
LEAD(DATEADD(ns,-100,dss_start_datetime),1,'9999-12-31') OVER ( PARTITION BY hub/link_key ORDER BY dss_start_datetime )
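Combining both expressions, a timeline view could look roughly like this (a sketch: the view name is an assumption, the columns are the sat_demo ones from this post):

```sql
-- Sketch: timeline view with calculated start and end datetimes.
CREATE VIEW [sat_demo_timeline] AS
SELECT hub_product_key
      ,displayable_quantity
      -- optionally let the first record of a key start at the beginning of time:
      ,CASE WHEN LAG( dss_start_datetime , 1 , NULL )
                 OVER ( PARTITION BY hub_product_key ORDER BY dss_start_datetime ) IS NULL
            THEN '0001-01-01' ELSE dss_start_datetime END AS dss_start_datetime
      -- end 100 ns before the next record starts, open-ended for the last one:
      ,LEAD( DATEADD( NS , -100 , dss_start_datetime ) , 1 , '9999-12-31' )
            OVER ( PARTITION BY hub_product_key ORDER BY dss_start_datetime ) AS dss_end_datetime
FROM [sat_demo];
```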
The other 2 columns exist, but they are not perfect yet. I need to learn more about bi-temporal modeling to get them right.
Download the following templates for WhereScape, written in Pebble, from the template section:
There is not much to do to set it up. Just select the template. That’s it.