Template: Satellite
Dec 16, 2017

Finally, we reach our goal of talking about Data Vault objects. This is a new area that I'm working in and learning about, and you might be interested in it as well. Data Vault consists of mainly three objects: hubs, links, and satellites.
I'm not going to discuss what Data Vault is. There are lots of resources on the Internet from people who are much better at teaching Data Vault than I am.
What does a loading procedure look like in WhereScape?
In this post, I’m going to talk about loading to a satellite. The next one will be about hubs and links.
First, a satellite is a system of record, much like a history table. The difference is that the data is collected around the HashKey of a hub or link (which represents the business keys of a business concept) instead of around the business key from the source table.
What are the requirements to load data into a satellite?
Stage
The creation of the stage table is discussed in depth in another blog post. The main goal is to prepare the data to be loaded into a satellite.
Data from a history table is loaded first to a Raw Vault Satellite. There should be no transformations except hard business rules like data type changes.
If we need to clean the data, do some calculations, or derive values from it, we load data from existing satellites into a Business Vault satellite. Do whatever you want in the stage table.
After the data set is the way we want it, we need to define the hub/link_keys and the dss_change_hash. The cleaning of the business keys for the hub or link hash key is taken care of by the template.
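As a hedged illustration of what those key definitions could look like in the stage table (the exact expressions are generated by the template; the column names and the MD5 hashing are assumptions for this sketch):

-- Sketch only: hub hash key from the cleaned business key,
-- change hash over the descriptive attributes. Names are illustrative.
,CONVERT( BINARY(16) , HASHBYTES( 'MD5' ,
    UPPER( LTRIM( RTRIM( product_code ) ) )
) ) AS hub_product_key
,CONVERT( BINARY(16) , HASHBYTES( 'MD5' ,
    ISNULL( CAST( displayable_quantity AS VARCHAR(20) ) , '' )
) ) AS dss_change_hash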
Up to the history table level we had the following system columns:
dss_load_datetime
dss_sequence_no
dss_record_source
For satellite loading we drop the dss_sequence_no. For the first load from a history table we also need a little transformation using the dss_sequence_no to reduce the chance of hitting a duplicate row. There is no need to remove any rows from the source. Take them all!
DATEADD( NS , 100 * dss_sequence_no , dss_load_datetime )
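A minimal sketch of how this could look in the stage extract (the history table hist_demo and its columns are assumed names). Since datetime2 resolves to 100 nanoseconds, every dss_sequence_no step shifts the timestamp by exactly one tick, so rows that share a dss_load_datetime stay unique:

-- Hypothetical stage extract; hist_demo and its columns are illustrative.
SELECT
    hist_demo.product_code AS product_code
    ,hist_demo.displayable_quantity AS displayable_quantity
    ,DATEADD( NS , 100 * hist_demo.dss_sequence_no , hist_demo.dss_load_datetime ) AS dss_load_datetime
    ,hist_demo.dss_record_source AS dss_record_source
FROM [TABLEOWNER].[hist_demo] hist_demo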
Satellite
A table of records for a given HashKey. The Primary Key of a satellite is:
hub/link_key
dss_start_datetime
For a multi-active satellite, add another business key to the table. Refrain from doing that as much as possible. As WhereScape doesn't support multi-active satellites yet, I was forced to do it differently, and by accident this made my life easier.
The dss_start_datetime is loaded from the dss_load_datetime.
As I have an insert-only system, we move the calculation of the dss_end_datetime further down into a current or timeline view. All our satellites are clustered columnstore indexed tables, which gives great performance for INSERT and SELECT.
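As an illustrative DDL sketch (the column types are assumptions, and SQL Server 2016+ is assumed so that a nonclustered primary key can coexist with a clustered columnstore index):

CREATE TABLE [TABLEOWNER].[sat_demo] (
    hub_product_key BINARY(16) NOT NULL -- hash key of the parent hub
    ,dss_start_datetime DATETIME2(7) NOT NULL -- taken from dss_load_datetime
    ,displayable_quantity INT NULL -- descriptive attribute
    ,dss_event_datetime DATETIME2(7) NOT NULL
    ,dss_load_datetime DATETIME2(7) NOT NULL
    ,dss_record_source VARCHAR(64) NOT NULL
    ,dss_change_hash BINARY(16) NOT NULL
    ,dss_version INT NOT NULL
    ,dss_create_datetime DATETIME2(7) NOT NULL
    ,CONSTRAINT pk_sat_demo PRIMARY KEY NONCLUSTERED ( hub_product_key , dss_start_datetime )
    ,INDEX cci_sat_demo CLUSTERED COLUMNSTORE
)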
View
To make the satellite more readable and to avoid writing too many WHERE clauses in other objects, we need to create two views: a current view and a timeline view. The current view retrieves the last valid record for a hub/link_key, while the timeline view gives back the whole timeline. Currently I'm not sure what the first datetime of the timeline should be: '0001-01-01' or the first dss_start_datetime? The templates support both options.
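As a usage example (the view names sat_demo_current and sat_demo_timeline are assumptions matching the demo satellite used below):

-- Latest state per hub key
SELECT hub_product_key , displayable_quantity
FROM [TABLEOWNER].[sat_demo_current]

-- State as of a given point in time, via the timeline view
SELECT hub_product_key , displayable_quantity
FROM [TABLEOWNER].[sat_demo_timeline]
WHERE '2017-12-01' BETWEEN dss_start_datetime AND dss_end_datetime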
Stored Procedure
Which steps do I take to load data from a stage table to a satellite?
Insert new records
    INSERT INTO Satellite
        SELECT Result
            SUB-SELECT History Columns
                SUB-SELECT Row Condensing
                    SUB-SELECT Where Not Exists
                        SUB-SELECT Extract from Stage Table
Extract from Stage Table
All the cleaning and transforming is done in stage. Now it’s time to load new data.
Where Not Exists
We now have two options. If you have a dss_version column, we must make sure that the records are inserted in the correct order: no old records are allowed. If you delete this column, as I do, loading old and new data is possible.
SELECT *
FROM (sub_query) AS sat_demo
WHERE NOT EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[sat_demo] sat_demo__not_exist
    WHERE sat_demo__not_exist.<business key columns> = sat_demo.<business key columns>
    AND sat_demo__not_exist.dss_load_datetime = sat_demo.dss_load_datetime
)
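If you keep the dss_version column, the equality tightens so that only rows strictly newer than anything already stored pass, as in the full example further below (a sketch, reusing the demo names):

WHERE NOT EXISTS (
    SELECT 1
    FROM [TABLEOWNER].[sat_demo] sat_demo__not_exist
    WHERE sat_demo__not_exist.hub_product_key = sat_demo.hub_product_key
    AND sat_demo__not_exist.dss_load_datetime >= sat_demo.dss_load_datetime
)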
Row Condensing
This process is derived from the great work of Roeland Vos on this topic.
We could save all records in a satellite, but that would be a waste of space. So we partition the data by the hub/link_keys and order it by dss_load_datetime. What can we check the data for?
- Is there a difference in the dss_change_hash from one row to the next?
- Is the dss_load_datetime different from one row to the next? This is a hard rule. Eliminating duplicates would circumvent a Primary Key constraint warning, which is dangerous. BUT... we have already taken care of that: we added a little time variance in the stage table loading process, so the chance of hitting a duplicate is now nearly zero.
Following are the snippets to calculate both flags. Note the LAG defaults: an empty BINARY(16) and '0001-01-01' will not match any real hash or load datetime, so the first row of each key is always flagged as a change.
CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash THEN 0
ELSE 1
END AS is_hash_change
CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime THEN 0
ELSE 1
END AS is_load_datetime_change
History Columns
The template currently offers the following history columns:
dss_start_datetime
dss_version
The dss_start_datetime is taken from the dss_load_datetime. Whether the first dss_start_datetime should be '0001-01-01' or not is decided in the timeline view.
Result
Here is the result, including the dss_version column:
SELECT
sat_demo.hub_product_key AS hub_product_key
,sat_demo.displayable_quantity AS displayable_quantity
,sat_demo.dss_event_datetime AS dss_event_datetime
,sat_demo.dss_load_datetime AS dss_load_datetime
,sat_demo.dss_record_source AS dss_record_source
,sat_demo.dss_change_hash AS dss_change_hash
,sat_demo.dss_start_datetime AS dss_start_datetime
,sat_demo.dss_version AS dss_version
,@v_current_datetime AS dss_create_datetime
FROM (
SELECT sat_demo.*
,dss_load_datetime AS dss_start_datetime
,CASE WHEN current_rows.hub_product_key IS NULL
THEN dss_row_no -- new key: versions start at the row number
ELSE current_rows.dss_version + dss_row_no -- existing key: continue from the current version
END AS dss_version
FROM (
SELECT sat_demo.*
,ROW_NUMBER() OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) AS dss_row_no
FROM (
SELECT sat_demo.*
,CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash THEN 0
ELSE 1
END AS is_hash_change
,CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime THEN 0
ELSE 1
END AS is_load_datetime_change
FROM (
SELECT sat_demo.*
FROM (
SELECT
stage_sat_demo.hub_product_key AS hub_product_key
,stage_sat_demo.displayable_quantity AS displayable_quantity
,stage_sat_demo.dss_event_datetime AS dss_event_datetime
,stage_sat_demo.dss_load_datetime AS dss_load_datetime
,stage_sat_demo.dss_record_source AS dss_record_source
,stage_sat_demo.dss_change_hash AS dss_change_hash
FROM [TABLEOWNER].[stage_sat_demo] stage_sat_demo
) AS sat_demo
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[sat_demo] sat_demo__not_exist
WHERE sat_demo__not_exist.hub_product_key = sat_demo.hub_product_key
AND sat_demo__not_exist.dss_load_datetime >= sat_demo.dss_load_datetime
)
) AS sat_demo
) AS sat_demo
WHERE is_hash_change = 1
AND is_load_datetime_change = 1
) AS sat_demo
LEFT OUTER JOIN (
SELECT
sat_demo__current.hub_product_key AS hub_product_key
,MAX(sat_demo__current.dss_start_datetime) AS dss_start_datetime
,MAX(sat_demo__current.dss_version) AS dss_version
FROM [TABLEOWNER].[sat_demo] sat_demo__current
GROUP BY
sat_demo__current.hub_product_key
) AS current_rows
ON current_rows.hub_product_key = sat_demo.hub_product_key
) AS sat_demo
See the SQL example in full here.
Views
Current View
To identify the most current record, we must find the most recent dss_start_datetime for a hub/link_key.
SELECT sat_demo.*
FROM sat_demo
INNER JOIN (
SELECT
hub_product_key AS hub_product_key
,MAX(dss_start_datetime) AS dss_start_datetime
FROM [sat_demo] AS sat_demo
GROUP BY
hub_product_key
) AS current_rows
ON current_rows.hub_product_key = sat_demo.hub_product_key
AND current_rows.dss_start_datetime = sat_demo.dss_start_datetime
Timeline View
Currently there are 4 columns supported:
dss_start_datetime
dss_end_datetime
dss_effectivity_datetime
dss_expiry_datetime
Currently I'm not sure what the first dss_start_datetime should be, so I added an extended property to define it. The template will calculate the first dss_start_datetime with the following query:
CASE WHEN LAG(dss_start_datetime,1,NULL) OVER (PARTITION BY hub/link_key ORDER BY dss_start_datetime) IS NULL THEN '0001-01-01' ELSE dss_start_datetime END
Add dss_end_datetime as a column without a source column definition. The template will calculate it with the following query:
LEAD(DATEADD(ns,-100,dss_start_datetime),1,'9999-12-31') OVER ( PARTITION BY hub/link_key ORDER BY dss_start_datetime )
The other two columns exist but are not perfect yet. I need to learn more about bi-temporal modeling to get them right.
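Putting the two supported columns together, a timeline view could look roughly like this (a sketch; the view name and column list are assumptions based on the demo satellite):

CREATE VIEW [TABLEOWNER].[sat_demo_timeline] AS
SELECT
    hub_product_key AS hub_product_key
    ,CASE WHEN LAG(dss_start_datetime,1,NULL) OVER ( PARTITION BY hub_product_key ORDER BY dss_start_datetime ) IS NULL
        THEN '0001-01-01'
        ELSE dss_start_datetime
    END AS dss_start_datetime
    ,LEAD(DATEADD(ns,-100,dss_start_datetime),1,'9999-12-31') OVER ( PARTITION BY hub_product_key ORDER BY dss_start_datetime ) AS dss_end_datetime
    ,displayable_quantity AS displayable_quantity
FROM [TABLEOWNER].[sat_demo]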
Implementation
Download the following templates for WhereScape, written in Pebble, from the template section:
- nu3_sqlserver_procedure_sat
- nu3_sqlserver_block_select_sat
- nu3_sqlserver_ddl_view_sat_current
- nu3_sqlserver_ddl_view_sat_timeline
There is not much to do to set it up. Just select the template. That’s it.