Template: Satellite

Finally, we reach our goal of talking about Data Vault objects. This is a new area I’m working in and learning about, and you might be interested in it as well. Data Vault consists of mainly three objects:

  • Hub
  • Link
  • Satellite

I’m not going to discuss what Data Vault is. There are lots of resources on the Internet from people who are much better at teaching Data Vault than I am.

What does a loading procedure look like in WhereScape?

In this post, I’m going to talk about loading to a satellite. The next one will be about hubs and links.

First, a satellite is a system of record, much like a history table. The difference is that the data is collected around the HashKey of a hub or link (which represents the business keys of a business concept) instead of the business key from the source table.

What are the requirements to load data into a satellite?

Stage

The creation of the stage table is discussed in depth in another blog post. The main goal is to prepare the data to be loaded into a satellite.

Data from a history table is first loaded into a Raw Vault satellite. There should be no transformations except hard business rules like data type changes.

If we need to clean the data, do some calculations, or derive values, we load data from existing satellites into a Business Vault satellite. Do whatever you want in the stage table.

Once the data set is the way we want it, we need to define the hub/link_keys and the dss_change_hash. The cleaning of the business keys for the hub or link hash key is taken care of by the template.
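For illustration, here is a minimal sketch of how those two columns could be derived in the stage table. The column name product_code, the MD5 hashing via HASHBYTES, the trimming/upper-casing of keys, and the '~' delimiter are my assumptions, not necessarily what the template generates:

SELECT 
       -- hub hash key: hashed business key (cleaning rules are assumed here)
       HASHBYTES('MD5', UPPER(LTRIM(RTRIM(COALESCE(product_code,''))))) AS hub_product_key
       -- change hash: all descriptive attributes, cast to text and delimited
      ,HASHBYTES('MD5', COALESCE(CAST(displayable_quantity AS VARCHAR(20)),'') + '~'
                      + COALESCE(CONVERT(VARCHAR(27), dss_event_datetime, 121),'')) AS dss_change_hash
FROM   [TABLEOWNER].[stage_sat_demo]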

Up to the history level we had the following system columns:

  • dss_load_datetime
  • dss_sequence_no
  • dss_record_source

For satellite loading we drop the dss_sequence_no. For the first load from a history table we also need to do a little transformation using the dss_sequence_no, to make sure that we reduce the chance of hitting a duplicate row. There is no need to remove any rows from the source. Take them all!

DATEADD( NS , 100 * dss_sequence_no , dss_load_datetime )
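In context, the expression could sit in the stage SELECT over the history table. The statement below is only a sketch (the history table and the product_code column are hypothetical); one tick of DATETIME2(7) is 100 nanoseconds, so each dss_sequence_no shifts the timestamp by one tick and near-duplicate load datetimes become distinct:

SELECT product_code          -- hypothetical business key column
      ,displayable_quantity
      ,dss_event_datetime
      ,DATEADD( NS , 100 * dss_sequence_no , dss_load_datetime ) AS dss_load_datetime
      ,dss_record_source
FROM   [TABLEOWNER].[history_demo]  -- hypothetical history table name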

Satellite

A satellite is a table of records for a given HashKey. The Primary Key of a satellite is:

  • hub/link_key
  • dss_start_datetime

For a multi-active satellite, add another business key to the table. Refrain from doing that as much as possible. As WhereScape doesn’t support multi-active satellites yet, I was forced to do it differently, and by accident this made my life easier.

The dss_start_datetime is loaded from the dss_load_datetime.

As I have an insert-only system, we move the calculation of the dss_end_datetime further down to a current or timeline view. All our satellites are clustered columnstore indexed tables: great performance for INSERT and SELECT.
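As a rough sketch, a satellite built this way could look like the DDL below. The data types are assumptions based on the column names used in this post, and the primary key is modeled as a nonclustered constraint (supported on clustered columnstore tables from SQL Server 2016 on) because the clustered slot is taken by the columnstore index:

CREATE TABLE [TABLEOWNER].[sat_demo] (
    hub_product_key       BINARY(16)    NOT NULL  -- hash key of the parent hub
   ,displayable_quantity  INT           NULL      -- descriptive attribute
   ,dss_event_datetime    DATETIME2(7)  NOT NULL
   ,dss_load_datetime     DATETIME2(7)  NOT NULL
   ,dss_record_source     VARCHAR(256)  NOT NULL
   ,dss_change_hash       BINARY(16)    NOT NULL
   ,dss_start_datetime    DATETIME2(7)  NOT NULL
   ,dss_version           INT           NOT NULL
   ,dss_create_datetime   DATETIME2(7)  NOT NULL
   ,CONSTRAINT pk_sat_demo PRIMARY KEY NONCLUSTERED (hub_product_key, dss_start_datetime)
);

CREATE CLUSTERED COLUMNSTORE INDEX ccsi_sat_demo ON [TABLEOWNER].[sat_demo];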

View

To make the satellite more readable, and to avoid writing too many WHERE clauses in other objects, we need to create two views:

  • Current
  • Timeline

The current view retrieves the last valid record for a hub/link_key, while the timeline view gives back the whole timeline. Currently I’m not sure what the first datetime of the timeline should be: '0001-01-01' or the first dss_start_datetime? The templates support both options.

Stored Procedure

These are the steps I take to load data from a stage table into a satellite:

  • Insert new records
    • INSERT INTO Satellite
    • SELECT Result
    • SUB-SELECT History Columns
    • SUB-SELECT Row Condensing
    • SUB-SELECT Where Not Exists
    • SUB-SELECT Extract from Stage Table

Extract from Stage Table

All the cleaning and transforming is done in stage. Now it’s time to load new data.

Where Not Exists

We now have two options. If you have a dss_version column, we must make sure that records are inserted in the correct order; no old records are allowed.

If you delete this column, like I do, loading both old and new data is possible.

SELECT *
FROM (sub_query) AS sat_demo
WHERE NOT EXISTS (
  SELECT 1
  FROM   sat_demo AS sat_demo__not_exist
  WHERE  sat_demo__not_exist.<business key columns> = sat_demo.<business key columns>
         AND sat_demo__not_exist.dss_load_datetime = sat_demo.dss_load_datetime
)

Row Condensing

This process is derived from the great work of Roeland Vos on this topic.

We could save all records in a satellite. But that would be a waste of space. So, we partition the data by hub/link_keys and order it by dss_load_datetime. What can we check the data for?

  • Is there a difference between dss_change_hash from one row to the other?
  • Is the dss_load_datetime different from one row to the other? This is a hard rule. We circumvent a Primary Key constraint violation by eliminating duplicates, which is dangerous. BUT… we have already taken care of that: we added a little time variance in the stage table loading process. The chance of hitting a duplicate is now nearly zero.

Here are the snippets that calculate both:

CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash THEN 0
     ELSE 1
END AS is_hash_change

CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime THEN 0
     ELSE 1
END AS is_load_datetime_change
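To make the condensing concrete, here is a hypothetical sequence of rows for one hub_product_key. Only rows where both flags are 1 survive:

-- dss_load_datetime        dss_change_hash  is_hash_change  kept?
-- 2019-01-01 00:00:00.000  0xAAAA...        1               yes (first row; LAG returns the default)
-- 2019-01-02 00:00:00.000  0xAAAA...        0               no  (same hash as the previous row)
-- 2019-01-03 00:00:00.000  0xBBBB...        1               yes (the attributes changed)
-- 2019-01-04 00:00:00.000  0xAAAA...        1               yes (changed back; LAG only looks one row back)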

History Columns

The template currently offers the following history columns:

  • dss_start_datetime
  • dss_version

The dss_start_datetime is taken from dss_load_datetime.

Whether the first dss_start_datetime should be '0001-01-01' or not is decided in the timeline view.

Result

Here is the result including the dss_version column:

SELECT 
       sat_demo.hub_product_key AS hub_product_key 
      ,sat_demo.displayable_quantity AS displayable_quantity 
      ,sat_demo.dss_event_datetime AS dss_event_datetime 
      ,sat_demo.dss_load_datetime AS dss_load_datetime 
      ,sat_demo.dss_record_source AS dss_record_source 
      ,sat_demo.dss_change_hash AS dss_change_hash 
      ,sat_demo.dss_start_datetime AS dss_start_datetime 
      ,sat_demo.dss_version AS dss_version 
      ,@v_current_datetime AS dss_create_datetime 
FROM   (
  SELECT sat_demo.*
        ,dss_load_datetime AS dss_start_datetime
         -- new keys start at their row number; existing keys continue from the current version
        ,CASE WHEN current_rows.hub_product_key IS NULL
              THEN dss_row_no
              ELSE current_rows.dss_version + dss_row_no
         END AS dss_version
  FROM   (
    SELECT sat_demo.*
          ,ROW_NUMBER() OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) AS dss_row_no
    FROM   (
      SELECT sat_demo.*
            ,CASE WHEN LAG( dss_change_hash , 1 , CONVERT(BINARY(16),'') ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_change_hash THEN 0
                  ELSE 1
             END AS is_hash_change
            ,CASE WHEN LAG( dss_load_datetime , 1 , '0001-01-01' ) OVER ( PARTITION BY hub_product_key ORDER BY dss_load_datetime ) = dss_load_datetime THEN 0
                  ELSE 1
             END AS is_load_datetime_change
      FROM   (
        SELECT sat_demo.*
        FROM   (
          SELECT 
                 stage_sat_demo.hub_product_key AS hub_product_key 
                ,stage_sat_demo.displayable_quantity AS displayable_quantity 
                ,stage_sat_demo.dss_event_datetime AS dss_event_datetime 
                ,stage_sat_demo.dss_load_datetime AS dss_load_datetime 
                ,stage_sat_demo.dss_record_source AS dss_record_source 
                ,stage_sat_demo.dss_change_hash AS dss_change_hash 
          FROM [TABLEOWNER].[stage_sat_demo] stage_sat_demo
        ) AS sat_demo
        WHERE NOT EXISTS (
                SELECT 1
                FROM   [TABLEOWNER].[sat_demo] sat_demo__not_exist
                WHERE  sat_demo__not_exist.hub_product_key = sat_demo.hub_product_key
                       AND sat_demo__not_exist.dss_load_datetime >= sat_demo.dss_load_datetime -- only strictly newer records may be inserted
        )
      ) AS sat_demo
    ) AS sat_demo
    WHERE  is_hash_change = 1
    AND    is_load_datetime_change = 1
  ) AS sat_demo
  LEFT OUTER JOIN (
          SELECT
                 sat_demo__current.hub_product_key AS hub_product_key
                ,MAX(sat_demo__current.dss_start_datetime) AS dss_start_datetime
                ,MAX(sat_demo__current.dss_version) AS dss_version
          FROM   [TABLEOWNER].[sat_demo] sat_demo__current
          GROUP BY
                 sat_demo__current.hub_product_key
  ) AS current_rows
      ON  current_rows.hub_product_key = sat_demo.hub_product_key
) AS sat_demo

See the SQL example in full here.

Views

Current View

To identify the most current record we must find out the most recent dss_start_datetime for a hub/link_key.

SELECT sat_demo.*
FROM   [sat_demo] AS sat_demo
INNER JOIN (
          SELECT
                 hub_product_key AS hub_product_key
                ,MAX(dss_start_datetime) AS dss_start_datetime
          FROM   [sat_demo] AS sat_demo__current
          GROUP BY
                 hub_product_key
  ) AS current_rows
      ON  current_rows.hub_product_key = sat_demo.hub_product_key
      AND current_rows.dss_start_datetime = sat_demo.dss_start_datetime
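Other objects can then join the current view without repeating the MAX logic. In this hypothetical usage, the view name sat_demo_current, the hub table, and its columns are made up:

SELECT hub_product.product_code
      ,sat_demo_current.displayable_quantity
FROM   [TABLEOWNER].[hub_product] AS hub_product
INNER JOIN [TABLEOWNER].[sat_demo_current] AS sat_demo_current
       ON sat_demo_current.hub_product_key = hub_product.hub_product_key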

Timeline View

Currently, four columns are supported:

  • dss_start_datetime
  • dss_end_datetime
  • dss_effectivity_datetime
  • dss_expiry_datetime

Currently I’m not sure what the first dss_start_datetime should be, so I added an extended property to define it. The template calculates the first dss_start_datetime with the following query:

CASE WHEN LAG(dss_start_datetime,1,NULL) OVER (PARTITION BY hub/link_key ORDER BY dss_start_datetime) IS NULL THEN '0001-01-01' ELSE dss_start_datetime END

Add dss_end_datetime as a column without a source column definition.

The template will calculate it with the following query:

LEAD(DATEADD(ns,-100,dss_start_datetime),1,'9999-12-31') OVER ( PARTITION BY hub/link_key ORDER BY dss_start_datetime )
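Putting both snippets together, a timeline view could look roughly like this. The view name and the use of hub_product_key as the partitioning key are assumptions based on the examples above:

CREATE VIEW [TABLEOWNER].[sat_demo_timeline] AS
SELECT hub_product_key
      ,displayable_quantity
       -- the first record of a key optionally starts at the beginning of time
      ,CASE WHEN LAG( dss_start_datetime , 1 , NULL ) OVER ( PARTITION BY hub_product_key ORDER BY dss_start_datetime ) IS NULL
            THEN '0001-01-01'
            ELSE dss_start_datetime
       END AS dss_start_datetime
       -- end 100ns before the next record starts; the latest record runs to the end of time
      ,LEAD( DATEADD( ns , -100 , dss_start_datetime ) , 1 , '9999-12-31' ) OVER ( PARTITION BY hub_product_key ORDER BY dss_start_datetime ) AS dss_end_datetime
FROM   [TABLEOWNER].[sat_demo]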

The other two columns exist, but are not perfect yet. I need to learn more about bi-temporal modeling to get them right.

Implementation

Download the following WhereScape templates, written in Pebble, from the template section:

  • nu3_sqlserver_procedure_sat
  • nu3_sqlserver_block_select_sat
  • nu3_sqlserver_ddl_view_sat_current
  • nu3_sqlserver_ddl_view_sat_timeline

There is not much to do to set it up. Just select the template. That’s it.