Template: History (DataStore)

Creating a history layer (or Persistent Staging Area, as Roelant Vos calls it) gives you a safeguarding layer. All data arrives unmodified from the source.

People in the Data Vault space would probably suggest creating a Source Vault if I don’t want to build a proper Raw Vault with good Business Keys. I considered that option too, and might well have built it that way. But if WhereScape already has a history layer built in, why not use it? I have also used this kind of history in the past.

So, how do I ingest data into the history layer?

These are the requirements:

Load

Sources come in all kinds of types and systems. Loading the external data into the landing table is the first step: unmodified, as close to the original as possible.

Three system columns are added to the loaded data:

  • dss_load_datetime
  • dss_sequence_no
  • dss_record_source

What is the dss_load_datetime? It is the datetime at which the batch is inserted into the data warehouse. I give the whole batch the same datetime, even though the load might run for minutes. Very important: it is not the extraction date, it is the insert date. Think of it as a human-readable rowversion. In my terminology, it also defines a batch, and a load table can contain many batches, e.g. when looping through files or through similar databases that share the same data structure.
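A minimal sketch of the idea, with a hypothetical load table load_demo (the columns are the ones from the demo tables in the result section below):

DECLARE @v_load_datetime DATETIME2(7) = SYSDATETIME();  -- captured once per batch

INSERT INTO [TABLEOWNER].[load_demo] (shop_code, sid, request_id, dss_load_datetime)
SELECT src.shop_code
      ,src.sid
      ,src.request_id
      ,@v_load_datetime  -- every row of the batch gets the same insert datetime
FROM   external_source AS src;  -- external_source is a stand-in for the actual source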

Each row in a batch is sequenced. I have seen this in Dan Linstedt’s book and in other designs. He doesn’t persist it, as it is not needed in satellites. But since I have a history layer, I want to keep it for later use in satellites (I describe this later). If I can’t set dss_sequence_no directly in the load table, I add it in the stage step, as sketched below. Having it auto-increment in the load table is not possible, because the table can hold multiple batches.
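A sketch of sequencing in the stage step (the ORDER BY is an assumption; use whatever reflects the original arrival order):

SELECT load_demo.*
      ,ROW_NUMBER() OVER (
           PARTITION BY load_demo.dss_load_datetime         -- restart the sequence for every batch
           ORDER BY     load_demo.shop_code, load_demo.sid  -- assumed ordering
       ) AS dss_sequence_no
FROM   [TABLEOWNER].[load_demo] AS load_demo;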

The dss_record_source belongs to the Data Vault standard. Populate it however you like. We use the following rules:

  • {source_name}.{source_database}.{source_schema}.{source_table}
  • {source_name}.api.{api}
  • {filename} (which contains the subdirectory structure including the source_name)

Stage

The creation of the stage table is discussed in-depth in another blog post.

The goal is to add any missing system columns that are not yet defined in the load table and to compute a hash value for dss_change_hash.
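WhereScape generates the hash itself; as a hedged illustration, a dss_change_hash essentially boils down to hashing the descriptive columns (not the business key), here with the demo columns and a hypothetical load table:

SELECT load_demo.*
      ,HASHBYTES('MD5', CONCAT(
             COALESCE(CAST(load_demo.times       AS NVARCHAR(50)), ''), '||'   -- delimiter so that
            ,COALESCE(CAST(load_demo.order_id    AS NVARCHAR(50)), ''), '||'   -- shifted values do not
            ,COALESCE(CAST(load_demo.total_price AS NVARCHAR(50)), '')         -- produce the same hash
       )) AS dss_change_hash
FROM   [TABLEOWNER].[load_demo] AS load_demo;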

If you have many batches, you can speed up the next step by adding a non-clustered index on dss_load_datetime.
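For example (the index name is my own choice):

CREATE NONCLUSTERED INDEX ix_stage_hist_demo__dss_load_datetime
    ON [TABLEOWNER].[stage_hist_demo] (dss_load_datetime);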

History

A system of record for a given business key at a specific point in time. Only changes from one batch to the next are persisted. How do we identify a batch? By its dss_load_datetime, which always counts towards the future. Only the first load may contain historical data; after that, only new data arrives.

Persisting data over time needs at least the following system columns:

  • dss_start_datetime
  • dss_end_datetime

WhereScape also offers these additional system columns:

  • dss_current_flag
  • dss_version

Someone might think that dss_start_datetime is the same as dss_load_datetime. I thought so at the beginning too, but I ran into trouble later when joining different datasets together.

I concluded that the first batch starts at 0001-01-01 and gets an end date of 9999-12-31. From the second historization onwards, dss_start_datetime matches dss_load_datetime. I still keep dss_load_datetime in my history table; it is used throughout the system.
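To illustrate, here is a made-up example for a single business key with two batches (the end date of the expired record is one tick before the next start, as described under "Update expiring records" below):

dss_load_datetime     dss_start_datetime    dss_end_datetime             dss_current_flag
2024-01-05 10:00:00   0001-01-01 00:00:00   2024-02-01 08:29:59.9999999  N
2024-02-01 08:30:00   2024-02-01 08:30:00   9999-12-31 00:00:00          Y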

Views

A current view can speed up development later: no need to write WHERE clauses you might forget. And in WhereScape you can create an alias to another database, e.g. raw_mart.

The timeline is not needed as an extra view; it is already inside the history table. But to stay consistent with what comes later, I added it here.

The error view is a nice add-on if you load data from users who tend to write anything into Excel and you want to be sure it matches the later process. It doesn’t prevent anything (i.e. it doesn’t break the load), it just informs you.
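What such an error view checks depends entirely on the source. As a purely hypothetical sketch, flagging rows whose price column will not convert later:

CREATE VIEW [TABLEOWNER].[hist_demo_error] AS
SELECT load_demo.*
FROM   [TABLEOWNER].[load_demo] AS load_demo
WHERE  load_demo.total_price IS NOT NULL
       AND TRY_CAST(load_demo.total_price AS DECIMAL(18,2)) IS NULL;  -- value present but not convertible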

Stored Procedure

These are the steps I take:

  • Create Cursor from dss_load_datetime
  • Update expiring records
  • Insert new records
    • INSERT INTO History Table
    • SELECT Result
    • SUB-SELECT History Columns
    • SUB-SELECT Where Not Exists
    • SUB-SELECT Extract from Stage Table

Cursor

First, extract the last saved dss_load_datetime from the history table.

The cursor consists of a distinct, sorted list of dss_load_datetime values that are equal to or greater than the last dss_load_datetime. Why equal and not just greater? If something breaks, I want to start over where it ended. It only takes another second to check, it acts as a safety net, and no duplicate data is inserted.

Especially while migrating old data with thousands of batches, the load sometimes breaks. So it is good to be able to restart where it ended instead of starting all over again. This feature is not available by default in WhereScape, so before templates existed I used views to load only the not-yet-historized data.
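A minimal sketch of what such a cursor can look like, using the demo table names from the result section:

DECLARE @v_last_load_datetime DATETIME2(7);

-- floor value so the very first run also works on an empty history table
SELECT @v_last_load_datetime = COALESCE(MAX(dss_load_datetime), CAST('0001-01-01' AS DATETIME2(7)))
FROM   [TABLEOWNER].[hist_demo];

DECLARE batch_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT stage_hist_demo.dss_load_datetime
    FROM   [TABLEOWNER].[stage_hist_demo] stage_hist_demo
    WHERE  stage_hist_demo.dss_load_datetime >= @v_last_load_datetime  -- >= so a broken run can resume
    ORDER BY stage_hist_demo.dss_load_datetime;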

Update expiring records

It sets the expiring record’s dss_end_datetime to 100 nanoseconds before the next dss_start_datetime. That way, we can use BETWEEN in SQL statements, and no time tick exists twice. I recently learned another view on this topic from Dirk Lerner, but that is for another blog post. This is currently our rule set.
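To illustrate the tick arithmetic (made-up timestamp):

SELECT DATEADD(NS, -100, CAST('2024-02-01 08:30:00' AS DATETIME2(7)));
-- 2024-02-01 08:29:59.9999999, exactly one DATETIME2(7) tick earlier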

Extract from Stage Table

All the cleaning and transforming is done in stage. Now it’s time to load new data.

WHERE dss_load_datetime = @v_batch_start  -- the current cursor value

We don’t clean out duplicates. The indexes will make sure the load crashes if you hit one.
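The exact unique key depends on your design; in this setup it would plausibly be the business key plus dss_start_datetime, along these lines:

CREATE UNIQUE NONCLUSTERED INDEX ux_hist_demo__bk_start
    ON [TABLEOWNER].[hist_demo] (shop_code, sid, request_id, dss_start_datetime);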

Where Not Exists

To find out whether a record has changed, we get the latest record from the history table and check whether the dss_change_hash differs.

SELECT *
FROM (sub_query) AS history_table
WHERE NOT EXISTS (
  SELECT 1
  FROM   history_table AS history_table_not_exist
  WHERE  history_table_not_exist.<business key columns> = history_table.<business key columns>
         AND history_table_not_exist.dss_change_hash = history_table.dss_change_hash
         AND history_table_not_exist.dss_current_flag = 'Y'
)

History Columns

WhereScape has 4 history columns on offer. If you don’t need dss_version or dss_current_flag, delete them.

  • dss_start_datetime:
    CASE WHEN current_rows.shop_code IS NULL
         THEN CAST('0001-01-01' AS DATETIME2(7))
         ELSE hist_demo.dss_load_datetime
    END
  • dss_end_datetime:
    CAST('9999-12-31' AS DATETIME2(7))
  • dss_current_flag:
    'Y'
  • dss_version:
    CASE WHEN current_rows.shop_code IS NULL
         THEN 1
         ELSE current_rows.dss_version + 1
    END

To get the last version, we add a LEFT OUTER JOIN that extracts the maximum version per business key. See below.

Add-On

There is also a performance booster available if you insert history data with thousands of batches. In my experience, refreshing the indexes every 100 batches accelerates the inserts, but only while the table is not too big. Use it or delete it.
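A sketch of that booster inside the batch loop (the counter variable is hypothetical):

IF @v_batch_counter % 100 = 0
    ALTER INDEX ALL ON [TABLEOWNER].[hist_demo] REBUILD;  -- refresh the indexes every 100 batches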

Future

Instead of looping through a cursor, I want to load everything in one go. I haven’t had time yet to explore and define a template, so you will have to wait. And it needs to be bullet-proof, because the history layer is the foundation for everything that follows.

Result

Here is the query to UPDATE:

UPDATE [TABLEOWNER].[hist_demo] WITH ( TABLOCKX )
SET    dss_end_datetime = DATEADD(NS,-100,stage_hist_demo.dss_load_datetime)
      ,dss_current_flag = 'N'
      ,dss_update_datetime = @v_current_datetime
FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
WHERE  stage_hist_demo.dss_load_datetime = @v_batch_start
       AND hist_demo.dss_current_flag = 'Y'
       AND hist_demo.shop_code = stage_hist_demo.shop_code
       AND hist_demo.sid = stage_hist_demo.sid
       AND hist_demo.request_id = stage_hist_demo.request_id
       AND (
              hist_demo.dss_change_hash <> stage_hist_demo.dss_change_hash
           )

Here is the query to INSERT:

INSERT INTO [TABLEOWNER].[hist_demo] WITH ( TABLOCKX )
(
       shop_code
      ,sid
      ,request_id
      ,times
      ,order_id
      ,total_price
      ,dss_load_datetime
      ,dss_sequence_no
      ,dss_record_source
      ,dss_change_hash
      ,dss_start_datetime
      ,dss_end_datetime
      ,dss_current_flag
      ,dss_version
      ,dss_create_datetime
      ,dss_update_datetime
)
SELECT 
       hist_demo.shop_code AS shop_code 
      ,hist_demo.sid AS sid 
      ,hist_demo.request_id AS request_id 
      ,hist_demo.times AS times 
      ,hist_demo.order_id AS order_id 
      ,hist_demo.total_price AS total_price 
      ,hist_demo.dss_load_datetime AS dss_load_datetime 
      ,hist_demo.dss_sequence_no AS dss_sequence_no 
      ,hist_demo.dss_record_source AS dss_record_source 
      ,hist_demo.dss_change_hash AS dss_change_hash 
      ,hist_demo.dss_start_datetime AS dss_start_datetime 
      ,hist_demo.dss_end_datetime AS dss_end_datetime 
      ,hist_demo.dss_current_flag AS dss_current_flag 
      ,hist_demo.dss_version AS dss_version 
      ,@v_current_datetime AS dss_create_datetime 
      ,@v_current_datetime AS dss_update_datetime 
FROM   (
  SELECT hist_demo.*
        ,CASE WHEN current_rows.shop_code IS NULL
              THEN CAST('0001-01-01' AS DATETIME2(7))
              ELSE hist_demo.dss_load_datetime
         END AS dss_start_datetime 
        ,CAST('9999-12-31' AS DATETIME2(7)) AS dss_end_datetime 
        ,'Y' AS dss_current_flag 
        ,CASE WHEN current_rows.shop_code IS NULL
              THEN 1
              ELSE current_rows.dss_version + 1
         END AS dss_version 
  FROM   (
    SELECT hist_demo.*
    FROM   (
      SELECT 
             stage_hist_demo.shop_code AS shop_code 
            ,stage_hist_demo.sid AS sid 
            ,stage_hist_demo.request_id AS request_id 
            ,stage_hist_demo.times AS times 
            ,stage_hist_demo.order_id AS order_id 
            ,stage_hist_demo.total_price AS total_price 
            ,stage_hist_demo.dss_load_datetime AS dss_load_datetime 
            ,stage_hist_demo.dss_sequence_no AS dss_sequence_no 
            ,stage_hist_demo.dss_record_source AS dss_record_source 
            ,stage_hist_demo.dss_change_hash AS dss_change_hash 
      FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
      WHERE  stage_hist_demo.dss_load_datetime = @v_batch_start
    ) AS hist_demo
    WHERE NOT EXISTS (
            SELECT 1
            FROM   [TABLEOWNER].[hist_demo] hist_demo__not_exist
            WHERE  hist_demo__not_exist.shop_code = hist_demo.shop_code
                   AND hist_demo__not_exist.sid = hist_demo.sid
                   AND hist_demo__not_exist.request_id = hist_demo.request_id
                   AND hist_demo__not_exist.dss_change_hash = hist_demo.dss_change_hash
                   AND hist_demo__not_exist.dss_current_flag = 'Y'
    )
  ) AS hist_demo
  LEFT OUTER JOIN (
          SELECT
                 hist_demo__current.shop_code AS shop_code
                ,hist_demo__current.sid AS sid
                ,hist_demo__current.request_id AS request_id
                ,MAX(hist_demo__current.dss_version) AS dss_version
          FROM  [TABLEOWNER].[hist_demo] hist_demo__current
          GROUP BY
                 hist_demo__current.shop_code
                ,hist_demo__current.sid
                ,hist_demo__current.request_id
  ) AS current_rows
      ON  current_rows.shop_code = hist_demo.shop_code
      AND current_rows.sid = hist_demo.sid
      AND current_rows.request_id = hist_demo.request_id
) AS hist_demo

Of course, it would be possible to replace both steps with a MERGE statement. My experience with that is not the best: with huge tables or data sets it just takes a long time, and sometimes SQL Server "got irritated" about what to do.

Current View

Often, we use only the currently valid record. Instead of writing WHERE clauses everywhere, I just create a view from the history table and select the appropriate template. That’s it. What does it do?

It adds a simple WHERE dss_current_flag = 'Y'. Not that difficult. But it will get more complex when we come to Satellites and Link Satellites.
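In essence, the generated view boils down to this (the view name is my own choice):

CREATE VIEW [TABLEOWNER].[hist_demo_current] AS
SELECT hist_demo.*
FROM   [TABLEOWNER].[hist_demo] AS hist_demo
WHERE  hist_demo.dss_current_flag = 'Y';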

If this view is business critical, I also added a template for an indexed view.

Implementation

Download the following WhereScape templates, written in Pebble, from the template section:

  • nu3_sqlserver_procedure_hist
  • nu3_sqlserver_view_hist_current

To set up:

  • Select the template "nu3_sqlserver_procedure_hist"
  • Set the Business Key Columns
  • Set the Change Detection Fields to "dss_change_hash"

That’s it.