Creating a history layer (or Persistent Staging Area, as Roelant Vos calls it) gives you a safeguarding layer. All data arrives unmodified from the source.
People in the Data Vault space would probably suggest creating a Source Vault if I don’t want to build a proper Raw Vault with good Business Keys. I considered that option too, and I probably would have built it that way. But I thought: if WhereScape already has a history layer built in, why not use it? I have also used that kind of history layer in the past.
So, how do I ingest data into the history layer?
What are the requirements?
Load
We can have any kind of source type and source system. The first step is loading the external data into the landing table: unmodified, as original as possible.
To the loaded data, 3 system columns are added:
- dss_load_datetime
- dss_sequence_no
- dss_record_source
What is the dss_load_datetime? It is the datetime at which the batch is inserted into the data warehouse. I give the whole batch the same datetime, even though the load might take minutes. Very important: it is not the extraction date, it is the insert date. See it as a human-readable rowversion. In my terminology, one dss_load_datetime also identifies a batch, and there can be many batches in the load table, e.g. when looping through files or through similar databases that share the same data structure.
Each row in a batch is sequenced. I have seen that in Dan Linstedt’s book and in other designs. He doesn’t persist it, as it is not needed in satellites. But since I have a history layer, I want to keep it for later use in satellites (I will describe this later). If I can’t set the dss_sequence_no right in the load table, I add it in the stage step. Having it auto-increment in the load table is not possible, because I can have multiple batches.
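If the sequence has to be added in the stage step, a minimal sketch could look like this (load_demo is a hypothetical landing table, and the ordering expression is an assumption):

SELECT
    load_demo.*
   ,ROW_NUMBER() OVER (
        PARTITION BY load_demo.dss_load_datetime   -- restart numbering for every batch
        ORDER BY (SELECT NULL)                     -- replace with a stable source order if one exists
    ) AS dss_sequence_no
FROM [TABLEOWNER].[load_demo] load_demo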
The dss_record_source belongs to the Data Vault standard. Just add it the way you want. We have the following rules:
- {source_name}.{source_database}.{source_schema}.{source_table}
- {source_name}.api.{api}
- {filename} (which contains the subdirectory structure including the source_name)
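To make the shape of this layer concrete, a landing table for the demo data used in the queries below might look roughly like this (a sketch; the data types and the example record source are my assumptions, not WhereScape output):

CREATE TABLE [TABLEOWNER].[load_demo]
(
    shop_code          NVARCHAR(50)      -- source attributes, unmodified
   ,sid                NVARCHAR(50)
   ,request_id         NVARCHAR(50)
   ,times              DATETIME2(7)
   ,order_id           NVARCHAR(50)
   ,total_price        DECIMAL(18,2)
   ,dss_load_datetime  DATETIME2(7)      -- insert datetime, identical for the whole batch
   ,dss_sequence_no    INT               -- row sequence within the batch
   ,dss_record_source  NVARCHAR(255)     -- e.g. 'shop.shopdb.dbo.orders'
)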
Stage
The creation of the stage table is discussed in depth in another blog post.
The goal is to add any missing system columns that are not yet defined in the load table and to add a hash value for dss_change_hash.
If you have many batches, you can speed up the next step by adding a non-clustered index on dss_load_datetime.
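As a sketch of both points, assuming an MD5 hash over the concatenated non-key attributes (the algorithm and delimiter are my assumptions, not necessarily what the template generates):

-- hash over the non-key attributes only; the business key identifies the record
SELECT
    load_demo.*
   ,HASHBYTES('MD5', CONCAT(load_demo.times, '||', load_demo.order_id, '||', load_demo.total_price)) AS dss_change_hash
FROM [TABLEOWNER].[load_demo] load_demo;

-- helps the batch cursor in the history step
CREATE NONCLUSTERED INDEX idx_stage_hist_demo__load_datetime
    ON [TABLEOWNER].[stage_hist_demo] (dss_load_datetime);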
History
The history table is a system of record for a given business key at a specific time. Only changes from one batch to another are persisted. How do we identify a batch? By the dss_load_datetime, and it always counts towards the future: only in the first load can you load historical data, after that only new data arrives.
Persisting data over time needs at least the following system columns:
- dss_start_datetime
- dss_end_datetime
WhereScape also offers these additional system columns:
- dss_current_flag
- dss_version
Someone might think that the dss_start_datetime is the same as the dss_load_datetime. I thought so at the beginning too, but I later had trouble joining different datasets together. I concluded to start the first batch at 0001-01-01 with an end date of 9999-12-31. From the second historization on, dss_load_datetime matches dss_start_datetime. I still keep the dss_load_datetime in my history table; it is used throughout the system.
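For one business key, the rule plays out like this (all values invented for illustration; the first record was end-dated 100 nanoseconds before the second batch arrived, as described below):

dss_load_datetime    | dss_start_datetime  | dss_end_datetime            | dss_current_flag
2017-11-01 06:00:00  | 0001-01-01 00:00:00 | 2017-11-02 05:59:59.9999999 | N
2017-11-02 06:00:00  | 2017-11-02 06:00:00 | 9999-12-31 00:00:00         | Y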
Views
Having a current view can speed up development later: no need to write WHERE clauses you might forget. And in WhereScape you can create an alias to another database, e.g. raw_mart.
The timeline is not needed as an extra view, since it is already contained in the history table. But to be consistent with the later layers, I added it here.
The error view is a nice add-on if you load data from users who tend to write anything into Excel and you want to be sure that it matches the later process. It doesn’t prevent anything (i.e., it doesn’t break the load), it just informs you.
Stored Procedure
Which steps do I take?
- Create cursor from dss_load_datetime
- Update expiring records
- Insert new records:
  - INSERT INTO History Table
    - SELECT Result
      - SUB-SELECT History Columns
        - SUB-SELECT Where Not Exists
          - SUB-SELECT Extract from Stage Table
Cursor
First, extract the last saved dss_load_datetime from the history table.
The cursor consists of a distinct, sorted list of the dss_load_datetime values that are equal to or greater than that last dss_load_datetime. Why equal and not just greater? If something breaks, I want to start over where it ended. And if it takes just another second to check, it is a safety check, and no duplicate data is inserted.
Especially while migrating old data with thousands of batches, loading sometimes breaks. So it is good to be able to restart where it ended instead of starting all over again. This feature is not available by default in WhereScape, so before templates existed I used views to load only the data that was not yet historized.
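A minimal sketch of this cursor logic (the variable and cursor names are mine, not the template’s):

DECLARE @v_last_load_datetime DATETIME2(7);

-- last batch that was already historized; the very first run starts at the beginning of time
SELECT @v_last_load_datetime = ISNULL(MAX(dss_load_datetime), '0001-01-01')
FROM [TABLEOWNER].[hist_demo];

DECLARE batch_cursor CURSOR FAST_FORWARD FOR
SELECT DISTINCT stage_hist_demo.dss_load_datetime
FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
WHERE stage_hist_demo.dss_load_datetime >= @v_last_load_datetime   -- >=, so a broken run can resume
ORDER BY stage_hist_demo.dss_load_datetime;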
Update expiring records
For the expiring record, it sets the dss_end_datetime to 100 nanoseconds before the next dss_start_datetime. That way we can use BETWEEN in SQL statements, and no point in time belongs to two records. I recently learned another view on that topic from Dirk Lerner, but that is for another blog post. This is currently our rule set.
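Thanks to that gap, a point-in-time query with BETWEEN returns exactly one record per business key, for example:

DECLARE @v_as_of DATETIME2(7) = '2017-11-15';   -- hypothetical reporting date

SELECT *
FROM [TABLEOWNER].[hist_demo]
WHERE @v_as_of BETWEEN dss_start_datetime AND dss_end_datetime;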
Extract from Stage Table
All the cleaning and transforming is done in stage. Now it’s time to load new data.
WHERE dss_load_datetime = @v_batch_start   -- the cursor value of the current batch
We don’t clean for duplicates; the indexes will make sure the load crashes if you hit one.
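That safeguard presumes a unique index over the business key and the start datetime; a sketch (the index name and key choice are my assumptions):

-- two identical business keys within one batch share dss_start_datetime and abort the insert
CREATE UNIQUE CLUSTERED INDEX idx_hist_demo
    ON [TABLEOWNER].[hist_demo] (shop_code, sid, request_id, dss_start_datetime);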
Where Not Exists
To find out whether a record has changed, we need to get the latest record from the history table and check whether the dss_change_hash differs.
SELECT *
FROM (sub_query) AS history_table
WHERE NOT EXISTS (
    SELECT 1
    FROM history_table AS history_table_not_exist
    WHERE history_table_not_exist.<business key columns> = history_table.<business key columns>
      AND history_table_not_exist.dss_change_hash = history_table.dss_change_hash
      AND history_table_not_exist.dss_current_flag = 'Y'
)
History Columns
WhereScape has 4 history columns on offer. If you don’t need dss_version or dss_current_flag, delete them.
dss_start_datetime:
    CASE WHEN current_rows.shop_code IS NULL
         THEN CAST('0001-01-01' AS DATETIME2(7))
         ELSE hist_demo.dss_load_datetime
    END

dss_end_datetime:
    CAST('9999-12-31' AS DATETIME2(7))

dss_current_flag:
    'Y'

dss_version:
    CASE WHEN current_rows.shop_code IS NULL
         THEN 1
         ELSE current_rows.dss_version + 1
    END
To get the last version, we need to add a LEFT OUTER JOIN that extracts the maximum version for a business key. See below.
Add-On
There is also a performance booster available if you insert history data with thousands of batches. In my experience, refreshing the indexes every 100 batches accelerates the insert performance, but only as long as the table is not too big. Delete it or use it.
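A sketch of that booster inside the batch loop (@v_batch_count is a hypothetical counter):

IF @v_batch_count % 100 = 0
BEGIN
    ALTER INDEX ALL ON [TABLEOWNER].[hist_demo] REBUILD;   -- refresh indexes every 100 batches
END;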
Future
Instead of looping through a cursor, I want to load everything in one go. I haven’t had time yet to explore and define a template for it, so you will have to wait. And it needs to be bullet-proof, because the history layer is the foundation for everything that comes afterwards.
Result
Here is the UPDATE query:
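-- end-date the expiring record 100 ns before the incoming batch and clear its current flag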
UPDATE [TABLEOWNER].[hist_demo] WITH ( TABLOCKX )
SET dss_end_datetime = DATEADD(NS,-100,stage_hist_demo.dss_load_datetime)
,dss_current_flag = 'N'
,dss_update_datetime = @v_current_datetime
FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
WHERE stage_hist_demo.dss_load_datetime = @v_batch_start
AND hist_demo.dss_current_flag = 'Y'
AND hist_demo.shop_code = stage_hist_demo.shop_code
AND hist_demo.sid = stage_hist_demo.sid
AND hist_demo.request_id = stage_hist_demo.request_id
AND (
hist_demo.dss_change_hash <> stage_hist_demo.dss_change_hash
)
Here is the INSERT query:
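-- insert records that are new or whose dss_change_hash differs from the latest current record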
INSERT INTO [TABLEOWNER].[hist_demo] WITH ( TABLOCKX )
(
shop_code
,sid
,request_id
,times
,order_id
,total_price
,dss_load_datetime
,dss_sequence_no
,dss_record_source
,dss_change_hash
,dss_start_datetime
,dss_end_datetime
,dss_current_flag
,dss_version
,dss_create_datetime
,dss_update_datetime
)
SELECT
hist_demo.shop_code AS shop_code
,hist_demo.sid AS sid
,hist_demo.request_id AS request_id
,hist_demo.times AS times
,hist_demo.order_id AS order_id
,hist_demo.total_price AS total_price
,hist_demo.dss_load_datetime AS dss_load_datetime
,hist_demo.dss_sequence_no AS dss_sequence_no
,hist_demo.dss_record_source AS dss_record_source
,hist_demo.dss_change_hash AS dss_change_hash
,hist_demo.dss_start_datetime AS dss_start_datetime
,hist_demo.dss_end_datetime AS dss_end_datetime
,hist_demo.dss_current_flag AS dss_current_flag
,hist_demo.dss_version AS dss_version
,@v_current_datetime AS dss_create_datetime
,@v_current_datetime AS dss_update_datetime
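-- SUB-SELECT History Columns: derive dss_start_datetime, dss_end_datetime, dss_current_flag and dss_version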
FROM (
SELECT hist_demo.*
,CASE WHEN current_rows.shop_code IS NULL
THEN CAST('0001-01-01' AS DATETIME2(7))
ELSE hist_demo.dss_load_datetime
END AS dss_start_datetime
,CAST('9999-12-31' AS DATETIME2(7)) AS dss_end_datetime
,'Y' AS dss_current_flag
,CASE WHEN current_rows.shop_code IS NULL
THEN 1
ELSE current_rows.dss_version + 1
END AS dss_version
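-- SUB-SELECT Where Not Exists: keep only new or changed records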
FROM (
SELECT hist_demo.*
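-- SUB-SELECT Extract from Stage Table: all rows of the current batch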
FROM (
SELECT
stage_hist_demo.shop_code AS shop_code
,stage_hist_demo.sid AS sid
,stage_hist_demo.request_id AS request_id
,stage_hist_demo.times AS times
,stage_hist_demo.order_id AS order_id
,stage_hist_demo.total_price AS total_price
,stage_hist_demo.dss_load_datetime AS dss_load_datetime
,stage_hist_demo.dss_sequence_no AS dss_sequence_no
,stage_hist_demo.dss_record_source AS dss_record_source
,stage_hist_demo.dss_change_hash AS dss_change_hash
FROM [TABLEOWNER].[stage_hist_demo] stage_hist_demo
WHERE stage_hist_demo.dss_load_datetime = @v_batch_start
) AS hist_demo
WHERE NOT EXISTS (
SELECT 1
FROM [TABLEOWNER].[hist_demo] hist_demo__not_exist
WHERE hist_demo__not_exist.shop_code = hist_demo.shop_code
AND hist_demo__not_exist.sid = hist_demo.sid
AND hist_demo__not_exist.request_id = hist_demo.request_id
AND hist_demo__not_exist.dss_change_hash = hist_demo.dss_change_hash
AND hist_demo__not_exist.dss_current_flag = 'Y'
)
) AS hist_demo
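-- current maximum dss_version per business key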
LEFT OUTER JOIN (
SELECT
hist_demo__current.shop_code AS shop_code
,hist_demo__current.sid AS sid
,hist_demo__current.request_id AS request_id
,MAX(hist_demo__current.dss_version) AS dss_version
FROM [TABLEOWNER].[hist_demo] hist_demo__current
GROUP BY
hist_demo__current.shop_code
,hist_demo__current.sid
,hist_demo__current.request_id
) AS current_rows
ON current_rows.shop_code = hist_demo.shop_code
AND current_rows.sid = hist_demo.sid
AND current_rows.request_id = hist_demo.request_id
) AS hist_demo
Of course, it would be possible to replace both steps with a MERGE statement. My experience with that is not the best: with huge tables or data sets it just takes time, and sometimes SQL Server "got irritated" about what to do.
Current View
Often we use only the current valid record. Instead of writing WHERE clauses everywhere, I just create a view on the history table and select the appropriate template. That’s it. What does it do? It adds a simple WHERE dss_current_flag = 'Y'. Not that difficult. But it will get more complex when we come to Satellites and Link Satellites.
If this view is business critical, I have also added a template for an indexed view.
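What the generated current view boils down to is roughly this (the view name is assumed):

CREATE VIEW [TABLEOWNER].[demo_current] AS
SELECT
    hist_demo.shop_code
   ,hist_demo.sid
   ,hist_demo.request_id
   ,hist_demo.times
   ,hist_demo.order_id
   ,hist_demo.total_price
   ,hist_demo.dss_load_datetime
FROM [TABLEOWNER].[hist_demo] hist_demo
WHERE hist_demo.dss_current_flag = 'Y';

For the indexed view variant, SQL Server additionally requires WITH SCHEMABINDING and a unique clustered index on the view.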
Implementation
Download the following templates for WhereScape, written in Pebble, from the template section:
- nu3_sqlserver_procedure_hist
- nu3_sqlserver_view_hist_current
To set up:
- Select the template "nu3_sqlserver_procedure_hist"
- Set the Business Key Columns
- Set the Change Detection Fields to "dss_change_hash"
That’s it.