Jun 16 2018
My last blogpost about the universal loading pattern has generated a huge number of visitors to my blog. In the recent months I have been working and refining the model.
Today I would like to talk about the last part in that model. Views based on satellites. Satellites are usually built with
INSERT only. Any logic interpreting the data should be deferred to a view. What kind of views do I have in our system:
- First: Not used yet but possible, retrieves the first record of a satellites surrogate key
- Last: Retrieves the last valid record. Could also be called “current”
- Timeline: Generates a timeline for LoadDatetime and EventDatetime. EndDate is 9999-12-31.
- Infinity: Generates a timeline for LoadDatetime and EventDatetime. In contrast to the previous timeline the first record in the timeline has StartDate = 0001-01-01.
Especially the last option “infinity” has been of great help. For effectivity satellites it was important to get the timeline as wide as possible. But also in other circumstances like adding purchasing costs to a sales order line. Instead of searching for the first valid purchase cost which might be after the first sales I can be sure, that I always get a price.
At first, I had a configurable timeline where I could “overwrite” the first StartDate. Over time it became difficult to distinguish timelines with a wide date bandwidth and those without. To make it clearer for us, we decided to create 2 separate views if needed. A traditional timeline and a wide timeline. “Infinity” sounds great for that.
Let’s see. In our database, satellites can have 2 date columns (of course even more, but I automated those).
Instead of using LoadDatetime as the first value to calculate timelines on, we use own columns to represent the timelines. This has the advantage that I don’t “overwrite” the LoadDatetime to represent ‘0001-01-01’. This is also the strategy of WhereScape which is Data Vault compliant.
What are the names used:
Let’s get to the code. Easy stuff first.
First & Last Value views
One could think that calculating a timeline first and then getting the EndDatetime = ‘9999-12-31’ would be a nice solution. Still doable, but not fast.
The fastest option I found was to calculate it like that.
SELECT * FROM satellite INNER JOIN ( SELECT hub_hash_key AS hub_hash_key ,MAX(dss_load_datetime) AS dss_load_datetime FROM satellite GROUP BY hub_hash_key ) AS last_rows ON last_rows.hub_hash_key = satellite.hub_hash_key AND last_rows.dss_load_datetime = satellite.dss_load_datetime
Nice, easy and fast. For the first value replace the
MAX function with
Timeline & Infinity views
When calculating only one timeline from a date is easy. To calculate it with 2 date columns is just a little trickier because of odd but understandable behavior. A bi-temporal view comes some when in the future when I grasp it.
Timeline calculation uses a lot of window functions and execution plans are quite bad as it is sorting the data forth and back.
What are the requirements:
dss_event_datetime: Stays as it is
dss_effectivity_datetime: Takes the datetime. For infinity views it should represent the first value of the timeline as ‘0001-01-01’
dss_end_datetime: Takes the next datetime or if there is none ‘9999-12-31’
What will be the
PARTITION BY clauses for each timeline:
hub_hash_keyof the satellite
driving_keyof the satellite
What is the driving key? The driving key is useful for effectivity satellites to force a 1:m instead of a m:n relationship. A good blog post which describes it in-depth can be found on Roelant Vos’s Blog.
From my universal loading pattern blog post, you see that I organized all logic in sub-queries. To calculate the timeline, I need 3 sub-queries.
- Sorting the whole dataset by
I use this function not only in satellites but also for calculating timelines in stage tables. Sometimes a cartesian product happens and produces bad results. In a cartesian product 2 or more lines look the same. It can happen that for the LoadDatetime timeline the first row is being used and for the EventDatetime timeline the second. Resulting in both records being returned and violating a unique constrain.
Therefore, I sort the whole dataset by
dss_load_datetime and use that number in the following queries.
ROW_NUMBER() OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_row_no
Calculate LoadDatetime timeline
For normal timelines just take the
dss_start_datetime. For infinity timelines where we want to set the first
'0001-01-01', we need to figure out, if this record is the first or not.
CASE WHEN LAG(dss_load_datetime,1,NULL) OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) IS NULL THEN '0001-01-01' ELSE dss_load_datetime END
dss_end_datetime we need the following window function query:
LEAD(DATEADD(ns,-100,dss_load_datetime),1,CAST('9999-12-31' AS DATETIME2(7))) OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC )
As you can see, we order both queries with the
dss_row_no from the previous query. As we need both window function
LAG, SQL Server will sort the data forth and back. I didn’t find another solution for that yet.
Currently I use a
DATEADD function to subtract 1 time tick to being able to use
BETWEEN queries. If you don’t need that, just remove the
Calculate EventDatetime timelines
The calculation looks like the previous one. Except that we can sort the data by
In contrast to satellites where we can be sure, that we don’t get any duplicates per
dss_load_datetime we can get duplicates on
dss_event_datetime. Instead of just sorting the data by
dss_event_datetime we use also
dss_load_datetime represented by
dss_row_no for sorting.
The calculation for the
dss_effectivity_datetime looks like the following:
CASE WHEN LAG(dss_event_datetime,1,NULL) OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC) IS NULL THEN '0001-01-01' ELSE dss_event_datetime END
The calculation for the
dss_expiry_datetime looks the same as for the
LEAD(DATEADD(ns,-100,dss_event_datetime),1,'9999-12-31') OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC)
DATEADD function can be removed.
You see that we can end up with 4 sorting functions which can have a huge impact on performance. Create 2 separate views or persist them (or not).