Timeline calculation for Satellite views

My last blog post about the universal loading pattern brought a huge number of visitors to my blog. In recent months I have been working on and refining the model.

Today I would like to talk about the last part of that model: views based on satellites. Satellites are usually loaded INSERT-only; any logic interpreting the data should be deferred to a view. These are the kinds of views we have in our system:

  • First: Not used yet, but possible. Retrieves the first record per surrogate key of the satellite.
  • Last: Retrieves the last valid record. Could also be called "current".
  • Timeline: Generates a timeline for LoadDatetime and EventDatetime. The EndDate is 9999-12-31.
  • Infinity: Generates a timeline for LoadDatetime and EventDatetime. In contrast to the previous timeline, the first record in the timeline has StartDate = 0001-01-01.

Especially the last option, "infinity", has been a great help. For effectivity satellites it was important to make the timeline as wide as possible. But it also helps in other circumstances, like adding purchasing costs to a sales order line: instead of searching for the first valid purchase cost, which might be dated after the first sale, I can be sure that I always get a price.

At first, I had a configurable timeline where I could "overwrite" the first StartDate. Over time it became difficult to distinguish timelines with a wide date bandwidth from those without. To make it clearer for us, we decided to create 2 separate views where needed: a traditional timeline and a wide timeline. "Infinity" sounds great for the latter.

Let's see. In our database, satellites can have 2 date columns (they can have even more, of course, but these two are the ones I automated):

  • LoadDatetime
  • EventDatetime

Instead of using LoadDatetime as the first value to calculate timelines on, we use separate columns to represent the timelines. This has the advantage that I don't "overwrite" the LoadDatetime to represent '0001-01-01'. This is also the strategy of WhereScape, which is Data Vault compliant.

These are the names used:

  • LoadDatetime
    • StartDatetime
    • EndDatetime
  • EventDatetime
    • EffectivityDatetime
    • ExpiryDatetime

Let’s get to the code. Easy stuff first.

First & Last Value views

One could think that calculating a timeline first and then filtering for EndDatetime = '9999-12-31' would be a nice solution. It is doable, but not fast.

The fastest option I found was to calculate it like this:

SELECT satellite.*
FROM   satellite
INNER JOIN (
    SELECT hub_hash_key
          ,MAX(dss_load_datetime) AS dss_load_datetime
    FROM   satellite
    GROUP BY hub_hash_key
) AS last_rows
ON  last_rows.hub_hash_key = satellite.hub_hash_key
AND last_rows.dss_load_datetime = satellite.dss_load_datetime

Nice, easy, and fast. For the first value, replace the MAX function with MIN.
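For completeness, a sketch of the first-value view; it differs only in the aggregate (table and column names as in the example above):

SELECT satellite.*
FROM   satellite
INNER JOIN (
    SELECT hub_hash_key
          ,MIN(dss_load_datetime) AS dss_load_datetime
    FROM   satellite
    GROUP BY hub_hash_key
) AS first_rows
ON  first_rows.hub_hash_key = satellite.hub_hash_key
AND first_rows.dss_load_datetime = satellite.dss_load_datetime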

Timeline & Infinity views

Calculating just one timeline from one date column is easy. Calculating it with 2 date columns is a little trickier because of some odd but understandable behavior. A bi-temporal view will come sometime in the future, once I have grasped it.

Timeline calculation uses a lot of window functions, and execution plans are quite bad because the data is sorted back and forth.

What are the requirements:

  • dss_load_datetime/dss_event_datetime: Stays as it is
  • dss_start_datetime/dss_effectivity_datetime: Takes the datetime. For infinity views, the first value of the timeline should be represented as '0001-01-01'
  • dss_end_datetime/dss_expiry_datetime: Takes the next datetime, or '9999-12-31' if there is none

These are the PARTITION BY clauses for each timeline:

  • dss_load_datetime: The hub_hash_key of the satellite
  • dss_event_datetime: The hub_hash_key or the driving_key of the satellite

What is the driving key? The driving key is useful for effectivity satellites to force a 1:m instead of an m:n relationship. A good blog post describing it in depth can be found on Roelant Vos's blog.
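As a small illustration (a hypothetical link satellite tracking which sales rep serves a customer, with customer_hash_key assumed as the driving key):

-- Rows in the link satellite, driving key = customer_hash_key:
--   C1 -> R1, dss_event_datetime = 2024-01-01
--   C1 -> R2, dss_event_datetime = 2024-03-01
--
-- PARTITION BY customer_hash_key: R1 expires on 2024-03-01, so
-- customer C1 has exactly one valid rep at any point in time (1:m).
-- Partitioning by the full link hash key instead would keep both
-- relationships open in parallel (m:n).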

From my universal loading pattern blog post, you can see that I organize all logic in sub-queries. To calculate the timelines, I need 3 sub-queries:

  1. Sort the whole dataset by dss_load_datetime
  2. Calculate dss_start_datetime and dss_end_datetime
  3. Calculate dss_effectivity_datetime and dss_expiry_datetime

Sorting the dataset

I use this function not only in satellites but also for calculating timelines in stage tables. Sometimes a cartesian product happens and produces bad results: in a cartesian product, 2 or more rows look the same. It can then happen that the LoadDatetime timeline uses the first row while the EventDatetime timeline uses the second, resulting in both records being returned and violating a unique constraint.

Therefore, I number the whole dataset by dss_load_datetime once and use that row number in the following queries.

ROW_NUMBER() OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_row_no
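In context, the sorting sub-query could look like this (a sketch assuming a satellite with hub_hash_key, driving_key, and both date columns):

SELECT hub_hash_key
      ,driving_key
      ,dss_load_datetime
      ,dss_event_datetime
      ,ROW_NUMBER() OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_row_no
FROM   satellite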

Calculate LoadDatetime timeline

For normal timelines, just take the dss_load_datetime as dss_start_datetime. For infinity timelines, where we want to set the first dss_start_datetime to '0001-01-01', we need to figure out whether this record is the first or not.

CASE WHEN LAG(dss_load_datetime,1,NULL) 
          OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) IS NULL
     THEN '0001-01-01'
     ELSE dss_load_datetime
END

For dss_end_datetime we need the following window function:

LEAD(DATEADD(ns,-100,dss_load_datetime),1,CAST('9999-12-31' AS DATETIME2(7))) 
OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC )

As you can see, we order both queries by the dss_row_no from the previous query. As we need both window functions LEAD and LAG, SQL Server will sort the data back and forth. I haven't found another solution for that yet.

Currently I use a DATEADD function to subtract one time tick (100 ns, the smallest increment of DATETIME2(7)) to be able to use BETWEEN queries. If you don't need that, just remove the DATEADD function.
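To illustrate why the one-tick subtraction matters, here is a hypothetical point-in-time query against the resulting view (the view name is assumed):

DECLARE @pit DATETIME2(7) = '2024-06-01';

SELECT *
FROM   satellite_timeline
WHERE  @pit BETWEEN dss_start_datetime AND dss_end_datetime

Because BETWEEN includes both boundaries, a @pit equal to a record's dss_load_datetime would otherwise match that record and its predecessor. With the subtraction, each record's dss_end_datetime is one tick before the next record's dss_start_datetime, so the intervals no longer overlap.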

Calculate EventDatetime timelines

The calculation looks like the previous one, except that we can partition the data by hub_hash_key or driving_key.

In contrast to the LoadDatetime timeline, where we can be sure that we don't get any duplicates per hub_hash_key and dss_load_datetime, we can get duplicates on dss_event_datetime. So instead of sorting the data only by dss_event_datetime, we also sort by dss_load_datetime, represented by dss_row_no.

The calculation for the dss_effectivity_datetime looks like the following:

CASE WHEN LAG(dss_event_datetime,1,NULL) 
          OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC) IS NULL
     THEN '0001-01-01'
     ELSE dss_event_datetime
END

The calculation for dss_expiry_datetime looks the same as for dss_end_datetime:

LEAD(DATEADD(ns,-100,dss_event_datetime),1,'9999-12-31') 
OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC)

Again, the DATEADD function can be removed.
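Putting it all together, a full infinity view might look like this. This is a sketch assuming the table and column names used throughout this post; for brevity, the two timeline calculations (sub-queries 2 and 3) are collapsed into one SELECT on top of the sorting sub-query:

SELECT hub_hash_key
      ,dss_load_datetime
      ,dss_event_datetime
      -- LoadDatetime timeline, partitioned by hub_hash_key
      ,CASE WHEN LAG(dss_load_datetime,1,NULL)
                 OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) IS NULL
            THEN CAST('0001-01-01' AS DATETIME2(7))
            ELSE dss_load_datetime
       END AS dss_start_datetime
      ,LEAD(DATEADD(ns,-100,dss_load_datetime),1,CAST('9999-12-31' AS DATETIME2(7)))
            OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) AS dss_end_datetime
      -- EventDatetime timeline, partitioned by driving_key
      ,CASE WHEN LAG(dss_event_datetime,1,NULL)
                 OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC) IS NULL
            THEN CAST('0001-01-01' AS DATETIME2(7))
            ELSE dss_event_datetime
       END AS dss_effectivity_datetime
      ,LEAD(DATEADD(ns,-100,dss_event_datetime),1,CAST('9999-12-31' AS DATETIME2(7)))
            OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC) AS dss_expiry_datetime
FROM (
    -- Sorting sub-query: one stable row number per hub_hash_key
    SELECT *
          ,ROW_NUMBER() OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_row_no
    FROM   satellite
) AS sorted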

You see that we can end up with 4 sort operations, which can have a huge impact on performance. Consider creating 2 separate views, or persisting them (or not).

Happy sorting!