ecki

Posts by Eckhard Zemp:

Apr. 11 2020

SQL Server locking methods or the fight against deadlocks

After the last blog post about indexes, here is another technical issue that I believe I have recently solved. At least for the next year or so, until I find a better solution.

As you know, I dropped all indexes of my data warehouse and therefore also the unique constraints on my tables. The application should make sure that everything is fine. But I have a regular report which checks whether the constraints are still intact.

Lately, after working on a project with a huge amount of data, I occasionally got hit with duplicate rows in my hubs and links.

This was the reason to investigate why it happened. It happened although I had used all the measures below. Maybe unique constraint indexes have additional functions when data is loaded. I must admit, I’m not the expert in that area. I’m educated by mistakes and experience. Therefore, this is a little summary of my lessons learnt.

Depending on the number of queries running at the same time with SELECT, UPDATE and DELETE, SQL Server decides which data is locked, read, waited for, overwritten or inserted. There are some methods to influence its decision making.

SET TRANSACTION ISOLATION LEVEL

Working with transactions is usually something for applications having SQL Server as a backend, to ensure that data spanning multiple table manipulations is written correctly to the database. In the data warehousing area I neglected transactions for some time, as I was usually transforming data from source to target, hardly ever inserting data in parallel into the same table. With Data Vault, parallel inserts are rather a usual pattern.

The transaction is written like this:

BEGIN TRANSACTION
  INSERT INTO ...
  SELECT *
  FROM ...
  OPTION(RECOMPILE)
COMMIT

By default I add OPTION(RECOMPILE) after fighting against bad execution plans.

I had this layout in place when I discovered the duplicate entries. My search for a solution led me to learn more about the Transaction Isolation Level. SQL Server has a few of them:

READ UNCOMMITTED
READ COMMITTED (default)
REPEATABLE READ
SNAPSHOT
SERIALIZABLE

I don’t want to explain all the nitty-gritty details of each option. There are plenty of resources on the web.

I suspected that although I had all those table locks (see below) in place, data got changed between an INSERT and its SELECT statement, especially in fact loading with dimension key lookups. Sounds weird, but I couldn’t help suspecting it.

Many websites state that SERIALIZABLE is the safest option, but it has a serious disadvantage: it leads to deadlocks. A deadlock occurs when 2 queries each hold a lock on a resource the other one needs. Queuing won’t help, as neither of them can continue. SQL Server terminates 1 query, telling it that it is the „victim of deadlock“. With restartable jobs this isn’t an issue. But still bad.

I added the following statement to my scripts:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE

All following transactions are then executed with that isolation level.

I was hit with a lot of deadlock errors in the following weeks. But the issue of duplicates in hubs and links was solved.
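
For restartable jobs, the restart can also happen inside the script itself. The following is only a minimal sketch of such a retry, not the setup described here: dbo.usp_load_hub_customer is a hypothetical load procedure, and the number of attempts and the delay are arbitrary assumptions. Error number 1205 is the one SQL Server raises for the deadlock victim.

```sql
-- Sketch only: dbo.usp_load_hub_customer is a hypothetical load procedure.
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

DECLARE @attempt      INT = 1;
DECLARE @max_attempts INT = 3;   -- assumption: 3 tries are enough

WHILE @attempt <= @max_attempts
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;
        EXEC dbo.usp_load_hub_customer;
        COMMIT TRANSACTION;
        BREAK;                   -- success, leave the retry loop
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

        -- anything other than a deadlock, or the last attempt, is passed on
        IF ERROR_NUMBER() <> 1205 OR @attempt = @max_attempts
        BEGIN
            THROW;
        END;

        SET @attempt += 1;
        WAITFOR DELAY '00:00:02';  -- short pause before the next attempt
    END CATCH
END
```

The retry only hides the symptom. The deadlocks themselves still cost time, which is why the rest of this post keeps looking for something better.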

Hints

Hints are another way to manipulate SQL Server’s decision engine. There are a lot of hints which we can add to a query. Some affect the whole query by manipulating the execution plan (query hints) and some affect the queried table (table hints).

Query hints are added at the end with OPTION(something). As you know, by default I add RECOMPILE to it to enforce getting the best plan for every data set instead of recycling old plans. Sometimes I add FORCE ORDER or HASH JOIN to it, depending on performance.

To combat my issue, I rather use table hints. My favorite is TABLOCKX. By default, SQL Server tries to lock as little as possible, sometimes only pages or rows of a table. But I want to be sure that no other query sneaks into a running query and manipulates something in between. TABLOCK tries to grab a „shared“ lock on the whole table, while the X makes it an exclusive lock.

My statement usually looked like this:

INSERT INTO ... WITH (TABLOCKX)
SELECT *
FROM ...

But somehow, I still got duplicates in my hubs and links. How could that be? It must be something with the SELECT query. My SELECT query also has a part in it where it checks the very same table I’m inserting data into for records that already exist. Could it be that the INSERT locks are not applied to the tables in the SELECT, although it is the same table?

Reading through other blogs, I have seen the following recipe:

INSERT INTO ... WITH (TABLOCKX)
SELECT *
FROM ... WITH (UPDLOCK)

UPDLOCK acquires an update lock, telling the system that I intend to modify the data, although in fact I won’t modify it. I just want to make sure that other queries can’t modify it while I’m reading it.

Maybe this is the way to go, maybe not. It’s hard to find something useful in the data warehousing space as usually all the questions asked are for applications and record level issues.
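
To make the recipe more concrete, here is a sketch of how it might look for a hub load. All table and column names (stage_customer, hub_customer, hub_customer_key, customer_id) are hypothetical, and whether this combination of hints is really sufficient is exactly the open question of this section.

```sql
-- Sketch only: all object and column names are hypothetical.
BEGIN TRANSACTION;

INSERT INTO dbo.hub_customer WITH (TABLOCKX)
       (hub_customer_key, customer_id, dss_load_datetime)
SELECT s.hub_customer_key
      ,s.customer_id
      ,MIN(s.dss_load_datetime)          -- one row per key, earliest LoadDate
FROM   dbo.stage_customer AS s
WHERE  NOT EXISTS (
           -- the same table again, read with its own hints
           SELECT 1
           FROM   dbo.hub_customer AS h WITH (UPDLOCK, HOLDLOCK)
           WHERE  h.hub_customer_key = s.hub_customer_key
       )
GROUP BY s.hub_customer_key
        ,s.customer_id
OPTION (RECOMPILE);

COMMIT TRANSACTION;
```

The GROUP BY takes care of duplicates inside the staging data itself, which no lock can prevent.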

While we talk about table hints, I also ran into another issue. In my pattern collection I have 3 different insert methods, each optimized for fast loading. One is for the initial load. When the table is empty, I don’t need to retrieve the last record, e.g. with Satellite loading for comparing hash values and deciding if it is a change or not. If the table is empty, just pump the data in.

IF EXISTS (SELECT 1 FROM table)
  QUERY NORMAL INSERT
ELSE
  QUERY INITIAL INSERT

So, what happens? The first query checks whether there is a record in the table. Then query 1 or query 2 is run. But what happens in the split second between the 2 queries? Or maybe there is another INSERT already running? As the first query runs in READ COMMITTED, it won’t find a record and tells the initial insert not to check for existing data.

We must modify it a little by adding table hints.

IF EXISTS (SELECT 1 FROM table WITH(UPDLOCK,HOLDLOCK))
  QUERY NORMAL INSERT
ELSE
  QUERY INITIAL INSERT

The UPDLOCK waits until any running insert has finished and then sets an update lock. Without an enclosing transaction, that lock is released again as soon as the check has finished. Inside a transaction, HOLDLOCK keeps the lock until the transaction is committed. The first statement thus sets a lock which is kept until the end.

The final layout looked like this:

IF EXISTS (SELECT 1 FROM table WITH(UPDLOCK,HOLDLOCK))
  INSERT INTO ... WITH (TABLOCKX,HOLDLOCK)
  SELECT *
  FROM ... WITH (UPDLOCK,HOLDLOCK)
ELSE
  QUERY INITIAL INSERT

I believe there are way too many locks. And again, I had a lot of deadlock issues to deal with.

Is there not something easier to work with? Maybe something like a queue?

Application Lock

There is another method in SQL Server to work with: the Application Lock. It is completely independent of any table locks.

With this lock we can invent a locking mechanism by defining an arbitrary name for a resource. As my procedures are loading data to a target, I use the target name as the resource. It has nothing to do with the underlying table and doesn’t affect it.

How to set a lock:

EXEC sp_getapplock @Resource = 'table', @LockMode = 'Exclusive', @LockOwner = 'Session'

With this statement we request a virtual lock for the resource named 'table'. If any other query or stored procedure issues the same statement, it will get into the queue and wait until the first lock gets released.

There are 2 levels to define these locks:

Session
Transaction

If we define @LockOwner = 'Transaction' the lock gets released after the commit.
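
A sketch of the transaction-owned variant, for comparison. The resource name 'hub_customer' and the timeout of 60 seconds are assumptions for illustration. sp_getapplock returns 0 or 1 when the lock is granted and a negative value otherwise.

```sql
-- Sketch only: the resource name and the timeout are example values.
BEGIN TRANSACTION;

DECLARE @lock_result INT;
EXEC @lock_result = sp_getapplock
         @Resource    = 'hub_customer'
        ,@LockMode    = 'Exclusive'
        ,@LockOwner   = 'Transaction'
        ,@LockTimeout = 60000;   -- wait at most 60 seconds (value in milliseconds)

IF @lock_result < 0
BEGIN
    ROLLBACK TRANSACTION;
    RAISERROR('Unable to acquire exclusive lock on hub_customer', 16, 1);
    RETURN;
END

-- ... INSERT / UPDATE statements for the target ...

COMMIT TRANSACTION;   -- the application lock is released together with the transaction
```

With the owner set to 'Transaction' there is no need to call sp_releaseapplock, as a commit or rollback releases the lock.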

But as I could have multiple transactions in one script, I prefer to specify it on session level. The whole script has to run through until I release the lock with the following statement at the end.

EXEC sp_releaseapplock @Resource = 'table', @LockOwner = 'Session'

My full script looks like this:

DECLARE @v_lock_code_{table} INT
EXEC @v_lock_code_{table} = sp_getapplock @Resource = '{table}', @LockMode = 'Exclusive', @LockOwner = 'Session'
IF @v_lock_code_{table} NOT IN (0,1)
BEGIN
  RAISERROR('Unable to acquire exclusive lock on {table}', 16, 1)
  RETURN
END
... a lot of code ...
EXEC @v_lock_code_{table} = sp_releaseapplock @Resource = '{table}', @LockOwner = 'Session'
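
One thing to keep in mind with session-owned locks: if the code in the middle fails, the release statement at the end is never reached and the lock stays until the session ends. A possible variation, sketched here with 'hub_customer' standing in for the {table} placeholder, wraps the work in TRY/CATCH so that the lock is released in both cases. This is an assumption about how one could harden the script, not part of the template above.

```sql
-- Sketch only: 'hub_customer' stands in for the {table} placeholder.
DECLARE @lock_result INT;

EXEC @lock_result = sp_getapplock
         @Resource  = 'hub_customer'
        ,@LockMode  = 'Exclusive'
        ,@LockOwner = 'Session';

IF @lock_result NOT IN (0, 1)
BEGIN
    RAISERROR('Unable to acquire exclusive lock on hub_customer', 16, 1);
    RETURN;
END

BEGIN TRY
    -- ... a lot of code ...
    PRINT 'loading hub_customer';
END TRY
BEGIN CATCH
    -- undo open work and release the session lock before passing the error on
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    EXEC sp_releaseapplock @Resource = 'hub_customer', @LockOwner = 'Session';
    THROW;
END CATCH

EXEC sp_releaseapplock @Resource = 'hub_customer', @LockOwner = 'Session';
```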

Summary

With this method I create locks on all data manipulation objects. The queue completes query after query, and no query sneaks into another workflow. When I finish an object, I release the lock and everything is good.

The focus is really to have a queue for every INSERT or UPDATE object.

SELECT queries or JOINs don’t need to issue this lock. The table is free to read. By the way: among the table hints above there is also one named NOLOCK, which should ignore any existing locks. But when I use it, I seldom get any data returned until all locks are released.

With application locks I could also move back to the READ COMMITTED isolation level.

Another side effect is that other, unrelated queries are no longer blocked. Even though a TABLOCKX applies only to a specific table, I have experienced blocking of other transactions and in other databases. Lately I ran a test: I stopped the whole execution engine on my test machine and started all jobs at the same time. I have never seen so much traffic on my databases, and CPU was at 100%.

I hope I now have peace of mind and no deadlocks anymore.

By ecki • Business Intelligence 0

Jan. 18 2020

How to build a Business Vault Satellite

At the last meetup I learnt that many people don’t know how to build a business vault satellite (or even that it exists!). Many are struggling to get the data into a Raw Data Vault, persisting data from the source over time.

On the market are a few solutions, which build Data Vaults nearly automatically. According to some metadata fields from the source a data vault can be built quite easily. Some are spending a lot of time tuning model conversion rules or with data modelling. Don’t get me wrong, modelling is an important part! But when the core model is defined, matching source data to it shouldn’t take long.

Most of my time is spent on implementing business rules. Persisting data in a Raw Vault needs no time.

As you have seen with my other blog posts, I like patterns. Repeatable patterns. I have heard that there are rule engines and other fancy stuff out there. So far, I can’t imagine how such a rule engine could work. Maybe somebody can show me their solution.

Developing

Before we go into the patterns, what kinds of tools are available?

Graphical interfaces such as SSIS, Informatica, Talend and others
Programming outside SQL with Python, Scala, etc.
Programming in SQL

Graphical Interface

I used to work with SSIS. Graphical interfaces are nice. Especially the look at how the data is streaming through each step is very intriguing.

But I found it difficult to use with repeatable patterns. Usually when I learnt something new about how to ingest data, I had to go through all existing loading pipelines and apply what I had learnt.

For importing data from the source and building raw vaults, it is definitely a nice way to go.

Programming outside SQL

I’m sure that Python and Scala have their use case and the patterns below can be easily adapted.

If the data platform of your choice has no compute nodes, then it is the default thing to take the data out, do something and save it again in the platform.

Programming in SQL

But as I work with SQL Server, there is no need to take the data out, do something and load it again. Just a waste of bandwidth and the compute power of the server is not used. But I understand that for bigger solutions other ways must be found and used.

When working with SQL I have several choices:

Work with Stored Procedures, lining up all calculation steps one after another
Work with Common Table Expressions, combining all calculation steps into one statement
Work with Views/Tables building on each other
Or a combination of all of the above

I use templates to create stored procedures doing all the necessary steps AFTER I create the base SQL query. Therefore, I don’t want to meddle with the generated procedures by modifying them. This would break the maintainability.

I love to work with Common Table Expressions. At least for drafts. Fantastic to build data sets upon each other and enhancing the calculation. But when it comes to debugging, it is very cumbersome to always rewrite the SELECT statement at the end extracting data from each step.

Therefore, I like a combination of Views and Tables to separate each data set and calculation step. For debugging it is perfect. Sometimes I list all elements after each other and filter on a certain key to find out what it is doing in each step and if that is the right behavior.

Feel free to use your own method. I just wanted to share my experience.

Patterns

Just a quick note: If I write about keys, I mean all possible keys in Data Vault: Hub Hash Key, Link Hash Key, Natural Keys, Business Key etc. It just defines the grain of data I’m looking at. Is it on sales order or sales item level? Or something else?

Which patterns did I find and use nowadays?

Define the target key
Define the source key
Get all source keys
Calculation
- Get the data for the calculation
  - All values, last values or Point-In-Time
- Define the LoadDate
- Calculate
Prepare Result Data Set

Example

For easier understanding, let’s take an easy example. We have the following Data Vault Model:

hub_customer
hub_sales_order
link_sales_order_item (and feel free to read it also as hub_sales_order_item)

I have a raw vault satellite attached to the hub_sales_order which has a value for discount_amount. To calculate the margin on item level, I need to distribute this amount to all sales_order_items.

Define the target key

Before I start any job, I need to define what the grain of my result set should be. Maybe I don’t yet have an idea what the calculation should look like, but I have to be sure what the target grain should be.

In this example it will be a satellite attached to link_sales_order_item.

Define the source key

Apart from the calculation this is the most important step. What key do I need as a basis?

What will be my source data key?
What will be the grain of detail for the calculation?
Does a „higher level“ key make more sense?

Sometimes it is the same grain as the target. Like when calculating the vat_amount for each sales_order_item.

In our case the level of hub_customer would be too high. The level of hub_sales_order would be just right. If just a SINGLE sales_order_item changes, I want to recalculate everything for ALL sales_order_items of a given order.

Get all source keys

When the grain of the source key is defined, I collect all source keys for which the calculation is needed.

I discovered that persisting that data into a table has the best performance.

Usually I define in my „delta“ object, what my target satellite will be. The template will look up the maximum LoadDate value from it and use that value to retrieve the newest data from the source.

Sometimes I add a time window by subtracting x days from the retrieved LoadDate to make sure that I don’t miss a beat. At the beginning of my journey I had the aspiration to have only real deltas and wondered how others are doing that when you have 2 source tables while 1 lags a little behind. By subtracting x days from the LoadDate this issue can be taken care of. The focus is on preparing the data; if there is no change, it won’t load to the target.

And sometimes I have multiple satellites to gather my source keys from.

To have such a „bucket“ of keys is really a great thing. I try to find as many rules as possible why something needs an update. The goal is to get all possible keys for which I need to do a calculation or an update of old calculations. E.g. for calculating contribution margin 3, I need to add payment costs. Payment costs are not easily attributable. Therefore a factor is usually needed. If management decides to modify it later and also for old orders, what can I do? Either truncate the satellite and re-calculate it again or add a rule! I usually do the latter. I don’t want to think about all actions others might do. It just has to work!
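
The key bucket described in this section could be filled roughly like this. It is only a sketch under assumptions: the work table wrk_sales_order_keys, the target bsat_sales_order_item_discount, the column names and the window of 3 days are all made up for illustration, and the LoadDate columns are assumed to be DATETIME2.

```sql
-- Sketch only: all object and column names are hypothetical.
DECLARE @window_days INT = 3;        -- assumption: 3 days of overlap
DECLARE @last_load   DATETIME2(7);

-- newest LoadDate already present in the target satellite
SELECT @last_load = MAX(dss_load_datetime)
FROM   dbo.bsat_sales_order_item_discount;

-- empty target: load everything, otherwise step back by the time window
SET @last_load = CASE WHEN @last_load IS NULL
                      THEN '0001-01-01'
                      ELSE DATEADD(DAY, -@window_days, @last_load)
                 END;

TRUNCATE TABLE dbo.wrk_sales_order_keys;

INSERT INTO dbo.wrk_sales_order_keys (hub_sales_order_key)
SELECT s.hub_sales_order_key
FROM   dbo.sat_sales_order_discount AS s
WHERE  s.dss_load_datetime > @last_load
UNION  -- UNION removes keys that show up in both sources
SELECT l.hub_sales_order_key
FROM   dbo.lsat_sales_order_item_amount AS ls
JOIN   dbo.link_sales_order_item        AS l
  ON   l.link_sales_order_item_key = ls.link_sales_order_item_key
WHERE  ls.dss_load_datetime > @last_load;
```

Every additional rule for why a key needs a recalculation becomes one more SELECT in the UNION.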

Calculation

As we have now all keys to work with, we can start collecting data and do the calculation.

Get Data for Calculation

Identify all satellites, reference tables and other tables where you need to extract data from.

As all my calculations are views, I create sometimes views of all my source satellites to make it visible what data I’m extracting from where and what I need. Just for easier reading for others. And sometimes not.

Then I join all those data views to my source keys. Of course, there is sometimes the case, that I join a source view later in the calculation steps.

In my example I would join now discount_amount from sat_sales_order_discount and gross_sales_amount from lsat_sales_order_item_amount.

The result data set is now complete. All source keys are available, all needed data points for the calculation are there.

All values, last values or Point-In-Time

While joining the source data I need also to answer the question, if I need all values, last values or to create a mini Point-In-Time table.

If the source key is of the same grain as the target, for example when I clean strings of bad characters, I just take the source satellite and use each row in the calculation.

If the grain is different like in our example, I usually take the last value of the source satellites. For easier immediate access I create by default for all of my satellites last value views.
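
Such a last value view could look like the following sketch. The satellite name and its columns are assumptions for illustration; the real views are generated from templates and may differ.

```sql
-- Sketch only: the satellite and its columns are hypothetical names.
CREATE VIEW dbo.sat_sales_order_discount_last
AS
SELECT ranked.hub_sales_order_key
      ,ranked.dss_load_datetime
      ,ranked.discount_amount
FROM  (
        SELECT s.hub_sales_order_key
              ,s.dss_load_datetime
              ,s.discount_amount
               -- newest row per key gets rn = 1
              ,ROW_NUMBER() OVER (PARTITION BY s.hub_sales_order_key
                                  ORDER BY s.dss_load_datetime DESC) AS rn
        FROM   dbo.sat_sales_order_discount AS s
      ) AS ranked
WHERE  ranked.rn = 1;
```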

Of course, I’m not limited to the last values only. I have also other views: Last, First, Current (excluding deleted), Timeline (from timestamp till 9999-12-31), Infinite (from 0001-01-01 till 9999-12-31).

Sometimes my users request also values being calculated for each data point in the past. This means that I create on the fly a mini point-in-time table for all LoadDates since the last load.

At the beginning, I thought that I now needed to create a PIT table. But maintaining a full-fledged PIT table is usually not possible. Sometimes I have multiple business rules sitting on top of each other. If I created the first PIT for my source satellites, would I create a new PIT for the just updated business vault satellite? And what about the next business vault satellite? Another PIT including it? Or just updating the old one? Kind of stupid. Therefore, I calculate a PIT on the fly with only the LoadDates needed in that pipeline and only with the newest LoadDates.
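
One way such an on-the-fly PIT could be sketched: collect the LoadDates since the last load from all involved satellites, then pick for each of them the row that was valid at that time. The second satellite sat_sales_order_status, all column names and the fixed value of @last_load are assumptions for illustration only.

```sql
-- Sketch only: object and column names are hypothetical;
-- @last_load would normally come from the delta step.
DECLARE @last_load DATETIME2(7) = '2020-01-01';

WITH load_dates AS (
    -- every LoadDate since the last load, per sales order, from both satellites
    SELECT hub_sales_order_key, dss_load_datetime
    FROM   dbo.sat_sales_order_discount
    WHERE  dss_load_datetime > @last_load
    UNION
    SELECT hub_sales_order_key, dss_load_datetime
    FROM   dbo.sat_sales_order_status
    WHERE  dss_load_datetime > @last_load
)
SELECT d.hub_sales_order_key
      ,d.dss_load_datetime
      ,disc.discount_amount
      ,stat.order_status
FROM   load_dates AS d
OUTER APPLY (
    -- row of the discount satellite valid at that point in time
    SELECT TOP (1) s.discount_amount
    FROM   dbo.sat_sales_order_discount AS s
    WHERE  s.hub_sales_order_key = d.hub_sales_order_key
      AND  s.dss_load_datetime  <= d.dss_load_datetime
    ORDER BY s.dss_load_datetime DESC
) AS disc
OUTER APPLY (
    -- row of the status satellite valid at that point in time
    SELECT TOP (1) s.order_status
    FROM   dbo.sat_sales_order_status AS s
    WHERE  s.hub_sales_order_key = d.hub_sales_order_key
      AND  s.dss_load_datetime  <= d.dss_load_datetime
    ORDER BY s.dss_load_datetime DESC
) AS stat;
```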

For our example it is enough to extract the last known value.

Define LoadDate

Usually my calculation is based on one main raw vault satellite. Then I continue to take THAT LoadDate for my newly created business vault satellites data.

If I load from more than one satellite, I calculate sometimes the maximum value of ALL LoadDates for a given business key.

,(
    SELECT MAX(v)
    FROM  (
             VALUES
                     (satellite_1.dss_load_datetime)
                    ,(satellite_2.dss_load_datetime)
                    ,(...)
          ) AS VALUE(v)
) AS dss_load_datetime

If a result like the current age and age_group of a customer changes independently of any source load and needs to be calculated daily, I use the current system datetime as LoadDate. If the age changes I need a new LoadDate, otherwise the new value can’t be loaded into the Satellite.

If the LoadDate is not defined correctly, updates for new results will not get imported into the target, as the primary key (hash key and LoadDate) from the old calculation already exists.

Calculate

Do whatever you want. Sometimes it is a simple calculation, sometimes many steps with intermediate result sets.

In my case, I would create a view summarizing all gross_sales_amounts of sales_order_items for each sales_order.

In the next step I would use a calculation like gross_sales_amount / sum_gross_sales_amount * discount_amount to distribute it to each sales_order_item.

If I need to round it to 2 digits, I will also have to take care of cent differences (sum of all discount_amounts of all sales_order_items versus the discount_amount on sales_order level). I would sort the rows by gross_sales_amount DESC and add the cent difference to the line with the biggest gross_sales_amount.
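
The distribution and the cent correction could be sketched as follows. The view v_sales_order_item_source is hypothetical and is assumed to deliver, per item, the order key, the item key, the gross_sales_amount and the discount_amount of its order, all amounts as DECIMAL. A quick check by hand: 3 items with 10.00 each and a discount of 10.00 give 3.33 per item, which sums up to 9.99, so the first line receives the missing 0.01 and ends up with 3.34.

```sql
-- Sketch only: v_sales_order_item_source and its columns are hypothetical.
WITH distributed AS (
    SELECT hub_sales_order_key
          ,link_sales_order_item_key
          ,discount_amount
           -- share of the order discount, rounded to 2 digits
          ,ROUND(gross_sales_amount * discount_amount
                 / NULLIF(SUM(gross_sales_amount)
                              OVER (PARTITION BY hub_sales_order_key), 0)
                ,2) AS item_discount_amount
           -- rn = 1 marks the line with the biggest gross_sales_amount
          ,ROW_NUMBER() OVER (PARTITION BY hub_sales_order_key
                              ORDER BY gross_sales_amount DESC
                                      ,link_sales_order_item_key) AS rn
    FROM   dbo.v_sales_order_item_source
)
SELECT hub_sales_order_key
      ,link_sales_order_item_key
      ,item_discount_amount
       + CASE WHEN rn = 1
              -- cent difference between order level and the sum of all items
              THEN discount_amount
                   - SUM(item_discount_amount)
                         OVER (PARTITION BY hub_sales_order_key)
              ELSE 0
         END AS item_discount_amount_adjusted
FROM   distributed;
```

The NULLIF avoids a division by zero for orders without any gross sales; those rows end up with NULL and need their own rule.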

Prepare Result Data Set

If the calculation is looking correct and I believe I have finished, I select those columns which should be loaded to the business vault satellite, calculate the hash_diff and load the data set to the target. Sometimes I also calculate new hash_keys to create same-as-links or other data vault objects.

Before continuing, take a break and congratulate yourself for the fantastic job you did. Of course, it is 100% error free.

Bonus Task

Now that you have your results saved, think about ways to ensure that the result is always correct. This is where the Error Mart comes into play.

What can you check:

Do the same task again but in a different way and compare both results
Compare summary of amounts per key of source and target satellite (based on last view)
Compare counts of rows per key of source and target satellite (based on last view)
Define keys which should have a result and check if the keys exist in the business vault satellite
Etc.

Make it a habit that each business vault satellite has at least 1 QA view.

In our case we collect them in a separate database. We placed it next to the reports database. In my opinion it should be rather named as error reports than error mart. And I have error reports for Data Stores, Data Vault, Data Mart and Exports.

The created QA view shows only the faulty keys or rows. Every day a job runs to count the rows in each QA view and reports a summary for all checks. Then we take action and fix the issues.
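
As an illustration of the second check in the list above, here is a sketch of a QA view comparing the discount on order level with the sum of the distributed discounts. All object and column names are hypothetical, and both „last“ views are assumed to carry the hub_sales_order_key.

```sql
-- Sketch only: builds on hypothetical "last" views of source and target satellite.
CREATE VIEW dbo.qa_bsat_sales_order_item_discount_sum
AS
SELECT src.hub_sales_order_key
      ,src.discount_amount           AS source_discount_amount
      ,tgt.sum_item_discount_amount  AS target_discount_amount
FROM   dbo.sat_sales_order_discount_last AS src
LEFT JOIN (
        SELECT hub_sales_order_key
              ,SUM(item_discount_amount) AS sum_item_discount_amount
        FROM   dbo.bsat_sales_order_item_discount_last
        GROUP BY hub_sales_order_key
     ) AS tgt
  ON   tgt.hub_sales_order_key = src.hub_sales_order_key
-- only faulty keys are returned: missing in the target or amounts differ
WHERE  tgt.hub_sales_order_key IS NULL
   OR  tgt.sum_item_discount_amount <> src.discount_amount;
```

An empty result means the check passed, which fits the daily job that only counts rows per QA view.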

Summary

You have seen my way of building business vault satellites. I hope you agree with the main patterns. If you have more patterns, let me know.

I have heard of business rules engines, but never seen one. If you have one, get in touch. I would be happy to learn more about it.

Here is also a picture of such a transfer (another case):

By ecki • Business Intelligence 0

Aug. 30 2019

Universal Loading Pattern Reloaded

There was quite some noise, when I published last year my blog post about Universal Loading Pattern, a common view how to process data to any target. Be it a Data Vault Hub or a Fact table.

Since then I have worked with this pattern happily without any incidents.

Some time ago I thought, well, are these patterns still true? Did I forget a pattern? Or has my view of these patterns changed? This falls fully in line with being smart according to Jeff Bezos. Before I read it, I fixed stuff „stealthily“. I didn’t want to appear not to have known it beforehand.

I went back to the drawing board and pondered my patterns. And then I waited… until some inspiring thoughts appeared. On the way to work, in a lunch break, at night in bed (where I should rather sleep than think about issues) or while feeding my newborn baby.

And I came up with some changes and surprises. And I believe that the following patterns are so easy to understand that it is a shame that I didn’t come up with them earlier. But that’s usual. First digging into it, exploring the edges, finding common ground to work with and then summing up.

In my last post I split up the pipeline into 2 parts for easier consumption. In one of my meetups I learnt that to view it that way is not common. It was caused by the way how WhereScape works.

Here I present the enhanced vendor neutral Universal Loading Pattern. Before we start, do you think Data Vault, Dimensional Modeling or Anchor Modelling are so different? They are just flavors to work with data! I believe I can load them all with this pattern.

Source Query Automation

Here are the bits and pieces that help me with writing a query and taking care of all those nifty little details. But for that to work, we need a query.

Extract & Transform

This is the place where my input to the processing of my data starts. A simple SELECT statement with FROM, JOINs and WHERE clauses. If I’m happy with the content and the calculation, I use it.

Now the interesting stuff starts.

Delta Loading

For any of my targets I try to load and process only delta data. But I’m too tired to write that in my origin query. And I have an option implemented, to do a full load if I want to. I need to be flexible.

If I structure all the layers as sub queries, I can specify on the next layer just the following statement. SQL Server will still make the best of it and will apply the filter already to the origin query.

WHERE dss_load_datetime > (SELECT COALESCE(MAX(dss_load_datetime),'0001-01-01') FROM [Target])

This works best for views. If I have a stored procedure, I would load the result of the above statement into a variable @last_dss_load_date and use it instead. A lot faster.
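
A sketch of the stored procedure variant with the variable. The names dbo.target_satellite and dbo.v_source_query are hypothetical, and dss_load_datetime is assumed to be a DATETIME2 column, otherwise '0001-01-01' would not be a valid value.

```sql
-- Sketch only: dbo.target_satellite and dbo.v_source_query are hypothetical names.
DECLARE @last_dss_load_date DATETIME2(7);

-- look up the high-water mark of the target once
SELECT @last_dss_load_date = COALESCE(MAX(dss_load_datetime), '0001-01-01')
FROM   dbo.target_satellite;

SELECT src.*
FROM  (
        -- origin query (Extract & Transform) sits here as a sub query
        SELECT *
        FROM   dbo.v_source_query
      ) AS src
WHERE  src.dss_load_datetime > @last_dss_load_date;
```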

Source to Target Data Type mapping

This layer is only important if you care about data type consistency. Like in programming when you define variables with a specific type.

The mapped columns of source and target don’t necessarily have to be of the same data type. And if I have transformations in my source query, it is hardly known which data type I just created, unless I specify it in the source query. But I’m lazy about that. The system should fix that.

Any platform tries to avoid producing errors while inserting data into a table. They silently ignore the fact that the data types don’t match. This is called implicit conversion. Have a look at the rules for what SQL Server does.

In this layer I create an explicit data type conversion. If I have transformations, I CAST it to the target data type. If the source and the target data type don’t match, I use some of the rules from the implicit conversion where I believe they are safe. Anything else throws an error while creating a view or stored procedure.
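
What such a generated layer might look like, as a sketch: every column of the previous layer is cast to the data type of its target column. The column names and target data types here are made up for illustration.

```sql
-- Sketch only: column names and target data types are hypothetical.
SELECT CAST(src.customer_id        AS VARCHAR(20))   AS customer_id
      ,CAST(src.order_date         AS DATE)          AS order_date
      ,CAST(src.gross_sales_amount AS DECIMAL(18,2)) AS gross_sales_amount
FROM  (
        -- previous layer, e.g. the delta filter on top of the origin query
        SELECT customer_id, order_date, gross_sales_amount
        FROM   dbo.v_source_query
      ) AS src;
```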

Business Keys

Last time I was thinking, yeah, a business key is just a business key. Yeah, there are some rules. I know that they are important. But I didn’t know that they are THAT important.

This time I want to spend more time elaborating about business keys.

Business Keys are really fundamental! A table without business keys is really hard to understand and work with. Until now I have been cleaning business keys only for hubs as suggested by Dan Linstedt. If we load data later to dimensions and facts, the business keys for them are already cleaned.

Sometimes I don’t build a data vault model but go straight from my Data Store (=Persistent Staging Area) to Dimensions & Facts. Having a Data Store gives me the freedom to do that and to add a Data Vault later, when the business rules are getting too complex and loading time is slow.

I changed my view that not only data vault objects should have their business keys cleaned, but all my tables: Data Stores, Hubs, Links, Satellite, Dimensions and Facts.

But wait, are we not going to change the data it contains? Shouldn’t the Data Store be equal to the source system? If the source system’s CHAR column has mixed cases, shouldn’t we preserve it?

I make strong distinctions of business key columns and data columns. All columns of a source table are by definition data columns. I would derive my business key column from a data column. Business Key columns are here to ensure uniqueness in my target objects. But if the business key column is of type INTEGER, there is no need to duplicate it. If you want to preserve a mixed-case CHAR column, duplicate and preserve it.

Just a side note: If the mixed-case business key column should be preserved, we would have anyway a problem. Usually database joins don’t make a distinction of mixed cases. Then I would need to specify a collation with „ignore cases“ being turned off or using a binary collation.
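
For illustration, a join that does distinguish cases could be sketched like this, using the binary collation Latin1_General_BIN2. Table and column names are hypothetical.

```sql
-- Sketch only: table and column names are hypothetical.
SELECT s.customer_code
      ,h.hub_customer_key
FROM   dbo.stage_customer AS s
JOIN   dbo.hub_customer   AS h
       -- binary collation: 'abc' and 'ABC' are treated as different keys
  ON   h.customer_code COLLATE Latin1_General_BIN2
     = s.customer_code COLLATE Latin1_General_BIN2;
```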

Cleaning

So, how are business keys cleaned? The goal is, to get the value as close as possible to NULL. So far I only clean CHAR columns. Remove any spaces. If the data in the column is entered manually by a user and the application doesn’t check for clean data, I must even search for tabs, returns and other strange characters. The goal is to get the essence of the column value. Only human readable characters. A CHAR column should also avoid having mixed cases. I set them all upper case.

My statement is: NULLIF(UPPER(LTRIM(RTRIM(column))),'')

The benefit of trying to NULLify everything is, that the business key column has a constraint of NOT NULL. Therefore, if I hit a NULL it will crash and I will be safe of bad data.
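
If tabs and returns have to go as well, the statement could be extended as in the following sketch. The inline VALUES are just test data; which characters count as „strange“ depends on the source, so the three REPLACE calls are an assumption.

```sql
-- Sketch only: the VALUES rows are test data for illustration.
SELECT raw_value
      ,NULLIF(UPPER(LTRIM(RTRIM(
           -- strip tab, carriage return and line feed before trimming
           REPLACE(REPLACE(REPLACE(raw_value, CHAR(9), ''), CHAR(13), ''), CHAR(10), '')
       ))), '') AS business_key
FROM  (VALUES (' abc' + CHAR(9))   -- becomes 'ABC'
             ,('   ')              -- becomes NULL
             ,(NULL)               -- stays NULL
      ) AS t(raw_value);
```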

Usually people see the business key cleaning and the following zero key replacement as a minor distinction. In this step I focus on the cleanliness of business keys. If the lineage from the source system via a data store and now to the target is kept, I clean it only once!

Zero Key Replacement

Here I „complete“ unknown data to ensure the working of my target. Data Stores, Hubs, Links, Satellites, Dimensions, Facts have different requirements. And all have Primary Keys. Primary Key values with NULL don’t work. And asking somebody to bring me product „NULL“ usually confuses people.

In the above step we might hit now a business key having a value of NULL. What do we do with it when the target tells me that it shouldn’t be NULL?

The book says for Data Vault to replace it with „-1“. „-1“ is just the value to represent „unknown“.

Let’s take a step back. Is Data Vault the only model which has NULL values? Fact tables referencing Dimensions could have NULL values too. Can we formulate a rule for that?

I used to use „-1“ everywhere in my templates. But I got into problems because of different loading patterns. The right thing to do is to look up the value in the linked object’s business key metadata. Hubs and Dimensions are the objects for which we can define zero keys. Any other object refers to them to find the right zero key. In WhereScape I can define a zero key for any dimensional column. I will look up that value and use it as a zero key. For hubs it doesn’t exist, but I would do that too, if it existed.

Here are my self-defined „unknown“ replacement values, if the linked object’s business key has no zero key defined:

Data Type	Zero Key
CHAR	`ISNULL(source_column,'-1')`
INT	`ISNULL(source_column,-1)`
BIT	`ISNULL(source_column,0)`
DECIMAL	`ISNULL(source_column,0)`
DATE	`ISNULL(source_column,'0001-01-01')` Unknown LoadDates are getting `'0001-01-01'` so they will update when a record appears. In the Date Dimension I use `'9999-12-31'` for the zero key, so it is shown next to the newest data.
TIME	`ISNULL(source_column,'00:00:00')`

As you can see, for BIT and TIME the unknown value falls into normal values. It is not possible to find values outside the scope. DECIMALs are used for money values, quantities etc. Having a „-1“ is very dangerous.

Roelant Vos has written a blog post about more flavors of unknown. Maybe this is something more for your liking.

Having 2 distinct rule sets for cleaning business keys, we could still hit a NULL value. How could that be? My fact tables are having also business keys defined. But not all business key columns have a related dimension. E.g. a line number of a sales order.

The result is now, that we have nice-looking business keys. Great!

Surrogate Key

In my previous model I had a layer to calculate hash keys and a layer to lookup dimension keys.

While pondering over the layers, a thought appeared: What do hash keys and dimension keys have in common? And how does a natural key fit into this picture?

Actually, they are all surrogate keys. A surrogate key is a key to represent a business key for further usage. To make it easier and faster.

But what type of surrogate keys do exist? Wikipedia lists the following characteristics for a surrogate key:

the value is system generated
the value is not manipulable by the user or application
the value contains no semantic meaning
the value is not visible to the user or application

Surrogate keys in practice just lists sequence keys with different data types (INT, UUID, GUID, etc.) as valid options.

I would say, Dan Linstedt invented a new surrogate key in the form of a string.

When searching the web, I didn’t find more types of surrogate keys. So, what are the options:

System-generated Surrogate Key (e.g. Sequence Key)

If we have a Fact object as a target, we want to join Dimensions to it. By default, we can join on business keys saved in both objects. If Dimensions and Facts are Views on top of a Data Vault, this will definitely be the case, as it is impossible to include a system-generated sequence key in a view.

If we have tables, we can gain some performance when we add an auto-generated sequence key to the dimension and use that value in the fact object. Instead of a multi-column CHAR join, we now get a really fast join on a four-byte INTEGER.

Theoretically you can also use UUID or GUID or whatever data type is auto generated in your system. But INTEGER is the fastest option.

This was also the idea and option in Data Vault 1.0. Being really fast on joins. There the surrogate key of hubs and links is also a sequence key. The systems used at the beginning of this century were not as fast as today’s systems. It made sense to have something like that.

User-generated Surrogate Key (e.g. Natural Key for Hubs and Links)

Instead of looking up the sequence key, Dan Linstedt improved the model to load data in parallel to the target. Everybody is speaking about the variation which consists of Hash Keys and so did I.

But in the last year I learnt more about the patterns behind it. There are platforms for which a hash key is not recommended (e.g. Teradata has its own mechanism to treat business keys).

Therefore, it makes sense to focus more on the part which is shared across all platforms and on what Dan really meant by the improvement.

Instead of having a sequence key we would need a concatenated string of all business keys separated by a separator. Just a string without hashing.

It took me some time to understand what the implications of that are. How it is saved on a hub, how it moves to the link to form the natural link key which then is used in satellites. All as strings.

This confirmed my resolution to see this as the real surrogate key for hubs and links. A hashed version of it is just for saving space, having a consistent data type (e.g. BINARY(16) for MD5) and faster joins.
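To make the difference between the natural key and its hashed version concrete, here is a small sketch. The columns `shop_code` and `customer_id` and the separator `|` are assumptions for the example, not a fixed rule.

```sql
-- Hypothetical business key columns; names and sizes are assumptions.
DECLARE @stage TABLE (
    shop_code   VARCHAR(10),
    customer_id VARCHAR(20)
);

INSERT INTO @stage VALUES ('DE', '1234'), ('CH', '1234');

SELECT shop_code,
       customer_id,
       -- Natural key: all business key columns joined by a separator.
       CONCAT(shop_code, '|', customer_id) AS natural_key,
       -- Optional hashed representation with a fixed size of 16 bytes.
       CAST(HASHBYTES('MD5', CONCAT(shop_code, '|', customer_id)) AS BINARY(16)) AS hash_key
FROM   @stage;
```

The separator matters: without it, the key parts `'AB'` + `'C'` and `'A'` + `'BC'` would produce the same string.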

Although in Dimensions a sequence key is usually used, you could also use natural keys. Then data can be inserted into Dimensions and Facts in parallel, as in Data Vault. Do you see the intriguingly similar pattern? Dan Linstedt has improved the dimensional model into the new model Data Vault. The patterns are the same. And I believe when I dig into Anchor Modeling, I will discover the same loading pattern as shown here. Nothing new. To improve your fact and dimension views on top of a Data Vault, generate a natural key as a dimension key. That should give better join performance.

This is truly a Universal Loading Pattern. Give me your flavor of modelling and I will load it.

To summarize, surrogate keys are just representing the business key and are used to improve performance and to make it easier to work with multi-column business keys.

Data Column

There is nothing we can or should do with data columns. All manipulations are in the origin source query in the first section of the pipeline.

Calculate Change Diff

The only thing we can do is to improve the loading speed to the target. While inserting the data we need to know if a row is a change to the existing row or not.

We can compare data in a huge WHERE clause, where we compare each column with the target.

Or we concatenate all column values into one string. Then the WHERE clause is just a one-liner.

This pattern can be used for all objects such as Data Store, Data Vault and Data Mart.

Of course, saving a huge string is a waste of space. The next section will help with that.
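The two ways of detecting a change can be sketched side by side. Table and column names are invented for the example.

```sql
-- Hypothetical stage and target with two data columns.
DECLARE @stage  TABLE (customer_id INT, name VARCHAR(50), city VARCHAR(50));
DECLARE @target TABLE (customer_id INT, name VARCHAR(50), city VARCHAR(50));

INSERT INTO @target VALUES (1, 'Anna', 'Bern'), (2, 'Ben', 'Basel');
INSERT INTO @stage  VALUES (1, 'Anna', 'Bern'), (2, 'Ben', 'Zurich'), (3, 'Cleo', 'Chur');

-- Variant 1: compare every data column separately.
SELECT s.customer_id, s.name, s.city
FROM   @stage AS s
LEFT JOIN @target AS t ON t.customer_id = s.customer_id
WHERE  t.customer_id IS NULL
   OR  s.name <> t.name
   OR  s.city <> t.city;

-- Variant 2: concatenate all data columns and compare one string.
SELECT s.customer_id, s.name, s.city
FROM   @stage AS s
LEFT JOIN @target AS t ON t.customer_id = s.customer_id
WHERE  t.customer_id IS NULL
   OR  CONCAT(s.name, '|', s.city) <> CONCAT(t.name, '|', t.city);
```

Both queries return customers 2 and 3. Note that `CONCAT` treats NULL as an empty string, so a NULL and an empty value count as „no change“ in variant 2, while variant 1 would need extra NULL handling.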

Hashing

I thought hashing is the most important part in Data Vault. Actually, we need it only because of technical deficiencies of the platform.

In the pipeline we have now data columns, cleaned business key columns and surrogate keys.

Now comes the part to improve the performance (or decrease it, as hashing takes its toll on speed too). Hashing is only there to improve things. Not because we need it! If things get bad, don’t hash.

Instead of using natural keys as long strings, we could calculate and use a „representation“ value of that. A representation! Hashing would be an ideal way to accomplish that, as it transforms data of arbitrary size into a fixed size.

We can hash only the natural keys and change diffs.

That’s it! Nothing more to write. Stupid boring hashing.

By the way, have you noticed the multiple use of „representation“? A business key is represented by a surrogate key which is further represented by a hash value. Just to make it easy and fast to work with data.

UNION ALL

So far, we talked only about one data set, one source query. What about having multiple data sets which are loaded to the same destination?

Here is the layer to merge them all into one data set before continuing to the next steps.

Row Condensing

All steps before were source oriented. Now we change our view to the target. What do we need to do to load data to the target?

Which data is necessary to preserve and what can we omit?

Removing Duplicates

Depending on the target object type, we can filter out now unneeded data. The target object tells us what it needs.

Target Object	Rule
Data Store	We need only changes in rows partitioned by Business Key and sorted by LoadDate and SequenceNo
Hub	We need only the first row partitioned by Hub Hash Key and sorted by LoadDate
Link	We need only the first row partitioned by Link Hash Key and sorted by LoadDate
Satellite	We need only changes in rows partitioned by Surrogate Key and sorted by LoadDate (and sorted by EventDate, if we don't need to persist every event)
Dimension	Dimensions should be already unique. No condensing.
Fact	Facts should be already unique. No condensing.

Ok, people are using DISTINCT for hub loading. But this works only for a single batch, a single LoadDate. If you load a full data set as I do, we somehow need to get exactly 1 row out per key. Sorting the rows and taking the first one is the easiest way.

If you want to learn more about row condensing and more advanced forms, check Roelant’s blog.
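The rules for hubs and satellites from the table above could be sketched with window functions like this. The stage layout and the sample values are assumptions.

```sql
-- Hypothetical stage with a hash key, a LoadDate and a change hash.
DECLARE @stage TABLE (
    hub_customer_key  BINARY(16),
    dss_load_datetime DATETIME2(3),
    dss_change_hash   BINARY(16)
);

INSERT INTO @stage VALUES
    (0x01, '2020-01-01', 0xAA),
    (0x01, '2020-01-02', 0xAA),   -- same content again: no change
    (0x01, '2020-01-03', 0xBB),   -- content changed
    (0x02, '2020-01-02', 0xCC);

-- Hub: only the first row per hash key, sorted by LoadDate.
SELECT hub_customer_key, dss_load_datetime
FROM  (SELECT hub_customer_key, dss_load_datetime,
              ROW_NUMBER() OVER (PARTITION BY hub_customer_key
                                 ORDER BY dss_load_datetime) AS row_no
       FROM   @stage) AS s
WHERE  row_no = 1;

-- Satellite: only rows whose change hash differs from the previous row.
SELECT hub_customer_key, dss_load_datetime, dss_change_hash
FROM  (SELECT hub_customer_key, dss_load_datetime, dss_change_hash,
              LAG(dss_change_hash) OVER (PARTITION BY hub_customer_key
                                         ORDER BY dss_load_datetime) AS previous_hash
       FROM   @stage) AS s
WHERE  previous_hash IS NULL
   OR  previous_hash <> dss_change_hash;
```

The hub query keeps 2 rows, the satellite query keeps 3 and drops the unchanged row from 2020-01-02.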

Compare last saved row with first new row

When we load data incrementally to Data Stores or Satellites, it could happen, that a particular row with the same change hash is already present in the target.

Therefore, we can filter out all already existing rows in the target.

We now have a clean and relevant data set, ready to get inserted.
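A sketch of the comparison between the last saved row and the first new row, assuming the new rows have already been condensed. All names are hypothetical.

```sql
-- Hypothetical satellite (target) and already condensed new rows.
DECLARE @sat TABLE (hub_customer_key INT, dss_load_datetime DATETIME2(3), dss_change_hash BINARY(16));
DECLARE @new TABLE (hub_customer_key INT, dss_load_datetime DATETIME2(3), dss_change_hash BINARY(16));

INSERT INTO @sat VALUES (1, '2020-01-01', 0xAA);
INSERT INTO @new VALUES
    (1, '2020-01-02', 0xAA),   -- equals the last saved row: skip
    (1, '2020-01-03', 0xBB),   -- real change: keep
    (2, '2020-01-02', 0xCC);   -- key not in target yet: keep

SELECT n.hub_customer_key, n.dss_load_datetime, n.dss_change_hash
FROM  (SELECT hub_customer_key, dss_load_datetime, dss_change_hash,
              ROW_NUMBER() OVER (PARTITION BY hub_customer_key
                                 ORDER BY dss_load_datetime) AS row_no
       FROM   @new) AS n
OUTER APPLY (SELECT TOP (1) t.dss_change_hash
             FROM   @sat AS t
             WHERE  t.hub_customer_key = n.hub_customer_key
             ORDER  BY t.dss_load_datetime DESC) AS last_saved
WHERE  n.row_no > 1                                  -- later rows are changes among the new rows
   OR  last_saved.dss_change_hash IS NULL            -- nothing saved yet for this key
   OR  last_saved.dss_change_hash <> n.dss_change_hash;
```

Only the first new row per key is compared with the target. Later rows were already compared with each other during row condensing.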

Timeline Calculation

Data Vault evolved in the meantime into an insert-only model. All timeline calculations are deferred to views generated on top of a satellite.

As you know, I try to make my work as easy as possible. For creating views, I use this very same pipeline we have been discussing so far. If any of my targets has any of the specific timeline columns, it kicks in. No matter if the object is a table or a view.

Therefore, let’s discuss, how a timeline is built.

I maintain 2 timelines. One is based on the LoadDate, the other on EventDate.

For LoadDate I have the following columns:

dss_start_datetime
dss_end_datetime
dss_is_current

For EventDate I have the following columns:

dss_effectivity_datetime
dss_expiry_datetime

Currently I have no template to build bi-temporal objects. Hopefully… some day…

In a view we just use the calculation. In a table, if one of the following columns exists, we need to UPDATE outdated rows after INSERTing new data:

dss_end_datetime
dss_is_current
dss_expiry_datetime
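For the view case, the LoadDate timeline could be calculated roughly like this. The satellite layout is an assumption, and `'9999-12-31'` is used as the open end date.

```sql
-- Hypothetical insert-only satellite.
DECLARE @sat TABLE (
    hub_customer_key  INT,
    dss_load_datetime DATETIME2(3),
    city              VARCHAR(50)
);

INSERT INTO @sat VALUES
    (1, '2020-01-01', 'Bern'),
    (1, '2020-02-01', 'Basel'),
    (2, '2020-01-15', 'Chur');

SELECT hub_customer_key,
       city,
       dss_load_datetime AS dss_start_datetime,
       -- End date = start date of the next row, or the open end date.
       LEAD(dss_load_datetime, 1, CAST('9999-12-31' AS DATETIME2(3)))
           OVER (PARTITION BY hub_customer_key
                 ORDER BY dss_load_datetime) AS dss_end_datetime,
       -- The row without a successor is the current one.
       CASE WHEN LEAD(dss_load_datetime)
                     OVER (PARTITION BY hub_customer_key
                           ORDER BY dss_load_datetime) IS NULL
            THEN 1 ELSE 0 END AS dss_is_current
FROM   @sat;
```

The EventDate timeline would follow the same pattern, ordered by `dss_event_datetime` and producing `dss_effectivity_datetime` and `dss_expiry_datetime`.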

WHERE (NOT) EXISTS

Now comes the last test. If the data to get inserted already exists, I don’t want to cause any primary key violation.

What are the rules to detect existing rows:

Target Object	Unique Constraint Columns
Data Store	Business Key, Load Date and Sequence No
Hub	Business Key or Natural Key
Link	Surrogate Keys from Hubs
Satellite	Surrogate Key (from Hub or Link) and LoadDate
Dimension	Business Key
Fact	Business Key

The unique constraint can be created with the table and this is the true check for existing data.

But… database systems might not behave as expected. In a WHERE clause, a database system such as SQL Server might match characters that it considers equivalent. It will not find a difference between „ss“ and „ß“. Set a binary collation in the WHERE clause or create the Business Key CHAR columns with a binary collation.

Don’t use hash keys for this comparison. An accidental hash collision would be silently skipped, and you would be robbed of the story that you found one. Compare on the business key instead; then the primary key constraint violation on the hash key will tell you that you found one. I believe winning the lottery is easier than finding a hash collision.

If you load Hubs and Links separately from Satellites (like here), this step can happen already after the Business Key cleaning. No need to go into the slow hashing function.
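A sketch of the existence check on the business key with a binary collation set in the WHERE clause. The table variables and the collation name `Latin1_General_BIN2` are choices made for this example.

```sql
-- Hypothetical hub and stage with a text business key.
DECLARE @hub   TABLE (customer_name NVARCHAR(50));
DECLARE @stage TABLE (customer_name NVARCHAR(50));

INSERT INTO @hub   VALUES (N'Strasse');
INSERT INTO @stage VALUES (N'Straße'), (N'Strasse');

-- With a binary collation the comparison is done on the exact characters,
-- so N'Straße' is NOT treated as equal to N'Strasse'.
SELECT s.customer_name
FROM   @stage AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   @hub AS h
                   WHERE  h.customer_name = s.customer_name COLLATE Latin1_General_BIN2);
```

With the binary collation only `Straße` comes back as a new key. Without the `COLLATE` clause the result depends on the collation of the database, and the row may be treated as already existing.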

INSERT, UPDATE, DELETE

Now that we know whether the data already exists or not, we can fire the commands to change the data.

Target Object	NOT EXISTS	EXISTS
Data Store	`INSERT`	Skipped. There can't be any changes in data.
Hub	`INSERT`	Skipped
Link	`INSERT`	Skipped
Satellite	`INSERT`	Skipped. A business vault satellite should rather be truncated and reloaded, if a business rule changes.
Dimension	`INSERT`	`UPDATE`
Fact	`INSERT`	Nothing, `DELETE` & `INSERT` or `UPDATE` (depending on the way you work with facts and what is faster)
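For a Dimension, the two branches of the table above could look like this. It is only a sketch with invented names; a real dimension has more columns.

```sql
-- Hypothetical dimension and stage.
DECLARE @dim   TABLE (customer_id INT PRIMARY KEY, city VARCHAR(50));
DECLARE @stage TABLE (customer_id INT, city VARCHAR(50));

INSERT INTO @dim   VALUES (1, 'Bern');
INSERT INTO @stage VALUES (1, 'Basel'), (2, 'Chur');

-- EXISTS: update rows whose data has changed.
UPDATE d
SET    d.city = s.city
FROM   @dim AS d
JOIN   @stage AS s ON s.customer_id = d.customer_id
WHERE  d.city <> s.city;

-- NOT EXISTS: insert rows with a new business key.
INSERT INTO @dim (customer_id, city)
SELECT s.customer_id, s.city
FROM   @stage AS s
WHERE  NOT EXISTS (SELECT 1 FROM @dim AS d WHERE d.customer_id = s.customer_id);

SELECT customer_id, city FROM @dim ORDER BY customer_id;
```

For Hubs, Links and Satellites only the second statement is needed, because existing rows are skipped.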

Persistent Points

In the diagram above you see some points in the pipeline. The whole pipeline could work without interruption if your database system can handle all the rules in a timely manner.

I discovered that persisting data into a temp table before hashing increases performance. SQL Server sometimes behaves oddly: it calculates 100m hashes and only then takes the newest 100.000 in a delta load. Very bad.

The next point is the turning point from source to target. Having a temporary object helps with loading several targets such as Hubs, Links and Satellites in parallel.

It can help also with the next step: To Update timelines.

Post-Process Update Timeline

This step is only needed, if we have persisted timeline end dates. The end date of an existing record changes from ‚9999-12-31‘ to the start date of the new record.

If we have a temporary object from above, we only need to update the changed timelines for a given Business Key. All others can be skipped.
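A sketch of this post-process step, assuming one new row per Business Key in the temporary object. Table and column names other than the `dss_` columns are invented.

```sql
-- Hypothetical target with a persisted timeline and a temporary object with new rows.
DECLARE @target TABLE (
    business_key       INT,
    dss_start_datetime DATETIME2(3),
    dss_end_datetime   DATETIME2(3),
    dss_is_current     BIT
);
DECLARE @new TABLE (business_key INT, dss_start_datetime DATETIME2(3));

INSERT INTO @target VALUES
    (1, '2020-01-01', '9999-12-31', 1),
    (2, '2020-01-01', '9999-12-31', 1);
INSERT INTO @new VALUES (1, '2020-02-01');

-- INSERT the new rows with an open end date.
INSERT INTO @target (business_key, dss_start_datetime, dss_end_datetime, dss_is_current)
SELECT business_key, dss_start_datetime, '9999-12-31', 1
FROM   @new;

-- UPDATE only the outdated rows of the Business Keys that received new data.
UPDATE t
SET    t.dss_end_datetime = n.dss_start_datetime,
       t.dss_is_current   = 0
FROM   @target AS t
JOIN   @new    AS n ON n.business_key = t.business_key
WHERE  t.dss_end_datetime   = '9999-12-31'
  AND  t.dss_start_datetime < n.dss_start_datetime;

SELECT * FROM @target ORDER BY business_key, dss_start_datetime;
```

Business Key 2 is not touched at all, which is the point of joining to the temporary object.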

Summary

Quite a change from my last blog post about Universal Loading Pattern. It has a lot more context now.

For me the biggest break-through was to put the emphasis on Business Keys and Surrogate Keys for any given object. Everything else will follow. And using hashes is just a technical necessity. Nothing else.

I hope it will help somebody designing loading procedures for any Data Warehouse Automation solution. Or at least being a checklist, if you have thought about all transformation rules to load data to a target.

By ecki • Business Intelligence 0

Aug. 10 2019

Data Warehouse Naming Conventions

As the number of objects in my data warehouse is increasing, structuring them became an important topic. The goal is that new employees, too, will grasp the content immediately.

There are different naming conventions on the internet, but they usually just tell about prefixing objects. This is probably the easy part. How about structuring hubs with 100s of satellites?

There is no general rule. I believe, that most other developers like to develop their own style. And when you have a look into the data warehouse maybe you can identify who was working on them. Like people who can tell, which code came from which hacker or virus programmer.

So this is my style!

Projects

In WhereScape I work with projects. A project is just a collection of objects. And these projects can be grouped.

My project numbering follows the value flow. It starts with the load from the source and ends with the export to destinations.

I understand that there is also a different, more holistic approach: combining all steps into one project and deploying that as a whole. This didn’t work for me. If we were a huge team, I would reconsider that.

Step	Project	Description
1	Load	Landing Area. Import all external data into this area
2	Data Store	Persistent Staging Area. Historize all data from the source ever received.
3	Data Vault	Build Data Vault Objects like Hubs, Links and Satellites. I don't separate Raw & Business Vault. The lineage will tell what is raw and what is calculated.
4	Data Mart	Build Data Mart Objects like Dimensions, Facts and Aggregations.
5	Reports	Reporting Area managed by BI
6	User generated objects	Reporting Area managed by the User
7	System objects	Preparing the data for any platforms we manage like SQL Server Analysis Services or Elasticsearch. Just Views.
8	Exports	Exporting Data to any foreign system. Historize all data sent.
9	Meta	Meta data, maintenance, script collection, etc.

The list above shows just the main nodes. The next 1-n digits are used to order the projects according to the execution steps.

Databases

Consequently, I use the naming convention also for my databases.

Database Name
`bi_10_load_{partner/system}`
`bi_20_data_store_{partner/system}`
`bi_30_data_vault`
`bi_40_data_mart_{area}`
`bi_50_report_{area}`
`bi_60_user_{department/user}`
`bi_70_system_{name}`
`bi_80_export_{partner/system}`

Usually I have a main database without a suffix. If I see that the source or the area will get quite big, I separate them.

Objects

Here a list of how I name my main objects:

Object Type	Naming Scheme	Description
Load	`load_{partner/system}_{table}_{version}__{group}`	Landing area. For file loading I add a version to it. See my blog. For loading data in parallel from multiple same-looking databases, I use "group" to separate them.
Data Store	`ds_{partner/system}_{table}`	Import oriented history
Data Store	`ds_exp_{partner/system}_{table}`	Export oriented history
Hub	`hub_{business_concept}_{bk_space}`	Depending on the way you save business keys (concatenated or composite) you might come into the situation to have multiple same hubs with different business keys. Separate them by business key space (it is not the source system!)
Link	`link_{hub}_{hub}..._{bk_space}`	Hubs are sorted by size and importance
Satellite	`sat_{hub}_{content}`	Hub Satellite. I don't separate them by raw or business vault. Content rules! If I'm interested what is raw and was calculated I follow the lineage.
	`lsat_{link}_{content}`	Link Satellite
	`lsat_{link}_eff`	Link Satellite for effectivity
PIT	`pit_{hub/link}`	Point-in-time object
Bridge	`bridge_{content}`	Bridge object
Dimension	`dim_{dimension}_{area}`	If there are more similar dimensions, separate them by area
Fact	`fact_{fact}_{area}`	If there are more similar facts, separate them by area
View	`{object}__{description}`	Views for specific purpose
	`{object}__current`	Shows the current active record of a business key (e.g. Data Store)
	`{object}__first`	Shows the first value of a business key
	`{object}__last`	Shows the last known value of a business key (e.g. Data Store, Satellite) In contrast to current, a record could be marked as deleted.
	`{object}__timeline`	Shows the timeline with start date and end date based on dss_load_datetime or dss_event_datetime
	`{object}__infinity`	Similar to timeline but the first start date is 0001-01-01
	`{object}__cube`	Specific transformation for cubes
	`error_{object}__{rule}`	Error views to automatically QA the data inside the object or compare against former objects for completeness of data. Are checked daily.
Stage	`stage_{object}__{step_id}_{step_name}` `stage_{object}__{step_id}_{step_name}__{sub_step_id}_{sub_step_name}`	Temporary tables and views to organize the calculation steps towards the target. Next time I will rather name that `temp` instead of `stage`. All steps report results to the upper level. Sub-steps are a kind of sub-routine that reports results back to the main steps.
Source Mapping	`{object}__{source_object}`	WhereScape specific object to map multiple sources to one target.
Procedures	`update_{object}`	Default unmodified procedure created by automation solution. Updatable.
	`custom_{object}`	Modified procedure created by developer
	`user_{}`	User-defined procedures for shared functionality
	`meta_{}`	Meta data procedures

A double underscore „__“ means „interpretation“ or „steps“. The main object name uses only single underscores to separate the words.

Columns

And here is the list about all my columns and system columns used. Quite a lot.

Column Name	Description
`link_{link}_key`	Link Hash Key
`hub_{hub}_key`	Hub Hash Key
`dim_{dimension}_key`	Dimension Key
`business key columns`	Business Keys. No rule.
`data columns`	Data columns. No rule.
`event_datetime`	If we need to persist all events (also with no changes in data). Here it will get into the dss_change_hash and therefore persisted.
`is_active`	Used as a data column in effectivity satellites. A satellite without content doesn't exist in my system.
`dss_data_set_type`	Used to help to process FULL, DELTA and CDC loads into Data Stores
`dss_event_datetime`	The event datetime can also be used in the data area. Here it is used to automate the timeline calculation
`dss_effectivity_datetime`	Calculate the start date based on dss_event_datetime
`dss_expiry_datetime`	Calculate the end date based on dss_event_datetime
`dss_load_datetime`	The first time the record appeared in my data warehouse. See my blog.
`dss_sequence_no`	Numbering of a loading batch. Either on the way in or calculating after landing the batch. Persisted in Data Stores to form a primary key, to be added as a sub-sequence key in satellites (=multi-active) or alternatively manipulate in edge cases the dss_load_datetime in satellites.
`dss_record_source`	Where does the data come from. {partner}_{database}_{schema}_{table} {file path}
`dss_change_hash`	Change Hash of all data columns. It is easier to compare only 1 column with the target to see if something has changed than to compare all columns with the target.
`dss_start_datetime`	Calculate the start date based on dss_load_datetime
`dss_end_datetime`	Calculate the end date based on dss_load_datetime
`dss_is_current`	Flag a record as current or old in Data Stores
`dss_is_active`	Flag a record as active or deleted in Data Stores
`dss_create_datetime`	Datetime when a record got inserted into the table
`dss_update_datetime`	Datetime when a record got changed in the table

The list above is also my sort order. The system sorts automatically the columns accordingly.

Templates

And for those who work with templates and automation, here my list of naming conventions:

Template Type	Naming Scheme	Description
DDL	`{company}_{technology}_ddl_{table/view}_{object}__{description}`	A DDL generator is always technology dependent. It creates views or tables. Objects are from above and descriptions are all those different views, such as `last` or `current`
Procedure	`{company}_{technology}_procedure_{object}_{special_case}`	Creates loading procedures to load data into objects. Stored procedures are technology dependent.
Script	`{company}_{technology}_script_{object}_{special_case}`	Creates loading procedures to load data into objects. Scripts are usually technology dependent. They might be written in PowerShell, Python or other languages, but they load data to a specific environment. Although they might all talk SQL, they behave different and can be optimized differently.
Block	`{company}_{technology}_block_{command}_{topic}`	Reusable patterns to build DDLs and procedures. Commands are SELECT, INSERT, UPDATE or DELETE.
Utility	`{company}_utility_{topic} {company}_{technology}_utility_{topic}`	Library to support all above. I have configuration, variables, elements, snippets, metadata, ddl, procedure, etc. Both technology agnostic and technology dependent.

I hope, somebody finds that useful. If you have more stuff to add or have another approach, please get in touch. I’m interested in learning your approach.

Apr. 13 2019

The Hub-lemma (a.k.a. Hub Dilemma)

When I started to learn and work with Data Vault, I encountered the brilliance of hubs. It has taken me some time to adapt to the idea, that a hub can be loaded from more than 1 source system – if the Business Keys are well defined.

Then I got my training and I still see a slide with my inner eye where a whole pipeline of source systems along a business process are pumped into 1 hub. With different business keys and structure.

This was very intriguing. Design an industry data model and save your data in satellites around it! What a break-through!

But… with implementation and working with the data models, questions started to arise, and solutions had to be found.

At first when I started to model my hubs, I started with what was lying in front of me. Let’s take an easy example.

We are selling nutrition products over the internet. We have multiple web shops. In every country 1. In Switzerland 2 because of multiple languages.

The source system has a database for each web shop. And customers who are buying products are saved in the database that the web shop depends on. This means if I move to another country, I have to register again. I don’t question the model, in BI we use what we get.

So how to ingest data? I have been extensively evaluating which business key is best. As I was taught in my certification class, I tried to avoid the usage of an id but failed to find a better solution. Finally, I went with shop_code and customer_id.

So, I created a hub with the following columns:

hub_customer_key
shop_code
customer_id
dss_load_datetime
dss_record_source
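As DDL, such a hub could look roughly like the following sketch. The data types and lengths are assumptions, and a temp table is used only to keep the example free of side effects.

```sql
-- Hypothetical DDL for the hub; data types and lengths are assumptions.
CREATE TABLE #hub_customer (
    hub_customer_key  BINARY(16)   NOT NULL PRIMARY KEY,  -- hash of shop_code + customer_id
    shop_code         VARCHAR(10)  NOT NULL,
    customer_id       INT          NOT NULL,
    dss_load_datetime DATETIME2(3) NOT NULL,
    dss_record_source VARCHAR(200) NOT NULL
);

DROP TABLE #hub_customer;
```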

After I had developed our model for the source system of our shop, I started to load data from our email campaign provider. They also have an id as primary key. But the better business key is email.

For each shop we have an account. It was easy to reuse the existing business key columns:

shop_code = account (same as web shop)
customer_id = email address

The only thing I had to change was the data type of customer_id. From INT to NVARCHAR. Everything was smooth.

(Of course, I could have chosen to create another hub and name it hub_recipients, but didn’t)

On top of that I built a same-as-link to travel from one system to the other.

On the horizon is another source system, but this doesn’t match the existing business key definition.

So, this is the first Hub-lemma.

Hub-lemma 1: Composite vs Concatenated Business Key

Although the hub definition matches, the business key looks different among source systems. How to model and ingest them?

Option 1: Create separate hubs

If the existing business key definition for a hub doesn’t match, create separate hubs.

hub_customer_a

shop_code
customer_id

hub_customer_b

email

The advantage is, that you have a clear view and your design doesn’t conflict with existing structure.

Option 2: Concatenate business key

Instead of saving each business key column in its own column, some are advocating to concatenate the business keys into 1 column, separated by a separator. E.g. DE|1234 as shop_code|customer_id.

The advantage is that a hub looks always the same. A hash key and ONE business key column. Great for automation.

The disadvantage is, that you need to load the business key columns into a separate satellite. Otherwise the business key is not reachable, and a need arises to decompose the hub business key, which is slow and bad.

Hub-lemma 2: Overlapping business keys

We have now found ways to load „not compatible“ business keys into hubs. What happens if, all of a sudden, a business key of a hub overlaps in meaning? The easiest example is: customer 1 of source system a is somebody else than customer 1 of source system b.

This „threat“ had been nagging me since the beginning.

Option 1: Separate by connection

Maybe you remember my last blog post about structuring files? Some of these elements can be reused.

If you have same looking databases from a source system, use the connection name as a separator.

As you read above, when I wrote about my journey, I found the same answer. I added shop_code to separate my customers from each other. In that way the customers don’t overlap.

Option 2: Separate by source system

If the source systems are really disparate and you believe no better business key can be found, then adding a source system name to the business key could be an option.

I don’t like this option. If I add the source system name as a prevention measure to everything, the whole concept of data integrating with data vault leads into absurdity and makes things over-complicated.

Option 3: Multi tenancy

What is multi tenancy? The Wikipedia description adapted to data means:

Multi tenancy refers to a data architecture in which a single instance of a business key saved in a hub serves multiple tenants.

This is nothing else than adding to every hub by design another column for the tenant.

In contrast to the previous option, it’s not about multiple source systems, it’s about multiple customers if we are selling our model to clients. We would have a management system to maintain the tenants and collect meta data about them and about the features we offer to customers, which are enabled or disabled.

Option 4: Separate hubs

Instead of combining everything into 1 hub, building separate hubs would be an option:

Instead of saying that the suffix is the source system name, I would rather say this is the „business key space“. If another source system reuses the existing business key (e.g. email), there is no need to add another source system name.

Option 5: Separate satellites

I must admit that I came up with this option when I searched for solutions against the threat of overlapping business keys a long time ago. I had a very strong belief in not having multiple hubs with the same meaning. I can’t remember who implanted that into my thoughts. Probably me ;-). My focus was to have only a few well-defined hubs and integrate any data into them.

The naming scheme would be:

sat_{hub}_{space}_{content} (Raw + Business Satellites)
sat_{hub}_{content} (Unified golden record)

My understanding was that a business key by itself has no meaning. To give meaning I would always need to join a satellite to it. At a recent meetup I learnt that others believe a single hub entry is key and you should be able to join every satellite to it.

With that option, it could happen that a hub business key has different meaning depending on the satellite I join. Usually I join only satellites with the same „business key space“ together. If I need to join data from another space, I would create a same-as-link and combine data sets from both sides and use the hub twice.

The advantage is that it looks clean and tidy. I see immediately from the satellite’s name which data I can join together. And there is no discussion about hub suffixes.

The disadvantage of that design is, that objects are getting clumsy and you need to work wisely on joins.

Maybe not the best design when I think from today’s standpoint and what I learnt so far.

Hub-lemma 3: Surrogate Key

A surrogate key is an alternate key to access a record. Maybe you have seen in dimensional modeling a dimension key which was created on a dimension and added to facts for faster and easier lookup. This is a surrogate key.

Option 1: Data Vault „0“

Dan Linstedt’s definition of a hub is: a unique list of business keys

Nowhere is it written that a surrogate key is required to get a hub running.

In his certification course he told the story about the invention. The initial data vault model had no surrogate keys.

Option 2: Data Vault 1

At the beginning of this millennium, Dan Linstedt published his paper about Data Vault. There he wrote about the usage of an INT surrogate key similar to dimensional modelling.

This pattern is very fast on joining. But slow on loading.

First, hub business keys that don’t exist yet need to get loaded. Then those newly created surrogate keys can be extracted to form the business key for a link load, which in turn provides a new surrogate key to be used in satellites. The orchestration of this loading pattern seems quite challenging to me.

I have to admit that I don’t have hands-on experience with this. Just from documentation and books.

Option 3: Data Vault 2

The sequential loading issue had been addressed when Dan published Data Vault 2. The INT surrogate key was replaced with a hash key.

Suddenly we are able to load all targets in parallel. A huge speed improvement.

The only caveat with hash keys is hash collisions. A hash collision is when suddenly 2 business keys generate 1 hash key. I don’t have one at hand but to make it clearer:

„ABC“ => 0A4561…
„ZXY“ => 0A4561…

In this fantasy model both keys generate the same hash value.

The chance that a hash collision will occur is very very very VERY small. A hash key is used within one hub. It is not used system-wide. So the hash collision could only happen within a hub. I remember reading somewhere that, to get a 50% chance of hitting a collision with MD5, I would need to insert roughly 6 billion records every second into 1 hub for the next 100 years (the birthday bound of a 128-bit hash is in the order of 2^64 values).

Not on my system. Okay, I’m safe to use it!
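As a rough sanity check, the birthday approximation for an ideal 128-bit hash can be evaluated directly. This is only back-of-the-envelope arithmetic; the insert rate is a free, hypothetical parameter of the sketch.

```sql
DECLARE @hash_bits       FLOAT = 128;  -- MD5 output size in bits
DECLARE @rows_per_second FLOAT = 6e9;  -- hypothetical insert rate

-- Birthday approximation: n ~ sqrt(2 * ln(2)) * 2^(bits / 2)
-- is the number of distinct keys for a 50% chance of at least one collision.
DECLARE @keys_for_50_percent FLOAT =
    SQRT(2 * LOG(2)) * POWER(CAST(2 AS FLOAT), @hash_bits / 2);

SELECT @keys_for_50_percent AS keys_for_50_percent,
       @keys_for_50_percent / @rows_per_second / (365.25 * 24 * 60 * 60) AS years_at_given_rate;
```

The first value is about 2.2 × 10^19 keys. At the assumed rate that is on the order of a hundred years of continuous loading into a single hub.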

Option 4: Natural Key

Back to the roots! Data Vault „0“. This topic was raised in a blog post from Roelant Vos. And I just stumbled over a blog post of Dan about this topic.

As there is no need for any surrogate key, some want to move back to the origins. From a user’s perspective this has some benefits. If I look into a satellite and see a meaningless INT value or hash key, I always need to join a hub to see the business keys.

If you go with a concatenated business key option from above this would make it quite a good combo. Having a look into a satellite and immediately understanding the business key.

Just a word of caution. Using standard „text“ on JOIN is not always safe. Especially on databases which have collations which translate and sort characters at their own discretion. E.g. é = e or ss = ß.

But who will have such keys consisting of those strange characters anyway?!? I know, nobody… Except us. We bid on keywords and people use these strange characters. I’m pushing for 100% being safe for anything. If you go for that, create the business key columns with a binary collation. Then you are safe.

Performance wise joining NVARCHAR(100) (with 200 bytes) takes a lot more resources than joining a binary MD5 hash key with 16 bytes or INTs with 4 bytes.

Summary

The hub-lemma is a broad topic. By analyzing it, we have different options. Choose which patterns help you most and consider, what you leave behind. There is no perfect solution. And if you ask for advice, everybody’s favorite answer is „it depends“ ;-).

Mar. 30 2019

OPTION(RECOMPILE)

Again, a technical topic for SQL Server. At the beginning as an analyst you are going to learn to write standard SQL queries to get your desired results.

With time your skills evolve, and you learn how to optimize the queries, maybe using a temporary table or a common table expression. The more you learn, the more you are going to use. And suddenly it appears, as if your queries are stuck. You rewrite parts of the query, try to ease the pain of the SQL engine.

And at some point you come across OPTIONs, which can be added to your table, your joins, your query or your stored procedure. They tell the SQL engine how it should optimize the execution plans. Reading and optimizing execution plans is a huge playing field. I’m far from being an expert in that field. I would like to discuss and explain certain behaviors I discovered in BI and BI only. No OLTP optimization or single record stuff. In BI I work with data sets.

As you have discovered, my loading functions are all created with Stored Procedures. My typical layout looks like this:

CREATE PROCEDURE [update_stage]
AS

  -- Initialize
  TRUNCATE TABLE [STAGE]

  -- Temporary Table for Hash Calculation
  CREATE TABLE #temp (...)  -- column definitions omitted
  INSERT INTO #temp
  SELECT *
  FROM   [SOURCE]
  JOIN   ...

  -- Insert new records
  INSERT INTO [STAGE]
  SELECT *
        ,HASHBYTES(...)
  FROM   #temp

The temporary table is needed because SQL Server sometimes doesn’t know whether it should calculate 100 million hashes first and then filter down to the 100,000 rows needed, or vice versa.

When the stored procedure runs for the first time, SQL Server compiles an execution plan and keeps it in the plan cache for later runs. If the tables are around, it gets the row counts and other metadata and tries to figure out the best way to process them.

Usually in my business vault satellites, the first load will be the biggest. The following loads will be delta. So, SQL Server sees all the data and thinks, wow, a lot to process. Well, THIS execution plan would be the best to process THAT lot of data.

On the following loads, it sees the data and the already existing execution plan and thinks, okay, we have a plan, so let’s get it done. It takes the assumptions of the previous run and applies them to the small delta load. And this can lead to longer running queries or even „hanging“ queries. Again, I’m no expert, I just observe.

Now the OPTIONs come into play, in particular OPTION(RECOMPILE). If you google for information about it, there are plenty of voices who suggest not using it. Instead, improve your query writing skills, build better indexes and so on. What they don’t mention is that they are writing those queries for OLTP applications. There, speed is king. If you have tons of transactions and need to RECOMPILE every time, it is a waste of time. Totally agreed.

But in BI we have a little bit more time. At least on my system. It still needs to be fast, but I don’t need to update my objects every minute. I’m willing to wait a second to get a decent execution plan. It hardly ever takes longer.

So, what options do we have:

RECOMPILE Stored Procedure

CREATE PROCEDURE [update_stage]
WITH RECOMPILE
AS
  ...

WITH RECOMPILE on stored procedure level is a very drastic measure. Always recompile everything. Each and every piece. Therefore, no cache and no stats. No way to sniff what SQL Server actually did when the procedure ran.

RECOMPILE Query

CREATE PROCEDURE [update_stage]
AS

  -- Temporary Table for Hash Calculation
  INSERT INTO #temp
  SELECT *
  FROM [SOURCE]
  JOIN ...
  OPTION(RECOMPILE)

  -- Insert new records
  INSERT INTO [STAGE]
  SELECT *
        ,HASHBYTES(...)
  FROM #temp
  OPTION(RECOMPILE)

As you can see, each statement has its own OPTION(RECOMPILE). In that way, we are still able to analyze the stats afterwards – if you ever want to.

It is just nicer to still have some control than using the club-like option WITH RECOMPILE on stored procedure level.
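One possible way to look at those stats afterwards is the DMV sys.dm_exec_procedure_stats. This is only a sketch: it assumes the procedure is called update_stage as in the example above, that it lives in the current database, and that your login has VIEW SERVER STATE permission. A row only exists while the procedure’s plan is in the plan cache.

```sql
-- Runtime statistics of the cached plan of one stored procedure.
SELECT OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name
      ,ps.cached_time
      ,ps.last_execution_time
      ,ps.execution_count
      ,ps.total_elapsed_time / 1000 AS total_elapsed_ms  -- the DMV reports microseconds
      ,ps.total_logical_reads
FROM   sys.dm_exec_procedure_stats AS ps
WHERE  ps.database_id = DB_ID()
  AND  ps.object_id   = OBJECT_ID(N'update_stage');
```

If the query returns no row, the plan is currently not cached, for example because the server was restarted or the plan was evicted.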

Conclusion

Since I introduced this measure in my procedures, I’m no longer surprised by waits or „hanging“ queries.

By ecki • Business Intelligence 0

Nov. 10 2018

How to organize a Raw Data Lake

How many times did you hear in the last few years the words „Data Lake“? And how many times have you asked yourself how it should be organized? And how often did you ask Google for more information and didn’t get any valuable results and only funky buzzwords? I did a lot and found not enough answers.

Over time, I figured out for myself an optimal design for organizing the very first layer of a data lake, or in old-school language a ‚file repository‘. And recently I completed it by asking the experts at a DDVUG meetup, where Michael Olschimke showed us their solution. Thanks Michael.

So, what layers have I found:

/source

This layer represents the data provider. If it is from an external partner, I name it after him. If it comes from an internal system, I name it after that.

It could also be the technology that is the master of all the data below it, e.g. mysql, kafka, etc.

There are also tons of files which are produced by my company and don’t fall into any category. E.g. Excel-Files, manually maintained data, etc. I set the structure above anything, so we as a company are a source too.

/connection

If there is a chance of having multiple connections for the same source, add a connection layer. If there is the slightest chance of having another set of credentials, separate it.

/operation

Usually I start by collecting data and processing it. But at some point in the future I will start to pump data in the opposite direction. So where to put these files? It could be a completely different file repository. Or it can be combined in this repository.

This is the place to specify the kind of operation we do. I have:

Import
Export
Update (log files of operations I perform on the source system)
…

/object type

This layer could be used to group the source objects into a structure. E.g.

tables
meta
log
…

It could also be a directory structure like database_name/schema_name if data from SQL systems is to be saved.

/object

This is the data object. In data warehousing this is the table name. Another way to describe it is to think of all the directories and files below it as summed up at this level. If a file doesn’t match the object definition or object meaning, create another one.

I had difficulty finding a proper word for this layer. Maybe ‚content‘ fits better. Choose your favorite.

Examples are ‚customer‘, ‚stock‘ or something like ‚marketing_cost‘. You get the point.

/data set type

What happens if we have different data sets for the very same object?

I have seen the following data sets so far:

Full
Delta
CDC

Full means that we get the files every day in full. E.g. a customer table export from an application. Or an Excel-File which is modified by the users.

Delta means that we only get changed records. E.g. log files, transaction files.

CDC (Change Data Capture) is a very detailed transaction log file of a database having all INSERTs, UPDATEs and DELETEs of the source system. Usually it is ingested directly from a database to another database.

If the rows in a file are just data, why do we need to separate Full and Delta at all? The only reason for separating them is if we need to detect deleted records in the source system. Then I’m able to create a status tracking satellite or, in non-data vault systems, to delete the record or mark it as deleted.

The difference between Delta and CDC is, that CDC contains information about deleted records.
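To illustrate why only a Full data set allows this kind of delete detection, here is a minimal sketch. The two temp tables and the key column customer_no are hypothetical; they stand for yesterday’s and today’s full export of the same object.

```sql
-- Hypothetical key lists of two consecutive full exports.
CREATE TABLE #full_previous (customer_no NVARCHAR(20) NOT NULL);
CREATE TABLE #full_current  (customer_no NVARCHAR(20) NOT NULL);

INSERT INTO #full_previous (customer_no) VALUES (N'C-001'), (N'C-002'), (N'C-003');
INSERT INTO #full_current  (customer_no) VALUES (N'C-001'), (N'C-003'), (N'C-004');

-- Keys that were in the previous full export but are missing now:
-- these records were deleted in the source system.
SELECT customer_no FROM #full_previous
EXCEPT
SELECT customer_no FROM #full_current;

DROP TABLE #full_previous;
DROP TABLE #full_current;
```

With a Delta file the same comparison would be meaningless, because a missing key there only means that the record did not change.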

/version

Version? What is a version for? This principle I derived from APIs. E.g. Google API evolved from v3 to v4. What happens, if the source suddenly decides to modify the file structure? E.g. translating all column names from German to English? Or modifying from csv to json? Adding columns is not a new version. The loading procedure should cope with that.

The main reason for a new version is, if the existing loading procedure can’t process the file. Then add a new version and write a new loading procedure.

My default version is ‚1‘. I have only a few objects where a ‚2‘ had to be created. But I want to be prepared! Modifying an existing layer after a need occurs is not future-proof. Tweaking loading procedures to understand where something ends and something new starts is bad.

/load_date

If you are going to model your data lake or file repository to build data vault objects, this layer is a must have. You’ll want to preserve, when the data first appeared in your data lake. The definition of the load_date can be found here.

All others can probably omit that.

/group

Sometimes I need to ingest data by different characteristics. E.g. per company, per shop, per account, per machine, per anything. Per day is not a valid group (see below in event date).

Usually the differentiator is not part of the file name but part of the directory structure of the source system. Here is the place to preserve and save it.

The semantics of a group and connection could be kind of equal. Sometimes it makes sense to use it as a connection above or as a group here.

Note to myself 1: Why not place group before data_set_type? Because we need a loading procedure for each data_set_type. Groups and the rest can get looped through.

Note to myself 2: Why not place group before load_date? Could be, but a load_date should be a single entity. It serves better for incremental loading instead of having it inside each group. A group is data and data comes at the end.

Note to myself 3: Why not place group after event_date? Sure, feel free.

/event_date

First question: Does an event date exist as content inside the file? If yes, all good! No need for an extra layer. Nevertheless, it could be created in a Hive environment to partition the files for speed purposes.

Second question: Should the file have an event date? If yes and we don’t have it, we need to add this somehow to the path.

Maybe the file has a string in the filename like ‚YYYY-MM-DD‘ or an integer value representing a timestamp. Then add this information as an event_date.

If there is nothing available, use the creation datetime of the file as a last resort.

Structure it as /YYYY-MM-DD or /year/month/day. As you like.
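If the file names are available in a table, pulling a ‚YYYY-MM-DD‘ string out of them could be sketched like this. The file names are made up, and the query only handles the first date-like pattern it finds; it is an illustration, not a complete parser.

```sql
-- Hypothetical file names; only the YYYY-MM-DD pattern is taken from the text above.
SELECT f.source_file_name
      ,TRY_CONVERT(date
                  ,SUBSTRING(f.source_file_name, NULLIF(p.pos, 0), 10)
                  ,23) AS event_date  -- style 23 = yyyy-mm-dd; NULL if nothing usable is found
FROM  (VALUES
         ('orders_2018-11-10.csv')
        ,('2018-11-09_stock_export.json')
        ,('customer_full.csv')
      ) AS f(source_file_name)
CROSS APPLY (
        -- Position of the first substring that looks like a date, 0 if there is none.
        SELECT PATINDEX('%[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]%', f.source_file_name) AS pos
      ) AS p;
```

TRY_CONVERT also returns NULL for strings that match the pattern but are not valid dates, so such files can be routed to the fallback (the file’s creation datetime).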

/missing

Do we still miss something? If a key information is missing inside the file, add it as a layer. Manipulating files to add a missing column is a no-go!

The only way possible to add more information is to create them outside the file content. Maybe as another layer.

Actually group and event_date from above are just examples of missing data.

/file.ext

Finally, our source file.

If you ever need the creation datetime of the file, preserve it by renaming the file. It doesn’t survive the copy process to any cloud platform and is lost.

And don’t give too much meaning to a filename. To extract data from the filename is difficult and needs precise instructions, where to look for what. No automation.

A file represented by a filename is just data!

Summary

All these components can be used as directories or if your system supports it, as meta data to the files. If you use meta data, you will probably find and use even more tags to describe your files.

Choose which elements from above are important for you. In a directory structure, the full string will look like this:
/source/connection/operation/object_type/object/data_set_type/version/load_date/group/event_date/file.ext

A very very long path and people might get lost. But computers will like it, as many pieces can help to automate the loading procedures. Some systems even suggest using key=value as a directory name, e.g. year=2018 or shop_code=de. This makes it even easier to automate.
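As a small illustration, both notations can be assembled from the same layer values. All values below are hypothetical examples, and the key names in the key=value variant are simply the layer names from above.

```sql
-- All values are hypothetical examples of the layers described above.
DECLARE @source        VARCHAR(50)  = 'webshop'
       ,@connection    VARCHAR(50)  = 'de'
       ,@operation     VARCHAR(50)  = 'import'
       ,@object_type   VARCHAR(50)  = 'tables'
       ,@object        VARCHAR(50)  = 'customer'
       ,@data_set_type VARCHAR(50)  = 'full'
       ,@version       VARCHAR(10)  = '1'
       ,@load_date     VARCHAR(10)  = '2018-11-10'
       ,@group         VARCHAR(50)  = 'shop_de'
       ,@event_date    VARCHAR(10)  = '2018-11-09'
       ,@file          VARCHAR(100) = 'customer.csv';

SELECT CONCAT('/', @source, '/', @connection, '/', @operation, '/', @object_type
             ,'/', @object, '/', @data_set_type, '/', @version, '/', @load_date
             ,'/', @group, '/', @event_date, '/', @file) AS plain_path
      ,CONCAT('/source=', @source, '/connection=', @connection, '/operation=', @operation
             ,'/object_type=', @object_type, '/object=', @object
             ,'/data_set_type=', @data_set_type, '/version=', @version
             ,'/load_date=', @load_date, '/group=', @group
             ,'/event_date=', @event_date, '/', @file) AS key_value_path;
```

The key=value variant is longer, but a loader can read each layer by name instead of relying on its position in the path.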

The absolute minimum structure would be in my opinion:
/source/object/version

If you believe that a table with 2 versions could be automated too, then the 2nd version is not of importance ;-).

For ingesting data, we need a procedure for each combination of
/source/object/data_set_type/version

If you believe that there are way too many directories and you don’t need all these nifty things, don’t use load_date and modify the filename to include the omitted layers:
{group}__{event_date}__original_file_name.ext
{group}__{event_date}__{object}.ext

Please tell me your solution, if you believe I missed a case.

Update 16.08.2020

In the last two years I have been working with that model. So far I didn’t miss any type and I didn’t find any new layer. It was a joy working with it. Every file easily found its place.

What I discovered was that I would merge the layers data set type and version and keep only version. Even where it was clear that the data set type of a file was full, I changed it to delta while loading. And I found a new data set type: Multiple full. An inventory file which consisted of multiple full data sets, each for a different date.

Therefore the data set type is deferred to how the data is transformed to Data Store or Data Vault.

By ecki • Business Intelligence 0

Oct. 21 2018

Praise be to the LoadDate

The best invention in my Business Intelligence career is the adoption of the concept of having a LoadDate.

This should be a post to give credit to Dan Linstedt and thank him for inventing Data Vault and introducing LoadDate.

When I encountered the hub-and-spoke concept of Data Vault I thought, why on earth do I need a LoadDate? Combining in a Satellite the HubHashKey with the LoadDate to form the Primary Key makes sense. But why would I need it in a Hub or a Link? Just to know when a Hub entry was first introduced to the data warehouse? The usage is universal and far reaching!

What is the LoadDate? The definition is:

„The LoadDate is the date and time of insert for the record“

It is the moment when a record is first inserted into a Data Warehouse. It is not the extract date from the source or the current datetime in the source system. It is really the INSERT datetime in my Data Warehouse! The first time a record exists.

How to ingest it:

Save the value in a variable and insert it with the data set into the load table
Set a default constraint like SYSDATETIME() on the LoadDate column

What about when the insert takes a long time with a lot of data? Do I get a LoadDate for every second it loads? This would be very precise. In my world one LoadDate is one batch. One batch could be the load of one file. The next file has a new LoadDate. When loading data in a loop of something, each loop has a new LoadDate.
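Both ways of ingesting the LoadDate can be sketched as follows. The load table, its columns and the sample rows are hypothetical; only the column name dss_load_datetime follows the naming used elsewhere in this post.

```sql
-- Hypothetical load table; dss_load_datetime falls back to SYSDATETIME() if no value is supplied.
CREATE TABLE #load_customer (
    customer_no       NVARCHAR(20)  NOT NULL
   ,customer_name     NVARCHAR(100) NOT NULL
   ,dss_load_datetime DATETIME2(7)  NOT NULL DEFAULT SYSDATETIME()
);

-- Option 1: one variable per batch, so every row of the batch carries the same LoadDate.
DECLARE @load_datetime DATETIME2(7) = SYSDATETIME();

INSERT INTO #load_customer (customer_no, customer_name, dss_load_datetime)
SELECT src.customer_no
      ,src.customer_name
      ,@load_datetime
FROM  (VALUES (N'C-001', N'Alice'), (N'C-002', N'Bob')) AS src(customer_no, customer_name);

-- Option 2: leave the column out and let the default constraint fill it.
INSERT INTO #load_customer (customer_no, customer_name)
VALUES (N'C-003', N'Carol');

-- Rows per LoadDate, i.e. rows per batch.
SELECT dss_load_datetime
      ,COUNT(*) AS row_count
FROM   #load_customer
GROUP BY dss_load_datetime
ORDER BY dss_load_datetime;

DROP TABLE #load_customer;
```

The variable makes the „one LoadDate is one batch“ rule explicit, also when a batch consists of several INSERT statements. Note that two batches started within the same clock tick could end up with the same value.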

When the record has been inserted into the load table, the LoadDate stays with the record. Wherever it goes! Like a tracking marker.

The LoadDate is used in:

Load
Data Store
Hub
Link
Satellite
Dimension
Facts

I use it everywhere!

In a Data Store (aka Persistent Staging Area) I use it to order the data and batch-load the data to the target. Together with the SequenceNo it forms also the Primary Key.

In Link and Hub I know when the first appearance of a business key happened.

In Raw Vault Satellites I use it to load data incrementally from the Data Store.

In Business Vault Satellites where multiple source tables are joined together, which LoadDate succeeds? If history is important, I use it to create mini-Point-in-Time tables. If history is not important, I use the most recent record and calculate the MAX value of all LoadDates.

SELECT *
      ,(
         SELECT MAX(v)
         FROM (VALUES
                 (table_1.dss_load_date)
                ,(table_2.dss_load_date)
                ,...
              ) AS VALUE(v)
       ) AS dss_load_datetime
...

In that way I can use delta loading and it fits perfectly to my Universal Loading Pattern.

How can I use it in Dimensions & Facts? The main question in those objects is, when do I need to UPDATE columns with new data? When I was researching the best way, I first calculated a ChangeHash. Very time consuming! After discovering the formula above, I thought, why not use the LoadDate for change detection? If for a given Business Key the LoadDate changes, chances are high that the record needs an UPDATE.

With this approach I’m much faster now. And I’m also faster in preparing the data in a stage table as I can use the LoadDate for the delta loading layer too.
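A minimal sketch of that change detection, assuming a hypothetical stage table and dimension that share the business key customer_no. Table names, columns and rows are made up for the illustration.

```sql
-- Hypothetical dimension and stage tables keyed by the business key customer_no.
CREATE TABLE #dim_customer (
    customer_no       NVARCHAR(20)  NOT NULL
   ,customer_name     NVARCHAR(100) NOT NULL
   ,dss_load_datetime DATETIME2(7)  NOT NULL
);
CREATE TABLE #stage_customer (
    customer_no       NVARCHAR(20)  NOT NULL
   ,customer_name     NVARCHAR(100) NOT NULL
   ,dss_load_datetime DATETIME2(7)  NOT NULL
);

INSERT INTO #dim_customer (customer_no, customer_name, dss_load_datetime)
VALUES (N'C-001', N'Alice', '2018-10-01')
      ,(N'C-002', N'Bob',   '2018-10-01');

INSERT INTO #stage_customer (customer_no, customer_name, dss_load_datetime)
VALUES (N'C-001', N'Alice',  '2018-10-01')   -- unchanged LoadDate: no update needed
      ,(N'C-002', N'Robert', '2018-10-20');  -- new LoadDate: candidate for an update

-- Update only rows whose LoadDate differs; no ChangeHash over all columns required.
UPDATE dim
SET    dim.customer_name     = stage.customer_name
      ,dim.dss_load_datetime = stage.dss_load_datetime
FROM   #dim_customer   AS dim
JOIN   #stage_customer AS stage
  ON   stage.customer_no = dim.customer_no
WHERE  stage.dss_load_datetime <> dim.dss_load_datetime;

SELECT customer_no, customer_name, dss_load_datetime
FROM   #dim_customer
ORDER BY customer_no;

DROP TABLE #dim_customer;
DROP TABLE #stage_customer;
```

Only C-002 is touched by the UPDATE. The comparison is a single column per row, which is where the speed gain over a ChangeHash comes from.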

Are there any other similar concepts? In SQL Server one could use rowversion (formerly called timestamp). It is a database-wide incrementing number stored as binary. But copying data into the next table will create a new rowversion unless you extract the value and save it as a normal value. In my view not easy to handle and not readable. How do you know when record 1234 was inserted?

Dan Linstedt: Thanks for introducing LoadDate to the Business Intelligence Community and forcing users to use it correctly. It is of tremendous help! I use it in batch loading, delta loading and change detection (only in Dimensions & Facts). I couldn’t do without it anymore. Thanks a lot!

By ecki • Business Intelligence 1

Sep. 1 2018

My indexing journey

The unprepared

From my background as an accountant and controller I had by nature a good feeling for numbers. This helped me to understand, how to structure data.

A couple of years ago when I started to work in a startup, I usually grabbed data directly from source systems to support our cause to tackle financial issues and finding process deficits.

Over time the queries grew in size and run-time got longer. When the run-time was no longer satisfactory and our Excel sheets with tons of formulas and recalculations broke, the need arose to build something more reliable.

First, I created queries and jobs which persisted data in advance. Just a bunch of reports. But we wanted to give these data also to users. So, we started to build a data warehouse with dimensions and facts and giving access to them with a nice tool. My data journey from Kimball Star Schema to Data Vault is for another blog post. I want to focus here on indexes.

My first objects had no indexes. When performance degraded, I started to investigate. In SQL Server, there is a feature called „Database Engine Tuning Advisor“. I would consider its suggestions and create indexes and statistics.

Time passed and processing time was growing again. At one point, I analyzed the indexes I had created and thought, what would happen if I dropped all indexes? Wow, the processing time decreased from 3.5 hours to 2 hours.

Lessons learned: Don’t create too many indexes. Index creation also uses processing time.

Proper Indexing

My next adventure was about clean data. From my background as an accountant, I want to be sure that my numbers are always correct. My nightmare situation is, when data is not correct and numbers, especially financials don’t match.

Sometimes my Dimensions, Facts or Stage tables would create cartesian products (duplicates). The solution was to make sure, that I would never ever get duplicates. I started to use UNIQUE indexes.

The confidence in my data grew, because any error would alert me.

The downside was that when something broke, parts of my whole pipeline stopped at that stage. If the processing time is 6 – 8 hours and users want data at 9, what happens if my process breaks in the night?

We set up a shift plan among our team, so that the first thing after getting up was to check emails on the phone to see whether everything was running smoothly. This would save time until my users got their data. Not a satisfying solution, but with indexes breaking the load, what should I do?!?

Lessons learned: Indexes are great for data quality. But can break loading processes.

Exploring Cloud Databases

At a fast-growing startup, available data explodes. We got into the habit of regularly replacing our SQL Server to accommodate our increasing data needs.

We figured out fast, that always increasing the size of our server is not satisfying. At that time Hadoop with MapReduce made its appearance. Just adding commodity hardware. No high-performance expensive single-server anymore.

I’m a SQL guy. The thought of somehow replacing all my code with MapReduce was traumatic.

To leverage that, I started to analyze some cloud offerings such as Redshift and others.

When testing those analytical platforms, I discovered that none of them supported constraints. No primary key, no UNIQUE indexes. Even nowadays Azure SQL Data Warehouse does not support them. Their primary goal and use is to analyze a lot of data fast without any obstacles.

Their answer to my concerns was that the application must make sure that there will be no duplicates. BI usually has no application.

For my accountant’s heart this was kind of a no-go. How to make sure that my or my colleagues’ programming will always be correct?

Lessons learned: Cloud databases are not always the answer.

BI automation

One way to ensure good results is to make sure that data processing will always be the same. This is also something Dan Linstedt inspired in me when he showed me the maturity diagram, which he translated to a data warehouse. On the highest level you would have:

BI automation
Parallel Teams
Rapid Delivery

This was something I wanted to achieve. I can say now, that I’m able to produce best results for any objects I’m loading data in. Give me more developers and all will produce great data!

Lessons learned: Using a framework to support data processing application is a big win!

No indexes

A year ago when I migrated to SQL Server 2016, I discovered a new index type. Actually, it is rather a new table storage type. Instead of a row store, data is now stored in column stores. Think of saving in Excel column A, B, C, etc. instead of rows 1, 2, 3.

Over time, I added those CLUSTERED COLUMNSTORE INDEXES to all my objects (except stage tables). The immediate effect was that I saved a lot of disc space. Usually only 10 – 20% of the former disc space is used now.
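For readers who have not used them yet, converting a table could look like the following sketch. The table dbo.demo_sat_customer and the index name are hypothetical, the example creates and drops a real table in the current database, and it assumes a SQL Server version and edition that supports clustered columnstore indexes.

```sql
-- Hypothetical satellite table, created as a plain row store (heap) first.
CREATE TABLE dbo.demo_sat_customer (
    hub_hash_key      BINARY(16)    NOT NULL
   ,dss_load_datetime DATETIME2(7)  NOT NULL
   ,customer_name     NVARCHAR(100) NOT NULL
);

-- Convert the whole table from row store to column store.
CREATE CLUSTERED COLUMNSTORE INDEX cci_demo_sat_customer
    ON dbo.demo_sat_customer;

-- Reports reserved and used space; run it before and after the conversion
-- on a table with real data to see the compression effect.
EXEC sp_spaceused N'dbo.demo_sat_customer';

DROP TABLE dbo.demo_sat_customer;
```

On an empty demo table the space figures say nothing, of course. The savings only show on tables with a realistic number of rows.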

On top of that I still had my NONCLUSTERED indexes. With time and analyzing execution plans I discovered, that SQL Server hardly uses any of my NONCLUSTERED indexes. And when it uses them, the execution plan might include nasty „Nested Loops“ instead of „Hash Matches“ to join 2 data sets together.

With growing confidence in my BI automation templates, I’m now in the situation that my application makes sure that anybody in my team produces great results and that I can remove any UNIQUE constraints.

The immediate effect was that our processing speed increased. Why? Data is no longer loaded into the database’s log file for checking the UNIQUE constraints before inserting the data, but inserted immediately without any delay. Like any other cloud database.

I’m now kind of on par with the cloud offerings. There is no need to migrate to cloud because of speed issues. Any indexes, especially UNIQUE indexes, are speed blockers. They don’t have them, I don’t have them. Another benefit was disc space. The indexes were using up to 2 times the space of a CLUSTERED COLUMNSTORE table. Saving space and time!

Do I have a safeguard routine? Of course! As the BI automation solution has all the metadata, I regularly query for duplicates.
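Such a duplicate check can be as simple as the following sketch. The hub table and its rows are hypothetical; in an automated setup the table and key column names would come from the metadata.

```sql
-- Hypothetical hub without any UNIQUE constraint; C-001 was loaded twice.
CREATE TABLE #hub_customer (
    hub_hash_key      BINARY(16)   NOT NULL
   ,customer_no       NVARCHAR(20) NOT NULL
   ,dss_load_datetime DATETIME2(7) NOT NULL
);

INSERT INTO #hub_customer (hub_hash_key, customer_no, dss_load_datetime)
VALUES (HASHBYTES('MD5', N'C-001'), N'C-001', '2018-09-01')
      ,(HASHBYTES('MD5', N'C-001'), N'C-001', '2018-09-01')
      ,(HASHBYTES('MD5', N'C-002'), N'C-002', '2018-09-01');

-- Every key that appears more than once violates the (now only logical) unique constraint.
SELECT hub_hash_key
      ,COUNT(*) AS row_count
FROM   #hub_customer
GROUP BY hub_hash_key
HAVING COUNT(*) > 1;

DROP TABLE #hub_customer;
```

An empty result means the logical constraint still holds. Any returned row points to a key that needs investigation.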

When I look back, I started with no indexes and ended with no indexes. The difference is, that with BI automation I increased data processing quality and with CLUSTERED COLUMNSTORE a new technology emerged.

From now on I don’t need to grab my phone at first sight in the morning. The reasons for breaks now are usually external dependencies.

Lessons learned: Sleep is a precious good!

By ecki • Business Intelligence 2

June 16 2018

Timeline calculation for Satellite views

My last blogpost about the universal loading pattern has generated a huge number of visitors to my blog. In the recent months I have been working and refining the model.

Today I would like to talk about the last part in that model. Views based on satellites. Satellites are usually built with INSERT only. Any logic interpreting the data should be deferred to a view. What kind of views do I have in our system:

First: Not used yet but possible, retrieves the first record of a satellite’s surrogate key
Last: Retrieves the last valid record. Could also be called „current“
Timeline: Generates a timeline for LoadDatetime and EventDatetime. EndDate is 9999-12-31.
Infinity: Generates a timeline for LoadDatetime and EventDatetime. In contrast to the previous timeline the first record in the timeline has StartDate = 0001-01-01.

Especially the last option „infinity“ has been of great help. For effectivity satellites it was important to get the timeline as wide as possible. But it also helps in other circumstances, like adding purchasing costs to a sales order line. Instead of searching for the first valid purchase cost, which might be dated after the first sale, I can be sure that I always get a price.

At first, I had a configurable timeline where I could „overwrite“ the first StartDate. Over time it became difficult to distinguish timelines with a wide date bandwidth and those without. To make it clearer for us, we decided to create 2 separate views if needed. A traditional timeline and a wide timeline. „Infinity“ sounds great for that.

Let’s see. In our database, satellites can have 2 date columns (of course even more, but I automated those).

LoadDatetime
EventDatetime

Instead of using LoadDatetime as the first value to calculate timelines on, we use dedicated columns to represent the timelines. This has the advantage that I don’t „overwrite“ the LoadDatetime to represent ‚0001-01-01‘. This is also the strategy of WhereScape, which is Data Vault compliant.

What are the names used:

LoadDatetime
- StartDatetime
- EndDatetime
EventDatetime
- EffectivityDatetime
- ExpiryDatetime

Let’s get to the code. Easy stuff first.

First & Last Value views

One could think that calculating a timeline first and then filtering on EndDatetime = ‚9999-12-31‘ would be a nice solution. Still doable, but not fast.

The fastest option I found was to calculate it like that.

SELECT *
FROM   satellite
INNER JOIN (
    SELECT hub_hash_key AS hub_hash_key
          ,MAX(dss_load_datetime) AS dss_load_datetime
    FROM   satellite
    GROUP BY hub_hash_key
) AS last_rows
ON  last_rows.hub_hash_key = satellite.hub_hash_key
AND last_rows.dss_load_datetime = satellite.dss_load_datetime

Nice, easy and fast. For the first value replace the MAX function with MIN.

Timeline & Infinity views

Calculating only one timeline from a date is easy. Calculating it with 2 date columns is just a little trickier because of odd but understandable behavior. A bi-temporal view will come at some point in the future, once I grasp it.

Timeline calculation uses a lot of window functions and execution plans are quite bad, as the data is sorted back and forth.

What are the requirements:

dss_load_datetime/dss_event_datetime: Stays as it is
dss_start_datetime/dss_effectivity_datetime: Takes the datetime. For infinity views it should represent the first value of the timeline as ‚0001-01-01‘
dss_end_datetime/dss_expiry_datetime: Takes the next datetime or if there is none ‚9999-12-31‘

What will be the PARTITION BY clauses for each timeline:

dss_load_datetime: The hub_hash_key of the satellite
dss_event_datetime: The hub_hash_key or the driving_key of the satellite

What is the driving key? The driving key is useful for effectivity satellites to force a 1:m instead of a m:n relationship. A good blog post which describes it in-depth can be found on Roelant Vos’s Blog.

From my universal loading pattern blog post, you see that I organized all logic in sub-queries. To calculate the timeline, I need 3 sub-queries.

Sorting the whole dataset by dss_load_datetime
Calculate dss_start_datetime and dss_end_datetime
Calculate dss_effectivity_datetime and dss_expiry_datetime

Sorting dataset

I use this function not only in satellites but also for calculating timelines in stage tables. Sometimes a cartesian product happens and produces bad results. In a cartesian product 2 or more lines look the same. It can happen that for the LoadDatetime timeline the first row is used and for the EventDatetime timeline the second, resulting in both records being returned and violating a unique constraint.

Therefore, I sort the whole dataset by dss_load_datetime and use that number in the following queries.

ROW_NUMBER() OVER (PARTITION BY hub_hash_key ORDER BY dss_load_datetime ASC) AS dss_row_no

Calculate LoadDatetime timeline

For normal timelines just take the dss_load_datetime as dss_start_datetime. For infinity timelines where we want to set the first dss_start_datetime as '0001-01-01', we need to figure out, if this record is the first or not.

CASE WHEN LAG(dss_load_datetime,1,NULL) 
          OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) IS NULL
     THEN '0001-01-01'
     ELSE dss_load_datetime
END

For dss_end_datetime we need the following window function query:

LEAD(DATEADD(ns,-100,dss_load_datetime),1,CAST('9999-12-31' AS DATETIME2(7))) 
OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC )

As you can see, we order both queries by the dss_row_no from the previous query. As we need both window functions LEAD and LAG, SQL Server will sort the data back and forth. I haven’t found another solution for that yet.

Currently I use a DATEADD function to subtract 1 time tick (100 ns, the smallest step of DATETIME2(7)) to be able to use BETWEEN queries. If you don’t need that, just remove the DATEADD function.

Calculate EventDatetime timelines

The calculation looks like the previous one, except that we can partition the data by hub_hash_key or driving_key.

In contrast to satellites where we can be sure, that we don’t get any duplicates per hub_hash_key and dss_load_datetime we can get duplicates on dss_event_datetime. Instead of just sorting the data by dss_event_datetime we use also dss_load_datetime represented by dss_row_no for sorting.

The calculation for the dss_effectivity_datetime looks like the following:

CASE WHEN LAG(dss_event_datetime,1,NULL) 
          OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC) IS NULL
     THEN '0001-01-01'
     ELSE dss_event_datetime
END

The calculation for the dss_expiry_datetime looks the same as for the dss_end_datetime.

LEAD(DATEADD(ns,-100,dss_event_datetime),1,CAST('9999-12-31' AS DATETIME2(7)))
OVER (PARTITION BY driving_key ORDER BY dss_event_datetime ASC,dss_row_no ASC)

Again, the DATEADD function can be removed.

You see that we can end up with 4 sorting functions which can have a huge impact on performance. Create 2 separate views or persist them (or not).
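Put together, the three steps could look like the following sketch of an infinity view. It is written with CTEs instead of nested sub-queries, the satellite #sat_price and its rows are hypothetical, and because the sample has no driving key, both timelines are partitioned by hub_hash_key.

```sql
-- Hypothetical satellite rows; column names follow the snippets above.
CREATE TABLE #sat_price (
    hub_hash_key       BINARY(16)   NOT NULL
   ,dss_load_datetime  DATETIME2(7) NOT NULL
   ,dss_event_datetime DATETIME2(7) NOT NULL
   ,price              DECIMAL(9,2) NOT NULL
);

INSERT INTO #sat_price (hub_hash_key, dss_load_datetime, dss_event_datetime, price)
VALUES (HASHBYTES('MD5', N'A'), '2018-06-01', '2018-05-30', 10.00)
      ,(HASHBYTES('MD5', N'A'), '2018-06-05', '2018-06-04', 12.00)
      ,(HASHBYTES('MD5', N'B'), '2018-06-02', '2018-06-01',  7.50);

WITH sorted AS (
    -- Step 1: number the rows per key by LoadDatetime.
    SELECT s.*
          ,ROW_NUMBER() OVER (PARTITION BY s.hub_hash_key
                              ORDER BY s.dss_load_datetime ASC) AS dss_row_no
    FROM   #sat_price AS s
),
load_timeline AS (
    -- Step 2: LoadDatetime timeline (infinity variant: first row starts at 0001-01-01).
    SELECT sorted.*
          ,CASE WHEN LAG(dss_load_datetime, 1, NULL)
                     OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) IS NULL
                THEN CAST('0001-01-01' AS DATETIME2(7))
                ELSE dss_load_datetime
           END AS dss_start_datetime
          ,LEAD(DATEADD(ns, -100, dss_load_datetime), 1, CAST('9999-12-31' AS DATETIME2(7)))
                OVER (PARTITION BY hub_hash_key ORDER BY dss_row_no ASC) AS dss_end_datetime
    FROM   sorted
)
-- Step 3: EventDatetime timeline, with dss_row_no as tie-breaker for equal event datetimes.
SELECT hub_hash_key
      ,price
      ,dss_load_datetime
      ,dss_start_datetime
      ,dss_end_datetime
      ,dss_event_datetime
      ,CASE WHEN LAG(dss_event_datetime, 1, NULL)
                 OVER (PARTITION BY hub_hash_key
                       ORDER BY dss_event_datetime ASC, dss_row_no ASC) IS NULL
            THEN CAST('0001-01-01' AS DATETIME2(7))
            ELSE dss_event_datetime
       END AS dss_effectivity_datetime
      ,LEAD(DATEADD(ns, -100, dss_event_datetime), 1, CAST('9999-12-31' AS DATETIME2(7)))
            OVER (PARTITION BY hub_hash_key
                  ORDER BY dss_event_datetime ASC, dss_row_no ASC) AS dss_expiry_datetime
FROM   load_timeline
ORDER BY hub_hash_key, dss_row_no;

DROP TABLE #sat_price;
```

Because each end value is one tick before the next start value, a lookup such as `WHERE @as_of BETWEEN dss_start_datetime AND dss_end_datetime` (with a hypothetical @as_of variable) matches exactly one row per key.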

Happy sorting!

By ecki • Business Intelligence 1
