The Hub-lemma (a.k.a. Hub Dilemma)

When I started to learn and work with Data Vault, I encountered the brilliance of hubs. It took me some time to adapt to the idea that a hub can be loaded from more than one source system – if the business keys are well defined.

Then I got my training, and I can still see a slide in my mind's eye where a whole pipeline of source systems along a business process is pumped into one hub – with different business keys and structures.

This was very intriguing. Design an industry data model and save your data in satellites around it! What a breakthrough!

But… once I started implementing and working with the data models, questions arose, and solutions had to be found.

When I first modelled my hubs, I started with what was lying in front of me. Let's take an easy example.

We are selling nutrition products over the internet. We have multiple web shops: one per country, and two in Switzerland because of its multiple languages.

The source system has a database for each web shop, and the customers who buy products are saved in the database their web shop depends on. This means that if I move to another country, I have to register again. I don't question the model – in BI we use what we get.

So how do we ingest the data? I evaluated extensively which business key would be best. As I was taught in my certification class, I tried to avoid using an ID, but failed to find a better solution. Finally, I went with shop_code and customer_id.

So, I created a hub with the following columns:

  • hub_customer_key
  • shop_code
  • customer_id
  • dss_load_datetime
  • dss_record_source
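As a small sketch, such a hub row could be built like this in Python – assuming hub_customer_key is a DV2-style hash key; the hashing recipe (trim, upper-case, delimit, MD5) is one common convention, not a standard:

```python
import hashlib
from datetime import datetime, timezone

def hub_customer_row(shop_code: str, customer_id: str, record_source: str) -> dict:
    """Build one hub_customer row (illustrative sketch, not a fixed standard)."""
    # Normalize the business key parts and join them with a delimiter
    # before hashing -- a common, but project-specific, convention.
    key_input = "|".join(part.strip().upper() for part in (shop_code, customer_id))
    return {
        "hub_customer_key": hashlib.md5(key_input.encode("utf-8")).hexdigest(),
        "shop_code": shop_code,
        "customer_id": customer_id,
        "dss_load_datetime": datetime.now(timezone.utc),
        "dss_record_source": record_source,
    }

row = hub_customer_row("CH-DE", "1234", "webshop_ch_de")
```

The exact normalization rules (trimming, casing, the delimiter) must be identical in every loading pipeline, otherwise the same business key hashes to different keys.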

After developing our model for the shop's source system, I started to load data from our email campaign provider. It also uses an ID as primary key, but the better business key is the email address.

For each shop we have an account. It was easy to reuse the existing business key columns:

  • shop_code = account (same as web shop)
  • customer_id = email address

The only thing I had to change was the data type of customer_id from INT to NVARCHAR. Everything went smoothly.

(Of course, I could have chosen to create another hub and name it hub_recipients, but I didn't.)

On top of that I built a same-as-link to travel from one system to the other.

Another source system is on the horizon, but it doesn't match the existing business key definition.

So, this is the first Hub-lemma.

Hub-lemma 1: Composite vs Concatenated Business Key

Although the hub definition matches, the business key looks different across source systems. How should we model and ingest them?

Option 1: Create separate hubs

If the existing business key definition for a hub doesn't match, create separate hubs – e.g. a hub_customer:

  • shop_code
  • customer_id

and a hub_recipients (as mentioned above):

  • email

The advantage is that you have a clear view and your design doesn't conflict with the existing structure.

Option 2: Concatenate business key

Instead of saving each business key column in its own column, some advocate concatenating the business keys into one column with a delimiter in between, e.g. DE|1234 for shop_code|customer_id.

The advantage is that a hub always looks the same: a hash key and ONE business key column. Great for automation.

The disadvantage is that you need to load the individual business key columns into a separate satellite. Otherwise they are not reachable, and you end up decomposing the concatenated hub business key at query time, which is slow and error-prone.
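A minimal Python sketch of this option (names are mine); the guard in the concatenation shows why decomposition only stays unambiguous if the delimiter can never occur inside a key part:

```python
SEP = "|"

def concat_business_key(*parts: str) -> str:
    """Concatenate business key columns into ONE hub column."""
    for part in parts:
        # If the delimiter can appear inside a part, the concatenated
        # key can no longer be split back unambiguously.
        if SEP in part:
            raise ValueError(f"separator {SEP!r} not allowed in {part!r}")
    return SEP.join(parts)

def split_business_key(key: str) -> list[str]:
    """Decompose the concatenated key back into its parts (the slow path)."""
    return key.split(SEP)

bk = concat_business_key("DE", "1234")  # -> "DE|1234"
```

In practice this guard (or an escaping rule) belongs in the loading pipeline, not in the query layer.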

Hub-lemma 2: Overlapping business keys

We have now found ways to load „not compatible“ business keys into hubs. But what happens if, all of a sudden, a business key of a hub overlaps in meaning? The easiest example: customer 1 of source system A is somebody else than customer 1 of source system B.

This „threat“ has been nagging me since the beginning.

Option 1: Separate by connection

Maybe you remember my last blog post about structuring files? Some of these elements can be reused.

If you have identically structured databases from one source system, use the connection name as a separator.

As you read above, I found the same answer on my own journey: I added shop_code to separate my customers from each other. That way the customers don't overlap.

Option 2: Separate by source system

If the source systems are really disparate and you believe no better business key can be found, then adding a source system name to the business key could be an option.

I don't like this option. If I add the source system name to everything as a preventive measure, the whole concept of data integration with Data Vault is reduced to absurdity and everything becomes over-complicated.

Option 3: Multi tenancy

What is multi-tenancy? The Wikipedia description, adapted to data, reads:

Multi-tenancy refers to a data architecture in which a single instance of a business key saved in a hub serves multiple tenants.

This is nothing other than adding, by design, an extra tenant column to every hub.

In contrast to the previous option, it's not about multiple source systems but about multiple customers, in case we sell our model to clients. We would have a management system to maintain the tenants, collect metadata about them, and track which of our offered features are enabled or disabled per customer.

Option 4: Separate hubs

Instead of combining everything into one hub, building separate hubs would be an option.

Instead of saying that the suffix is the source system name, I would rather say it is the „business key space“. If another source system reuses an existing business key (e.g. email), there is no need to add another source system name.

Option 5: Separate satellites

I must admit that I came up with this option long ago, when I was searching for solutions to the threat of overlapping business keys. I had a very strong belief in not having multiple hubs with the same meaning. I can't remember who implanted that into my thoughts. Probably me ;-). My focus was to have only a few well-defined hubs and integrate all data into them.

The naming scheme would be:

  • sat_{hub}_{space}_{content} (Raw + Business Satellites)
  • sat_{hub}_{content} (Unified golden record)

My understanding was that a business key by itself has no meaning; to give it meaning, I would always need to join a satellite to it. At a recent meetup I learnt that others believe a single hub entry is key and you should be able to join every satellite to it.

With that option, a hub business key can have a different meaning depending on the satellite I join. Usually I only join satellites from the same „business key space“ together. If I need to join data from another space, I create a same-as-link, combine the data sets from both sides, and use the hub twice.

The advantage is that it looks clean and tidy. I see immediately from the satellite’s name which data I can join together. And there is no discussion about hub suffixes.

The disadvantage of that design is that the objects get clumsy and you need to craft your joins wisely.

From today's standpoint, and with what I have learnt so far, maybe not the best design.

Hub-lemma 3: Surrogate Key

A surrogate key is an alternate key to access a record. Maybe you have seen, in dimensional modeling, a dimension key that is generated on a dimension and added to the facts for faster and easier lookups. That is a surrogate key.

Option 1: Data Vault „0“

Dan Linstedt’s definition of a hub is: a unique list of business keys

Nowhere is it written that a surrogate key is required to get a hub running.

In his certification course he told the story about the invention. The initial data vault model had no surrogate keys.

Option 2: Data Vault 1

At the beginning of this millennium, Dan Linstedt published his paper about Data Vault. In it he described the use of an INT surrogate key, similar to dimensional modeling.

This pattern is very fast for joining, but slow for loading.

First, hub business keys that don't exist yet need to be loaded. Then the newly created surrogate keys can be looked up to form the key for a link load, which in turn provides a new surrogate key to be used in satellites. Orchestrating this loading pattern seems quite challenging to me.
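To make the sequential dependency concrete, here is a toy Python sketch (class and variable names are my own, not a DV1 standard): the link row can only be formed after both hub loads have handed out their INT surrogates.

```python
class SequenceHub:
    """Toy hub with an INT sequence surrogate (DV1-style sketch)."""

    def __init__(self) -> None:
        self._keys: dict[tuple, int] = {}
        self._next = 1

    def load(self, business_key: tuple) -> int:
        # Insert only if the business key is new; always return the surrogate.
        if business_key not in self._keys:
            self._keys[business_key] = self._next
            self._next += 1
        return self._keys[business_key]

hub_customer = SequenceHub()
hub_product = SequenceHub()

# Step 1: hubs must load first, because they generate the surrogates...
c_key = hub_customer.load(("DE", "1234"))
p_key = hub_product.load(("SKU-42",))

# Step 2: ...and only then can the link row be formed from them.
link_row = (c_key, p_key)
```

The forced ordering of step 1 before step 2 is exactly what prevents parallel loading in this pattern.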

I have to admit that I don’t have hands-on experience with this. Just from documentation and books.

Option 3: Data Vault 2

The sequential loading issue was addressed when Dan published Data Vault 2: the INT surrogate key was replaced with a hash key.

Suddenly we are able to load all targets in parallel – a huge speed improvement.

The only caveat with hash keys is hash collisions. A hash collision occurs when two different business keys generate the same hash key. I don't have a real one at hand, but to make it clearer:

  • „ABC“ => 0A4561…
  • „ZXY“ => 0A4561…

In this fantasy model both keys generate the same hash value.

The chance that a hash collision will occur is very, very, very VERY small. A hash key is used within one hub, not system-wide, so a collision could only ever happen inside a single hub. I remember reading somewhere that to get a 50% chance of hitting a collision with MD5, I would need to insert 600,000 records per second into one hub for the next 50 years.
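A quick back-of-the-envelope check of that half-remembered claim, using the standard birthday approximation (my own calculation – it actually suggests the safety window is far larger than 50 years):

```python
import math

HASH_BITS = 128  # MD5 produces a 128-bit hash

# Birthday approximation: ~50% collision probability after roughly
# sqrt(2 * ln 2) * 2**(bits/2) distinct keys have been hashed.
n_for_50_percent = math.sqrt(2 * math.log(2)) * 2 ** (HASH_BITS / 2)

rows_per_second = 600_000
years_needed = n_for_50_percent / rows_per_second / (365 * 24 * 3600)
# years_needed comes out at roughly a million years at that insert rate.
```

Either way, the conclusion in the text stands: at realistic loading rates, an MD5 collision inside a single hub is not a practical concern.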

Not on my system. Okay, I’m safe to use it!

Option 4: Natural Key

Back to the roots! Data Vault „0“. This topic was raised in a blog post by Roelant Vos, and I just stumbled over a blog post by Dan about it.

As there is no strict need for any surrogate key, some want to move back to the origins. From a user's perspective this has some benefits: if I look into a satellite and see a meaningless INT value or hash key, I always need to join the hub to see the business keys.

If you go with the concatenated business key option from above, this makes quite a good combo: look into a satellite and immediately understand the business key.

Just a word of caution: joining on plain „text“ columns is not always safe, especially on databases with collations that translate and sort characters at their own discretion, e.g. é = e or ss = ß.

But who will have keys consisting of such strange characters anyway?!? I know, nobody… except us. We bid on keywords, and people use these strange characters. I'm pushing to be 100% safe in every case. If you want that too, create the business key columns with a binary collation. Then you are safe.
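Python can illustrate the difference: casefold mimics what a case-insensitive collation may do, while comparing the raw UTF-8 bytes corresponds to a binary collation.

```python
a = "strasse"
b = "straße"

# Linguistic comparison: 'ß' case-folds to 'ss', so the two strings
# match -- just like under a case-insensitive database collation.
linguistic_equal = (b.casefold() == a)

# Binary comparison: the raw UTF-8 bytes differ, so a binary
# collation keeps the two keys apart.
binary_equal = (b.encode("utf-8") == a.encode("utf-8"))

print(linguistic_equal, binary_equal)  # True False
```

Two business keys that a linguistic collation merges would silently collapse into one hub entry; a binary collation keeps them distinct.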

Performance-wise, joining on NVARCHAR(100) (up to 200 bytes) takes far more resources than joining on a 16-byte binary MD5 hash key or a 4-byte INT.


The Hub-lemma is a broad topic. Analyzing it gives us different options. Choose the patterns that help you most, and consider what you leave behind. There is no perfect solution. And if you ask for advice, everybody's favorite answer is „it depends“ ;-).