Jan 18 2020
At the last meetup I learnt that many people don’t know how to build a business vault satellite (or even that it exists!). Many are struggling to get the data into a Raw Data Vault. Persisting data from the source over time.
On the market are a few solutions, which build Data Vaults nearly automatically. According to some metadata fields from the source a data vault can be built quite easily. Some are spending a lot of time tuning model conversion rules or with data modelling. Don’t get me wrong, modelling is an important part! But when the core model is defined, matching source data to it shouldn’t take long.
Most of my time is spent on implementing business rules. Persisting data in a Raw Vault needs no time.
As you have seen with my other blog posts, I like patterns. Repeatable patterns. I have heard that there are rule engines and other fancy stuff out there. So far, I can’t imagine how such a rule engine could work. Maybe somebody can show me their solution.
Before we are going into the patterns, what kind of tools are available:
- Graphical interfaces such as SSIS, Informatica, Talend and others
- Programming outside SQL with Python, Scala, etc.
- Programming in SQL
I used to work with SSIS. Graphical interfaces are nice. Especially the look at how the data is streaming through each step is very intriguing.
But I found it difficult to use with repeatable patterns. Usually when I learnt something new about how to ingest data, I had to go through all existing loading pipelines and modify the thing I learnt.
To import data from source and building raw vaults definitely a nice way to go.
Programming outside SQL
I’m sure that Python and Scala have their use case and the patterns below can be easily adapted.
If the data platform of your choice has no compute nodes, then it is the default thing to take the data out, do something and save it again in the platform.
Programming in SQL
But as I work with SQL Server, there is no need to take the data out, do something and load it again. Just a waste of bandwidth and the compute power of the server is not used. But I understand that for bigger solutions other ways must be found and used.
When working with SQL I have several choices:
- Work with Stored Procedure aligning all calculation steps after each other
- Work with Common Table Expression and combining all calculation steps into one statement
- Work with Views/Tables building on each other
- Or a combination of all above
I use templates to create stored procedures doing all the necessary steps AFTER I create the base SQL query. Therefore, I don’t want to meddle with it by modifying. This would break the maintainability.
I love to work with Common Table Expressions. At least for drafts. Fantastic to build data sets upon each other and enhancing the calculation. But when it comes to debugging, it is very cumbersome to always rewrite the
SELECT statement at the end extracting data from each step.
Therefore, I like a combination of Views and Tables to separate each data set and calculation step. For debugging it is perfect. Sometimes I list all elements after each other and filter on a certain key to find out what it is doing in each step and if that is the right behavior.
Feel free to use your own method. Wanted just to share my experience.
Just a quick note: If I write about keys, I mean all possible keys in Data Vault: Hub Hash Key, Link Hash Key, Natural Keys, Business Key etc. It just defines the grain of data I’m looking at. Is it on sales order or sales item level? Or something else?
What patterns did I find and use nowadays:
- Define the target key
- Define the source key
- Get all source keys
- Get the data for the calculation
- All values, last values or Point-In-Time
- Define the LoadDate
- Get the data for the calculation
- Prepare Result Data Set
For easier understanding, let’s take an easy example. We have the following Data Vault Model:
link_sales_order_item(and feel free to read it also as
I have a raw vault satellite attached to the
hub_sales_order which has a value for
discount_amount. To calculate the margin on item level, I need to distribute this amount to all
Define the target key
Before I start any job, I need to define what the grain of my result set should be. Maybe we have not yet an idea, how the calculation should look like, but I have to make sure, what the target grain should be.
In this example it will be a satellite attached to
Define the source key
Apart from the calculation this is the most important step. What key do I need as a basis.
- What will be my source data key?
- What will be the grain of detail for the calculation?
- Does a „higher level“ key make more sense?
Sometimes it is the same grain as the target. Like when calculating the
vat_amount for each
In our case the level of
hub_customer would be too high. The level of
hub_sales would just be right. If just a SINGLE
sales_order_item changes, I want to recalculate everything for ALL
sales_order_items of a given order.
Get all source keys
When the grain of the source key is defined, I collect all source keys for which the calculation is needed.
I discovered that persisting that data into a table has the best performance.
Usually I define in my „delta“ object, what my target satellite will be. The template will look up the maximum LoadDate value from it and use that value to retrieve the newest data from the source.
Sometimes I add a time window by subtracting x days from the retrieved LoadDate to make sure that I don’t miss any beat. At the beginning of my journey I had the aspiration to have only real deltas and wondered how others are doing that when you have 2 source tables while 1 lags a little behind of time. With subtracting x days of the LoadDate this issue can be taken care of. The focus is on preparing the data, but if there is no change, it won’t load to the target.
And sometimes I have multiple satellites to gather my source keys from.
To have such a „bucket“ of keys is really a great thing. I try to find as many rules as possible why something needs an update. The goal is to get all possible keys for which I need to do a calculation or an update of old calculations. E.g. for calculating contribution margin 3, I need to add payment costs. Payment costs are not easily attributable. Therefore a factor is usually needed. If management decides to modify it later and also for old orders, what can I do? Either truncate the satellite and re-calculate it again or add a rule! I usually do the later. I don’t want to think about all actions others might do. It just has to work!
As we have now all keys to work with, we can start collecting data and do the calculation.
Get Data for Calculation
Identify all satellites, reference tables and other tables where you need to extract data from.
As all my calculations are views, I create sometimes views of all my source satellites to make it visible what data I’m extracting from where and what I need. Just for easier reading for others. And sometimes not.
Then I join all those data views to my source keys. Of course, there is sometimes the case, that I join a source view later in the calculation steps.
In my example I would join now
The result data set is now complete. All source keys are available, all needed data points for the calculation are there.
All values, last values or Point-In-Time
While joining the source data I need also to answer the question, if I need all values, last values or to create a mini Point-In-Time table.
If the source key is of the same grain as the target, for example when I clean strings of bad characters, I just take the source satellite and use each row in the calculation.
If the grain is different like in our example, I usually take the last value of the source satellites. For easier immediate access I create by default for all of my satellites last value views.
Of course, I’m not limited to the last values only. I have also other views: Last, First, Current (excluding deleted), Timeline (from timestamp till 9999-12-31), Infinite (from 0001-01-01 till 9999-12-31).
Sometimes my users request also values being calculated for each data point in the past. This means that I create on the fly a mini point-in-time table for all LoadDates since the last load.
At the beginning, I thought that I need now to create a PIT table. But maintaining a full-fledged PIT table is usually not possible. Sometimes I have multiple business rules sitting on each other. When I would create the first PIT for my source satellites, do I create a new PIT for the just updated business vault satellite? And what about the next business vault satellite? Another PIT including it? Or just updating the old one? Kind of stupid. Therefore, I calculate a PIT on the fly with only the LoadDates needed in that pipeline and only with the newest LoadDates.
For our example it is enough to extract the last known value.
Usually my calculation is based on one main raw vault satellite. Then I continue to take THAT LoadDate for my newly created business vault satellites data.
If I load from more than one satellite, I calculate sometimes the maximum value of ALL LoadDates for a given business key.
,( SELECT MAX(v) FROM ( VALUES (satellite_1.dss_load_datetime) ,(satellite_2.dss_load_datetime) ,(...) ) AS VALUE(v) ) AS dss_load_datetime
If the calculation like the current
age_group of a customer is so random and needs to get calculated daily, I use the current system datetime as LoadDate. If the age changes I need a new LoadDate, otherwise the new value can’t be loaded into the Satellite.
If the LoadDate is not defined correctly, updates for new results will not get imported into the target, as the primary key (hash key and LoadDate) from the old calculation already exists.
Do whatever you want. Sometimes it is a simple calculation, sometimes many steps with intermediate result sets.
In my case, I would create a view summarizing all
sales_order_items for each
In the next step I would use a calculation like
discount_amount to distribute it to each
If I need to round it to 2 digits, I will have also to take care about cent differences (sum of all
discount_amounts of all
sales_order_items versus the
sales_order level). I would sort the rows by
gross_sales_amount DESC and add the cent difference to that line with the biggest
Prepare Result Data Set
If the calculation is looking correct and I believe I have finished, I select those columns which should be loaded to the business vault satellite, calculate the
hash_diff and load the data set to the target. Sometimes I also calculate new
hash_keys to create same-as-links or other data vault objects.
Before continuing, take a break and congratulate yourself for the fantastic job you did. Of course, it is 100% error free.
As you have your results saved, think about ways how to ensure that the result is always correct. Here comes the Error Mart into play.
What can you check:
- Do the same task again but in a different way and compare both results
- Compare summary of amounts per key of source and target satellite (based on last view)
- Compare counts of rows per key of source and target satellite (based on last view)
- Define keys which should have a result and check if the keys in the business vault satellite exists
Make it a habit that each business vault satellite has at least 1 QA view.
In our case we collect them in a separate database. We placed it next to the reports database. In my opinion it should be rather named as error reports than error mart. And I have error reports for Data Stores, Data Vault, Data Mart and Exports.
The created QA view shows only the faulty keys or rows. Everyday a job is run to count the rows in each QA view and reports a summary for all checks. Then we take actions and fix the issues.
You have seen my way of building business vault satellites. I hope you agree with the main patterns. If you have more patterns, let me now.
I have heard of business rules engines, but newer seen one. If you have one, get in touch. I would be happy to learn more about it.
Here is also a picture of such a transfer (another case):