How to build a Business Vault Satellite

At the last meetup I learnt that many people don’t know how to build a business vault satellite (or even that it exists!). Many are still struggling to get the data into a Raw Data Vault, i.e. persisting data from the source over time.

There are a few solutions on the market which build Data Vaults nearly automatically. Based on a few metadata fields from the source, a Data Vault can be built quite easily. Some people spend a lot of time tuning model conversion rules or on data modelling. Don’t get me wrong, modelling is an important part! But once the core model is defined, matching source data to it shouldn’t take long.

Most of my time is spent on implementing business rules. Persisting data in a Raw Vault takes hardly any time.

As you have seen in my other blog posts, I like patterns. Repeatable patterns. I have heard that there are rule engines and other fancy stuff out there. So far, I can’t imagine how such a rule engine could work. Maybe somebody can show me their solution.

Developing

Before we go into the patterns, let’s look at what kinds of tools are available:

  • Graphical interfaces such as SSIS, Informatica, Talend and others
  • Programming outside SQL with Python, Scala, etc.
  • Programming in SQL

Graphical Interface

I used to work with SSIS. Graphical interfaces are nice. Especially watching how the data streams through each step is very intriguing.

But I found it difficult to use them with repeatable patterns. Usually when I learnt something new about how to ingest data, I had to go through all existing loading pipelines and apply what I had just learnt to each one.

For importing data from the source and building Raw Vaults they are definitely a nice way to go.

Programming outside SQL

I’m sure that Python and Scala have their use cases and that the patterns below can easily be adapted to them.

If the data platform of your choice has no compute nodes of its own, then taking the data out, doing something with it and saving it back to the platform is the default approach.

Programming in SQL

But as I work with SQL Server, there is no need to take the data out, do something and load it again. That would just waste bandwidth and leave the compute power of the server unused. But I understand that for bigger solutions other ways must be found and used.

When working with SQL I have several choices:

  • Work with Stored Procedures, executing all calculation steps one after the other
  • Work with Common Table Expressions, combining all calculation steps into one statement
  • Work with Views/Tables building on each other
  • Or a combination of all of the above

I use templates to create stored procedures that do all the necessary steps AFTER I have created the base SQL query. Therefore, I don’t want to meddle with the generated procedure by modifying it manually. That would break the maintainability.

I love to work with Common Table Expressions, at least for drafts. They are fantastic for building data sets on top of each other and enhancing the calculation step by step. But when it comes to debugging, it is very cumbersome to keep rewriting the SELECT statement at the end to extract the data from each step.

Therefore, I like a combination of Views and Tables to separate each data set and calculation step. For debugging it is perfect. Sometimes I list all elements one after another and filter on a certain key to find out what each step is doing and whether that is the right behavior.
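A minimal generic sketch of such a chain of views and the debugging queries (all object and column names here are placeholders, not from a real model):

-- Step 1: a view with the raw ingredients
CREATE VIEW dbo.v_calc_step1_base AS
SELECT hk_order, amount
FROM   dbo.sat_some_source;
GO

-- Step 2: a view building on step 1
CREATE VIEW dbo.v_calc_step2_summed AS
SELECT hk_order, SUM(amount) AS sum_amount
FROM   dbo.v_calc_step1_base
GROUP  BY hk_order;
GO

-- Debugging: query the steps one after the other, filtered on a certain key
DECLARE @hk CHAR(32) = '...';  -- the key I want to trace
SELECT * FROM dbo.v_calc_step1_base   WHERE hk_order = @hk;
SELECT * FROM dbo.v_calc_step2_summed WHERE hk_order = @hk;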

Feel free to use your own method. I just wanted to share my experience.

Patterns

Just a quick note: when I write about keys, I mean all possible keys in Data Vault: Hub Hash Key, Link Hash Key, Natural Key, Business Key etc. The key just defines the grain of data I’m looking at. Is it on sales order or sales item level? Or something else?

These are the patterns I found and use nowadays:

  • Define the target key
  • Define the source key
  • Get all source keys
  • Calculation
    • Get the data for the calculation
      • All values, last values or Point-In-Time
    • Define the LoadDate
    • Calculate
  • Prepare Result Data Set

Example

For easier understanding, let’s take a simple example. We have the following Data Vault model:

  • hub_customer
  • hub_sales_order
  • link_sales_order_item (and feel free to read it also as hub_sales_order_item)

I have a raw vault satellite attached to the hub_sales_order which has a value for discount_amount. To calculate the margin on item level, I need to distribute this amount to all sales_order_items.

Define the target key

Before I start any job, I need to define what the grain of my result set should be. Maybe we don’t yet have an idea of how the calculation should look, but I have to be sure what the target grain should be.

In this example it will be a satellite attached to link_sales_order_item.
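As a rough sketch, the target satellite could look something like this (the name bsat_sales_order_item_discount, the column names and the data types are just assumptions for illustration):

CREATE TABLE dbo.bsat_sales_order_item_discount (
     hk_sales_order_item   CHAR(32)       NOT NULL  -- link hash key = target grain
    ,dss_load_datetime     DATETIME2(7)   NOT NULL
    ,dss_record_source     VARCHAR(100)   NOT NULL
    ,hash_diff             CHAR(32)       NOT NULL
    ,discount_amount_item  DECIMAL(18,2)  NULL      -- the calculated business value
    ,CONSTRAINT pk_bsat_sales_order_item_discount
         PRIMARY KEY (hk_sales_order_item, dss_load_datetime)
);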

Define the source key

Apart from the calculation, this is the most important step. What key do I need as a basis?

  • What will be my source data key?
  • What will be the grain of detail for the calculation?
  • Does a "higher level" key make more sense?

Sometimes it is the same grain as the target, like when calculating the vat_amount for each sales_order_item.

In our case the level of hub_customer would be too high. The level of hub_sales_order would be just right. If just a SINGLE sales_order_item changes, I want to recalculate everything for ALL sales_order_items of the given order.

Get all source keys

When the grain of the source key is defined, I collect all source keys for which the calculation is needed.

I discovered that persisting that data into a table has the best performance.

Usually I define in my "delta" object what my target satellite will be. The template will look up the maximum LoadDate value from it and use that value to retrieve the newest data from the source.

Sometimes I add a time window by subtracting x days from the retrieved LoadDate to make sure that I don’t miss a beat. At the beginning of my journey I aspired to have only real deltas and wondered how others handle that when you have 2 source tables and 1 lags a little behind in time. Subtracting x days from the LoadDate takes care of this issue. The focus is on preparing the data; if there is no change, it won’t be loaded to the target anyway.
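A minimal sketch of such a key collection, assuming the target satellite from above and a persisted key table called work_sales_order_keys (both names made up):

-- Last LoadDate already present in the target business vault satellite
DECLARE @max_load_datetime DATETIME2(7) =
    (SELECT ISNULL(MAX(dss_load_datetime), '0001-01-01')
     FROM   dbo.bsat_sales_order_item_discount);

-- Optional safety window: go back x days to catch late-arriving source data
SET @max_load_datetime = DATEADD(DAY, -2, @max_load_datetime);

-- Persist the key bucket into a table
TRUNCATE TABLE dbo.work_sales_order_keys;

INSERT INTO dbo.work_sales_order_keys (hk_sales_order)
SELECT DISTINCT s.hk_sales_order
FROM   dbo.sat_sales_order_discount AS s
WHERE  s.dss_load_datetime > @max_load_datetime;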

And sometimes I have multiple satellites to gather my source keys from.

Having such a "bucket" of keys is really a great thing. I try to find as many rules as possible for why something needs an update. The goal is to get all possible keys for which I need to do a calculation or an update of old calculations. E.g. for calculating contribution margin 3, I need to add payment costs. Payment costs are not easily attributable, therefore a factor is usually needed. If management decides to modify it later, also for old orders, what can I do? Either truncate the satellite and re-calculate it again or add a rule! I usually do the latter. I don’t want to think about all the actions others might take. It just has to work!

Calculation

Now that we have all the keys to work with, we can start collecting data and do the calculation.

Get Data for Calculation

Identify all satellites, reference tables and other tables from which you need to extract data.

As all my calculations are views, I sometimes create views on top of all my source satellites to make it visible what data I’m extracting from where and what I need, just for easier reading for others. And sometimes I don’t.

Then I join all those data views to my source keys. Of course, sometimes I only join a source view later in the calculation steps.

In my example I would now join discount_amount from sat_sales_order_discount and gross_sales_amount from lsat_sales_order_item_amount.

The result data set is now complete. All source keys are available, all needed data points for the calculation are there.
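In SQL, this step could look roughly like this, joining the key bucket to last-value views of the satellites (the last-value views are described in the next section; all object names are illustrative):

CREATE VIEW dbo.v_discount_calc_input AS
SELECT  k.hk_sales_order
       ,i.hk_sales_order_item
       ,d.discount_amount
       ,a.gross_sales_amount
FROM   dbo.work_sales_order_keys              AS k
JOIN   dbo.link_sales_order_item              AS i ON i.hk_sales_order      = k.hk_sales_order
JOIN   dbo.sat_sales_order_discount_last      AS d ON d.hk_sales_order      = k.hk_sales_order
JOIN   dbo.lsat_sales_order_item_amount_last  AS a ON a.hk_sales_order_item = i.hk_sales_order_item;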

All values, last values or Point-In-Time

While joining the source data I also need to answer the question whether I need all values, the last values, or a mini Point-In-Time table.

If the source key is of the same grain as the target, for example when I clean strings of bad characters, I just take the source satellite and use each row in the calculation.

If the grain is different, like in our example, I usually take the last value of the source satellites. For easier immediate access I create last-value views for all of my satellites by default.

Of course, I’m not limited to the last values only. I also have other views: Last, First, Current (excluding deleted), Timeline (from timestamp till 9999-12-31), Infinite (from 0001-01-01 till 9999-12-31).
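A sketch of such a last-value view for our discount satellite (assuming dss_load_datetime as the LoadDate column):

CREATE VIEW dbo.sat_sales_order_discount_last AS
SELECT x.hk_sales_order, x.dss_load_datetime, x.discount_amount
FROM  (
        SELECT  s.*
               ,ROW_NUMBER() OVER (PARTITION BY s.hk_sales_order
                                   ORDER BY s.dss_load_datetime DESC) AS rn
        FROM   dbo.sat_sales_order_discount AS s
      ) AS x
WHERE  x.rn = 1;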

Sometimes my users also request values to be calculated for each data point in the past. This means that I create a mini Point-In-Time table on the fly for all LoadDates since the last load.

At the beginning, I thought that I would now need to create a PIT table. But maintaining a full-fledged PIT table is usually not possible. Sometimes I have multiple business rules sitting on top of each other. If I created the first PIT for my source satellites, would I then create a new PIT for the just updated business vault satellite? And what about the next business vault satellite? Another PIT including it? Or just update the old one? Kind of stupid. Therefore, I calculate a PIT on the fly with only the LoadDates needed in that pipeline and only with the newest LoadDates.
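A rough sketch of such an on-the-fly mini PIT for a single source satellite, reusing the @max_load_datetime variable from the key-collection sketch above (repeat the APPLY block per additional satellite; names are again illustrative):

WITH load_dates AS (
    -- all keys and LoadDates that appeared since the last load
    SELECT DISTINCT d.hk_sales_order, d.dss_load_datetime
    FROM   dbo.sat_sales_order_discount AS d
    WHERE  d.dss_load_datetime > @max_load_datetime
)
SELECT  ld.hk_sales_order
       ,ld.dss_load_datetime
       ,disc.discount_amount
FROM   load_dates AS ld
OUTER APPLY (
        -- the value valid at that LoadDate (latest row at or before it)
        SELECT TOP (1) s.discount_amount
        FROM   dbo.sat_sales_order_discount AS s
        WHERE  s.hk_sales_order    = ld.hk_sales_order
          AND  s.dss_load_datetime <= ld.dss_load_datetime
        ORDER  BY s.dss_load_datetime DESC
      ) AS disc;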

For our example it is enough to extract the last known value.

Define LoadDate

Usually my calculation is based on one main raw vault satellite. Then I just take THAT LoadDate for my newly created business vault satellite’s data.

If I load from more than one satellite, I sometimes calculate the maximum value of ALL LoadDates for a given business key:

-- column expression in the SELECT list: the newest LoadDate across all joined satellites
,(
    SELECT MAX(v)
    FROM  (
             VALUES
                     (satellite_1.dss_load_datetime)
                    ,(satellite_2.dss_load_datetime)
                    ,(...)
          ) AS VALUE(v)
) AS dss_load_datetime

If the calculation, like the current age and age_group of a customer, is time-dependent and needs to be calculated daily, I use the current system datetime as LoadDate. If the age changes, I need a new LoadDate, otherwise the new value can’t be loaded into the satellite.

If the LoadDate is not defined correctly, new results will not get imported into the target, as the primary key (hash key and LoadDate) from the old calculation already exists.

Calculate

Do whatever you want. Sometimes it is a simple calculation, sometimes many steps with intermediate result sets.

In my case, I would create a view summarizing all gross_sales_amounts of sales_order_items for each sales_order.

In the next step I would use a calculation like gross_sales_amount / sum_gross_sales_amount * discount_amount to distribute it to each sales_order_item.

If I need to round it to 2 digits, I also have to take care of cent differences (the sum of all discount_amounts of all sales_order_items versus the discount_amount on sales_order level). I would sort the rows by gross_sales_amount DESC and add the cent difference to the line with the biggest gross_sales_amount.
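Put together, a sketch of this calculation could look like this, building on the input view sketched earlier (not the exact production code):

CREATE VIEW dbo.v_sales_order_item_discount AS
WITH order_sums AS (
    SELECT  hk_sales_order
           ,SUM(gross_sales_amount) AS sum_gross_sales_amount
    FROM   dbo.v_discount_calc_input
    GROUP  BY hk_sales_order
)
,distributed AS (
    SELECT  i.hk_sales_order
           ,i.hk_sales_order_item
           ,i.gross_sales_amount
           ,i.discount_amount
           ,ROUND(i.gross_sales_amount / NULLIF(o.sum_gross_sales_amount, 0)
                  * i.discount_amount, 2) AS discount_amount_item
    FROM   dbo.v_discount_calc_input AS i
    JOIN   order_sums                AS o ON o.hk_sales_order = i.hk_sales_order
)
SELECT  d.hk_sales_order
       ,d.hk_sales_order_item
       ,d.discount_amount_item
        -- add the cent difference to the row with the biggest gross_sales_amount
        + CASE WHEN ROW_NUMBER() OVER (PARTITION BY d.hk_sales_order
                                       ORDER BY d.gross_sales_amount DESC) = 1
               THEN d.discount_amount
                    - SUM(d.discount_amount_item) OVER (PARTITION BY d.hk_sales_order)
               ELSE 0
          END AS discount_amount_item
FROM   distributed AS d;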

Prepare Result Data Set

If the calculation looks correct and I believe I am finished, I select the columns which should be loaded to the business vault satellite, calculate the hash_diff and load the data set to the target. Sometimes I also calculate new hash keys to create same-as-links or other Data Vault objects.

Before continuing, take a break and congratulate yourself for the fantastic job you did. Of course, it is 100% error free.

Bonus Task

As you have your results saved, think about ways to ensure that the result is always correct. This is where the Error Mart comes into play.

What can you check:

  • Do the same task again but in a different way and compare both results
  • Compare the sums of amounts per key between source and target satellite (based on the last-value views)
  • Compare the row counts per key between source and target satellite (based on the last-value views)
  • Define keys which should have a result and check whether those keys exist in the business vault satellite
  • Etc.

Make it a habit that each business vault satellite has at least 1 QA view.
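A simple QA view for our example could compare the discount_amount on sales_order level with the sum of the distributed item amounts, based on the last-value views (names are again illustrative):

CREATE VIEW dbo.qa_bsat_sales_order_item_discount AS
SELECT  src.hk_sales_order
       ,src.discount_amount
       ,tgt.sum_discount_amount_item
FROM   dbo.sat_sales_order_discount_last AS src
JOIN  (
        SELECT  l.hk_sales_order
               ,SUM(b.discount_amount_item) AS sum_discount_amount_item
        FROM   dbo.bsat_sales_order_item_discount_last AS b
        JOIN   dbo.link_sales_order_item               AS l
               ON l.hk_sales_order_item = b.hk_sales_order_item
        GROUP  BY l.hk_sales_order
      ) AS tgt ON tgt.hk_sales_order = src.hk_sales_order
WHERE  src.discount_amount <> tgt.sum_discount_amount_item;  -- show only the faulty keys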

In our case we collect them in a separate database, placed next to the reports database. In my opinion it should rather be named error reports than Error Mart. And I have error reports for Data Stores, Data Vault, Data Mart and Exports.

The created QA view shows only the faulty keys or rows. Every day a job runs to count the rows in each QA view and reports a summary for all checks. Then we take action and fix the issues.

Summary

You have seen my way of building business vault satellites. I hope you agree with the main patterns. If you have more patterns, let me know.

I have heard of business rules engines, but never seen one. If you have one, get in touch. I would be happy to learn more about it.

Here is also a picture of such a transfer (another case):