Jan 18 2020
How to build a Business Vault Satellite
At the last meetup I learnt that many people don’t know how to build a business vault satellite (or even that it exists!). Many are still struggling to get their data into a Raw Data Vault, i.e. persisting data from the source over time.
There are a few solutions on the market which build Data Vaults nearly automatically: based on some metadata fields from the source, a Data Vault can be generated quite easily. Some people spend a lot of time tuning model conversion rules or on data modelling. Don’t get me wrong, modelling is an important part! But once the core model is defined, matching source data to it shouldn’t take long.
Most of my time is spent on implementing business rules. Persisting data in a Raw Vault takes hardly any time at all.
As you have seen with my other blog posts, I like patterns. Repeatable patterns. I have heard that there are rule engines and other fancy stuff out there. So far, I can’t imagine how such a rule engine could work. Maybe somebody can show me their solution.
Developing
Before we go into the patterns, here are the kinds of tools available:
- Graphical interfaces such as SSIS, Informatica, Talend and others
- Programming outside SQL with Python, Scala, etc.
- Programming in SQL
Graphical Interface
I used to work with SSIS. Graphical interfaces are nice. Especially watching how the data streams through each step is very intriguing.
But I found them difficult to use with repeatable patterns. Usually when I learnt something new about how to ingest data, I had to go through all existing loading pipelines and apply what I had learnt to each of them.
For importing data from the source and building raw vaults they are definitely a nice way to go.
Programming outside SQL
I’m sure that Python and Scala have their use case and the patterns below can be easily adapted.
If the data platform of your choice has no compute nodes, then the default approach is to take the data out, do something with it and save it back to the platform.
Programming in SQL
But as I work with SQL Server, there is no need to take the data out, do something and load it again. That would just waste bandwidth and leave the compute power of the server unused. But I understand that for bigger solutions other ways must be found and used.
When working with SQL I have several choices:
- Work with stored procedures, executing all calculation steps one after another
- Work with Common Table Expressions, combining all calculation steps into one statement
- Work with Views/Tables building on each other
- Or a combination of all of the above
I use templates to create stored procedures that perform all the necessary steps AFTER I have created the base SQL query. Therefore, I don’t want to meddle with the generated procedure by modifying it manually; that would break maintainability.
I love to work with Common Table Expressions. At least for drafts. They are fantastic for building data sets on top of each other and enhancing the calculation step by step. But when it comes to debugging, it is very cumbersome to constantly rewrite the SELECT statement at the end to extract data from each step.
Therefore, I like a combination of Views and Tables to separate each data set and calculation step. For debugging it is perfect. Sometimes I query all the elements one after another and filter on a certain key to find out what happens in each step and whether that is the right behavior.
Feel free to use your own method. I just wanted to share my experience.
Patterns
Just a quick note: If I write about keys, I mean all possible keys in Data Vault: Hub Hash Key, Link Hash Key, Natural Keys, Business Key etc. It just defines the grain of data I’m looking at. Is it on sales order or sales item level? Or something else?
These are the patterns I have found and use nowadays:
- Define the target key
- Define the source key
- Get all source keys
- Calculation
  - Get the data for the calculation
  - All values, last values or Point-In-Time
  - Define the LoadDate
  - Calculate
- Prepare Result Data Set
Example
For easier understanding, let’s take a simple example. We have the following Data Vault model:
- hub_customer
- hub_sales_order
- link_sales_order_item (and feel free to read it also as hub_sales_order_item)
I have a raw vault satellite attached to the hub_sales_order which has a value for discount_amount. To calculate the margin on item level, I need to distribute this amount to all sales_order_items.
Define the target key
Before I start any job, I need to define what the grain of my result set should be. Maybe we don’t yet have an idea of how the calculation should look, but I have to be clear about what the target grain should be.
In this example it will be a satellite attached to link_sales_order_item.
Define the source key
Apart from the calculation itself, this is the most important step. What key do I need as a basis?
- What will be my source data key?
- What will be the grain of detail for the calculation?
- Does a „higher level“ key make more sense?
Sometimes it is the same grain as the target, like when calculating the vat_amount for each sales_order_item.
In our case the level of hub_customer would be too high. The level of hub_sales_order would be just right. If just a SINGLE sales_order_item changes, I want to recalculate everything for ALL sales_order_items of a given order.
Get all source keys
When the grain of the source key is defined, I collect all source keys for which the calculation is needed.
I discovered that persisting that data into a table has the best performance.
Usually I define in my „delta“ object what my target satellite will be. The template will look up the maximum LoadDate value from it and use that value to retrieve the newest data from the source.
Sometimes I add a time window by subtracting x days from the retrieved LoadDate to make sure that I don’t miss a beat. At the beginning of my journey I had the aspiration to load only real deltas and wondered how others manage that when you have 2 source tables and one lags a little behind in time. Subtracting x days from the LoadDate takes care of this issue. The focus is on preparing the data; if there is no change, it won’t be loaded to the target anyway.
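A minimal sketch of that delta lookup, with hypothetical names (biz.bsat_sales_order_item_discount as the target business vault satellite, raw.sat_sales_order_discount as the source, a delta table as the key bucket, and x = 3 days):

DECLARE @delta_from datetime2;

-- maximum LoadDate already in the target, minus a safety window of x days
SELECT @delta_from = ISNULL(DATEADD(DAY, -3, MAX(dss_load_datetime)), '0001-01-01')
FROM biz.bsat_sales_order_item_discount;

-- persist all source keys that need a (re)calculation
INSERT INTO biz.delta_sales_order (hub_sales_order_hk)
SELECT DISTINCT s.hub_sales_order_hk
FROM raw.sat_sales_order_discount AS s
WHERE s.dss_load_datetime > @delta_from;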
And sometimes I have multiple satellites to gather my source keys from.
To have such a „bucket“ of keys is really a great thing. I try to find as many rules as possible for why something needs an update. The goal is to get all possible keys for which I need to do a calculation or update old calculations. E.g. for calculating contribution margin 3, I need to add payment costs. Payment costs are not easily attributable, therefore a factor is usually needed. If management decides to modify that factor later, and also for old orders, what can I do? Either truncate the satellite and recalculate it again, or add a rule! I usually do the latter. I don’t want to have to think about all the actions others might take. It just has to work!
Calculation
Now that we have all keys to work with, we can start collecting data and doing the calculation.
Get Data for Calculation
Identify all satellites, reference tables and other tables where you need to extract data from.
As all my calculations are views, I sometimes create views on all my source satellites to make it visible what data I’m extracting, from where, and what I need. Just for easier reading for others. And sometimes I don’t.
Then I join all those data views to my source keys. Of course, sometimes I only join a source view later in the calculation steps.
In my example I would now join discount_amount from sat_sales_order_discount and gross_sales_amount from lsat_sales_order_item_amount.
The result data set is now complete. All source keys are available, all needed data points for the calculation are there.
All values, last values or Point-In-Time
While joining the source data I also need to answer the question whether I need all values, the last values, or a mini Point-In-Time table.
If the source key is of the same grain as the target, for example when I clean strings of bad characters, I just take the source satellite and use each row in the calculation.
If the grain is different, like in our example, I usually take the last value of the source satellites. For easier immediate access I create last value views for all of my satellites by default.
Of course, I’m not limited to the last values only. I also have other views: Last, First, Current (excluding deleted), Timeline (from timestamp till 9999-12-31), Infinite (from 0001-01-01 till 9999-12-31).
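As a sketch, such a last value view over the discount satellite could look like this (schema and view names are just illustrative):

CREATE VIEW raw.sat_sales_order_discount_last AS
SELECT hub_sales_order_hk,
       dss_load_datetime,
       discount_amount
FROM (
    SELECT s.*,
           -- newest row per key comes first
           ROW_NUMBER() OVER (PARTITION BY s.hub_sales_order_hk
                              ORDER BY s.dss_load_datetime DESC) AS rn
    FROM raw.sat_sales_order_discount AS s
) AS last_rows
WHERE rn = 1;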
Sometimes my users also request values to be calculated for each data point in the past. This means that I create a mini point-in-time table on the fly for all LoadDates since the last load.
At the beginning, I thought I would now need to create a PIT table. But maintaining a full-fledged PIT table is usually not practical. Sometimes I have multiple business rules sitting on top of each other. If I create the first PIT for my source satellites, do I then create a new PIT for the just-updated business vault satellite? And what about the next business vault satellite? Another PIT including it? Or do I just update the old one? Kind of pointless. Therefore, I calculate a PIT on the fly with only the LoadDates needed in that pipeline and only with the newest LoadDates.
For our example it is enough to extract the last known value.
Define LoadDate
Usually my calculation is based on one main raw vault satellite. Then I simply take THAT satellite’s LoadDate for my newly created business vault satellite’s data.
If I load from more than one satellite, I sometimes calculate the maximum value of ALL LoadDates for a given business key:
,( SELECT MAX(v)
   FROM ( VALUES (satellite_1.dss_load_datetime)
               , (satellite_2.dss_load_datetime)
               , (...)
        ) AS VALUE(v)
 ) AS dss_load_datetime
If the calculation, like the current age and age_group of a customer, is volatile and needs to be calculated daily, I use the current system datetime as the LoadDate. If the age changes I need a new LoadDate, otherwise the new value can’t be loaded into the satellite.
If the LoadDate is not defined correctly, updates for new results will not get imported into the target, as the primary key (hash key and LoadDate) from the old calculation already exists.
Calculate
Do whatever you want. Sometimes it is a simple calculation, sometimes many steps with intermediate result sets.
In my case, I would create a view summarizing all gross_sales_amounts of sales_order_items for each sales_order.
In the next step I would use a calculation like gross_sales_amount / sum_gross_sales_amount * discount_amount to distribute it to each sales_order_item.
If I need to round it to 2 digits, I also have to take care of cent differences (the sum of all discount_amounts of all sales_order_items versus the discount_amount on sales_order level). I would sort the rows by gross_sales_amount DESC and add the cent difference to the line with the biggest gross_sales_amount.
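A sketch of that distribution, assuming last value views over the item amounts and the order discount (schema, view and hash key column names are placeholders):

WITH order_totals AS (
    -- sum of all item amounts per order
    SELECT i.hub_sales_order_hk,
           SUM(i.gross_sales_amount) AS sum_gross_sales_amount
    FROM biz.lsat_sales_order_item_amount_last AS i
    GROUP BY i.hub_sales_order_hk
), distributed AS (
    -- distribute the order discount proportionally to each item, rounded to 2 digits
    SELECT i.link_sales_order_item_hk,
           i.hub_sales_order_hk,
           i.gross_sales_amount,
           d.discount_amount,
           ROUND(i.gross_sales_amount / NULLIF(t.sum_gross_sales_amount, 0) * d.discount_amount, 2) AS item_discount_amount
    FROM biz.lsat_sales_order_item_amount_last AS i
    JOIN order_totals AS t
      ON t.hub_sales_order_hk = i.hub_sales_order_hk
    JOIN raw.sat_sales_order_discount_last AS d
      ON d.hub_sales_order_hk = i.hub_sales_order_hk
)
SELECT link_sales_order_item_hk,
       -- the row with the biggest gross_sales_amount absorbs the cent difference
       item_discount_amount
       + CASE WHEN ROW_NUMBER() OVER (PARTITION BY hub_sales_order_hk
                                      ORDER BY gross_sales_amount DESC) = 1
              THEN discount_amount
                   - SUM(item_discount_amount) OVER (PARTITION BY hub_sales_order_hk)
              ELSE 0
         END AS discount_amount
FROM distributed;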
Prepare Result Data Set
If the calculation is looking correct and I believe I have finished, I select those columns which should be loaded to the business vault satellite, calculate the hash_diff and load the data set to the target. Sometimes I also calculate new hash_keys to create same-as-links or other data vault objects.
Before continuing, take a break and congratulate yourself for the fantastic job you did. Of course, it is 100% error free.
Bonus Task
Now that you have your results saved, think about ways to ensure that the result is always correct. This is where the Error Mart comes into play.
What can you check:
- Do the same task again but in a different way and compare both results
- Compare summary of amounts per key of source and target satellite (based on last view)
- Compare counts of rows per key of source and target satellite (based on last view)
- Define keys which should have a result and check if those keys exist in the business vault satellite
- Etc.
Make it a habit that each business vault satellite has at least 1 QA view.
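For our example, such a QA view could look roughly like this (schema and view names are illustrative, and I assume the target’s last value view also exposes the hub_sales_order_hk via the link):

CREATE VIEW qa.sales_order_item_discount_check AS
SELECT d.hub_sales_order_hk,
       d.discount_amount      AS order_discount_amount,
       SUM(i.discount_amount) AS sum_item_discount_amount
FROM raw.sat_sales_order_discount_last AS d
JOIN biz.bsat_sales_order_item_discount_last AS i
  ON i.hub_sales_order_hk = d.hub_sales_order_hk
GROUP BY d.hub_sales_order_hk, d.discount_amount
-- show only the orders where the distributed item discounts drift away from the order discount
HAVING SUM(i.discount_amount) <> d.discount_amount;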
In our case we collect them in a separate database, placed next to the reports database. In my opinion it should rather be called error reports than error mart. And I have error reports for Data Stores, Data Vault, Data Mart and Exports.
The created QA view shows only the faulty keys or rows. Every day a job runs that counts the rows in each QA view and reports a summary of all checks. Then we take action and fix the issues.
Summary
You have seen my way of building business vault satellites. I hope you agree with the main patterns. If you have more patterns, let me know.
I have heard of business rules engines, but never seen one. If you have one, get in touch. I would be happy to learn more about it.
Apr 11 2020
SQL Server locking methods or the fight against dead locks
After the last Index blog post, here is another technical issue I recently thought I had solved. At least maybe for the next year, until I find another solution.
As you know, I dropped all indexes in my data warehouse and therefore also the unique constraints on my tables. The application should make sure that everything is fine, but I have a regular report which checks if the constraints are still intact.
Lately, after working on a project with a huge amount of data, I occasionally got hit with duplicate rows in my hubs and links.
This was the reason to investigate further why it happened. Although I had all the measures below in place, it still happened. Maybe unique constraint indexes have additional functions when loading data into a table. I must admit, I’m not the expert in that area. I’m educated by mistakes and experience. Therefore, this is a little summary of my lessons learnt.
Depending on the number of queries running at the same time with SELECT, UPDATE and DELETE, SQL Server decides, which data is locked, read, waited for, overwritten or inserted. There are some methods to influence its decision making.
SET TRANSACTION ISOLATION LEVEL
Working with transactions is usually something for applications that have SQL Server as a backend, to ensure that data spanning multiple table manipulations is written correctly to the database. In the data warehousing area I neglected working with transactions for some time, as I was usually transforming data from source to target and hardly ever inserting data into the same table in parallel. With Data Vault, however, that is a rather common pattern.
The transaction is written like that:
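Sketched with a hypothetical hub_customer load (table and column names are placeholders):

BEGIN TRANSACTION;

-- insert only keys that don't exist yet
INSERT INTO dbo.hub_customer (hub_customer_hk, customer_id, dss_load_datetime, dss_record_source)
SELECT DISTINCT stg.hub_customer_hk, stg.customer_id, stg.dss_load_datetime, stg.dss_record_source
FROM dbo.stg_customer AS stg
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.hub_customer AS hub
                  WHERE hub.hub_customer_hk = stg.hub_customer_hk)
OPTION (RECOMPILE);

COMMIT TRANSACTION;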
By default I add OPTION(RECOMPILE) after fighting against bad execution plans.
This was the layout I had in place when I discovered the duplicate entries. My search for a solution led me to learn more about Transaction Isolation Levels. SQL Server has a few of them:
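- READ UNCOMMITTED
- READ COMMITTED (the default)
- REPEATABLE READ
- SNAPSHOT
- SERIALIZABLE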
I don’t want to explain all the nifty details about each option. There are plenty of resources on Google.
I suspected that although I had all those table locks (see below) in place, data got changed between an INSERT and its SELECT statement. Especially in fact loading, where dimension keys are looked up. Sounds weird, but I couldn’t help suspecting it.
Many websites state that SERIALIZABLE is the most secure option, but it has a serious disadvantage – it leads to deadlocks. A deadlock occurs when SQL Server tries to find out what to do next in a safe way: sometimes 2 queries want access to the same resource, but queuing won’t help as other resources are blocked too. SQL Server terminates 1 query, telling it that it is a „victim of deadlock“. With restartable jobs this isn’t a big issue. But still bad.
I added the following statement to my scripts:
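For example, with the strictest level:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;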
All following transactions are then executed with that isolation level.
I was hit with a lot of deadlock errors in the following weeks. But I got the duplicates issue in hubs and links solved.
Hints
Hints are another way to influence SQL Server’s decision engine. There are a lot of hints which we can add to a query. Some affect the whole query by manipulating the execution plan (query hints) and some affect the queried table (table hints).
Query hints are added at the end with OPTION(something). As you know, by default I add RECOMPILE to it to enforce getting the best plan for every data set instead of recycling old plans. Sometimes I add FORCE ORDER or HASH JOIN to it, depending on performance.
To combat my issue, I rather use table hints. My favorite is TABLOCKX. By default, SQL Server tries to lock as little as possible, sometimes only pages or rows of a table. But I want to be sure that no other query sneaks into a running query and manipulates something in between. TABLOCK will try to grab a „shared“ lock on that table, while the X makes it an exclusive lock.
My statement usually looked like that:
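Again sketched with the hypothetical hub_customer load:

INSERT INTO dbo.hub_customer WITH (TABLOCKX) (hub_customer_hk, customer_id, dss_load_datetime, dss_record_source)
SELECT DISTINCT stg.hub_customer_hk, stg.customer_id, stg.dss_load_datetime, stg.dss_record_source
FROM dbo.stg_customer AS stg
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.hub_customer AS hub
                  WHERE hub.hub_customer_hk = stg.hub_customer_hk)
OPTION (RECOMPILE);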
But somehow, I still got duplicates in my hubs and links. How could that be? It must be something with the SELECT query. My SELECT query also has a part in it that checks the very same table I’m inserting data into for records that already exist. Could it be that the INSERT locks are not applied to the SELECT’s tables, although it is the same table?
Reading through other blogs I have seen the following recipe:
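Sketched, the lookup against the target table gets a hint of its own (placeholder names as before):

INSERT INTO dbo.hub_customer WITH (TABLOCKX) (hub_customer_hk, customer_id, dss_load_datetime, dss_record_source)
SELECT DISTINCT stg.hub_customer_hk, stg.customer_id, stg.dss_load_datetime, stg.dss_record_source
FROM dbo.stg_customer AS stg
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.hub_customer AS hub WITH (UPDLOCK)   -- lock the lookup on the target table as well
                  WHERE hub.hub_customer_hk = stg.hub_customer_hk)
OPTION (RECOMPILE);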
UPDLOCK acquires an update lock, telling the system that I intend to modify the data, although in fact I won’t modify it. I just don’t want other queries to modify the rows while I’m reading them.
Maybe this is the way to go, maybe not. It’s hard to find something useful in the data warehousing space, as usually all the questions asked are about applications and record-level issues.
While we are talking about table hints, I ran into another issue. In my pattern collection I have 3 different insert methods, each optimized for fast loading. One is for the initial load. When the table is empty, I don’t need to retrieve the last record, e.g. when loading satellites, to compare hash values and decide whether it is a change or not. If the table is empty, just pump the data in.
So, what happens? The first query will check if there is a record in the table. Then it will run query 1 or query 2. But what happens in the split second between the 2 queries? Or maybe there is another INSERT already running? As the first query runs in READ COMMITTED, it won’t find a record and tells the initial insert not to check for existing data.
We must modify it a little by adding table hints.
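A sketch of that check (with hypothetical procedure names for the two insert variants):

BEGIN TRANSACTION;

IF NOT EXISTS (SELECT 1 FROM dbo.hub_customer WITH (UPDLOCK, HOLDLOCK))
BEGIN
    EXEC dbo.load_hub_customer_initial;   -- hypothetical: initial insert without the existence check
END
ELSE
BEGIN
    EXEC dbo.load_hub_customer_delta;     -- hypothetical: delta insert with the NOT EXISTS check
END

COMMIT TRANSACTION;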
The UPDLOCK will wait until any running insert has finished and then sets an update lock. If the query is run with only UPDLOCK, it will check for the record and release the lock immediately again. The HOLDLOCK will keep the lock until the transaction is committed. So the first statement sets a lock which is kept until the end.
The final layout looked like that:
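Putting it all together for the hypothetical hub_customer load:

SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN TRANSACTION;

IF NOT EXISTS (SELECT 1 FROM dbo.hub_customer WITH (UPDLOCK, HOLDLOCK))
BEGIN
    -- initial load: table is empty, no check against existing records
    INSERT INTO dbo.hub_customer WITH (TABLOCKX) (hub_customer_hk, customer_id, dss_load_datetime, dss_record_source)
    SELECT DISTINCT stg.hub_customer_hk, stg.customer_id, stg.dss_load_datetime, stg.dss_record_source
    FROM dbo.stg_customer AS stg
    OPTION (RECOMPILE);
END
ELSE
BEGIN
    -- delta load: only keys that don't exist yet
    INSERT INTO dbo.hub_customer WITH (TABLOCKX) (hub_customer_hk, customer_id, dss_load_datetime, dss_record_source)
    SELECT DISTINCT stg.hub_customer_hk, stg.customer_id, stg.dss_load_datetime, stg.dss_record_source
    FROM dbo.stg_customer AS stg
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.hub_customer AS hub WITH (UPDLOCK)
                      WHERE hub.hub_customer_hk = stg.hub_customer_hk)
    OPTION (RECOMPILE);
END

COMMIT TRANSACTION;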
I believe there are way too many locks. And again, I had a lot of deadlock issues to deal with.
Is there not something easier to work with? Maybe something like a queue?
Application Lock
There is another method in SQL Server to work with: the application lock. It is completely independent of any table locks.
With this lock we can invent our own locking mechanism by defining an arbitrary name for a resource. As my procedures load data to a target, I use the target name as the resource. It has nothing to do with the underlying table and doesn’t affect it.
How to set a lock:
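For example, for a resource named 'table':

EXEC sp_getapplock
     @Resource    = 'table',
     @LockMode    = 'Exclusive',
     @LockOwner   = 'Session',
     @LockTimeout = -1;   -- wait in the queue until the lock is granted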
With this statement we request a virtual lock for the resource named ‚table‘. If any other query or stored procedure issues the same statement, it gets into the queue and waits until the first lock is released.
There are 2 levels to define these locks:
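- Transaction (the default)
- Session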
If we define @LockOwner = 'Transaction', the lock gets released after the commit.
But as I could have multiple transactions in one script, I prefer to specify it on session level. The whole script then runs through until I release the lock with the following statement at the end:
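Again for the example resource 'table':

EXEC sp_releaseapplock
     @Resource  = 'table',
     @LockOwner = 'Session';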
My full script looks like that:
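A sketch for the hypothetical hub_customer target, with the target name as the lock resource:

-- acquire a session-level application lock named after the target object
EXEC sp_getapplock
     @Resource    = 'hub_customer',
     @LockMode    = 'Exclusive',
     @LockOwner   = 'Session',
     @LockTimeout = -1;   -- wait in the queue until the lock is granted

BEGIN TRANSACTION;

INSERT INTO dbo.hub_customer (hub_customer_hk, customer_id, dss_load_datetime, dss_record_source)
SELECT DISTINCT stg.hub_customer_hk, stg.customer_id, stg.dss_load_datetime, stg.dss_record_source
FROM dbo.stg_customer AS stg
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.hub_customer AS hub
                  WHERE hub.hub_customer_hk = stg.hub_customer_hk)
OPTION (RECOMPILE);

COMMIT TRANSACTION;

-- release the application lock once the object is loaded
EXEC sp_releaseapplock
     @Resource  = 'hub_customer',
     @LockOwner = 'Session';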
Summary
With this method I create locks on all data manipulation objects. The queue completes query after query, and nothing sneaks into other workflows. When I finish an object, I release the lock and everything is good.
The focus is really to have a queue for every INSERT or UPDATE on an object. SELECT queries or JOINs don’t need to issue this lock; the table is free to read. By the way: with the table hint method above there is also a hint named NOLOCK which should disregard any existing locks. But when I use it, I seldom get any data returned until all locks are released.
With application locks I could also move back to the READ COMMITTED isolation level.
Another side effect is that other, non-related queries are no longer blocked. Even though a TABLOCKX is only for a specific table, I have experienced locks on other transactions and even in other databases. Lately I ran a test: I stopped the whole execution engine on my test machine and then started all jobs at the same time. I have never seen so much traffic on my databases, and the CPU was at 100%.
By ecki • Business Intelligence