Apr 7 2018
Universal Loading Pattern
While working on the templates I published recently, I came across similar-looking patterns between object types. E.g. the persisted history of a satellite looks the same as in a persistent staging area (except for the primary key, of course). Or loading a dimension is like loading a hub (just that hubs have no change hash), so a dimension loading pattern would nearly work for hubs.
I concluded that instead of having 7 templates (one for each object type), 2 would be enough: one for stage and one for any target. They all share similar patterns.
While compiling all the patterns, I came to the conclusion that even 1 template would be enough. A universal loading pattern. Usable for any target type!
As we are loading data with stored procedures generated by WhereScape, read the following sections as sub-queries: each layer adds value to the result set. This also makes maintaining the template code much easier, and SQL Server will find the best way to execute them.
I separated them into stage and target loading for easier reading. But in reality, it is just one big data flow.
Stage Loading Process
What would the loading pattern for a stage object look like?
Only little has changed from my first draft.
Extract & Transform
The only user-generated logic is defined here: a simple SELECT statement and some additional metadata about what to save on top, e.g. hash keys, change hashes or dimension keys. Everything afterwards is automated and serves data preparation.
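As a minimal, hypothetical example (all table and column names are made up, they are not part of the templates), such an extract could look like this:

```sql
-- Hypothetical extract & transform query: plain business logic only.
-- Hash keys, change hashes and dimension keys are added by the later layers.
SELECT
    lc.customer_code                 AS customer_code,      -- business key
    UPPER(LTRIM(RTRIM(lc.country)))  AS country_code,       -- a simple transformation
    lc.customer_name                 AS customer_name,
    lc.modified_at                   AS dss_event_datetime,
    lc.dss_load_datetime             AS dss_load_datetime
FROM load_customer AS lc;
```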
Delta Loading
[themify_box icon=“highlight“]Target Object Name is defined[/themify_box]
A stage table can optionally have a link to its target. With that in place, delta loading is possible and will add the following WHERE clause:
WHERE dss_load_datetime > (SELECT COALESCE(MAX(dss_load_datetime),'0001-01-01') FROM [Target])
CAST to Target Data Type
[themify_box icon=“highlight“]Has Hash Columns[/themify_box]
Hashing is a beast of its own. It looks trivial, but can still be tricky. Just recently we figured out that we would need to add another step in between: casting to the target data type. E.g. the source column is of data type datetime and the target column is date. Of course, nobody does that ;-)! Hashing without casting would mean hashing e.g. ‚2018-03-08 12:13:14‘ instead of just the date ‚2018-03-08‘, and any rehashing later will fail. Of course, it should have been taken care of in the transformation, but other examples can be found. This step simply turns the invisible implicit conversion into a visible explicit conversion before the hashing step.
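A minimal sketch of that explicit cast, staying with the hypothetical columns from above (the previous layer is referenced here as if it were a named object; in the generated procedure it is a nested sub-query):

```sql
-- Make the implicit conversion explicit before hashing:
-- hash the value as it will be stored in the target (date),
-- not as it arrives from the source (datetime).
SELECT
    e.customer_code,
    e.country_code,
    e.customer_name,
    CAST(e.dss_event_datetime AS date) AS dss_event_date,  -- explicit cast to the target data type
    e.dss_load_datetime
FROM extract_layer AS e;
```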
Another way to model it is to create a temporary table with the same data types as the stage table and persist the result before going into the hashing function. In delta loading this is faster anyway, because SQL Server usually comes up with odd execution plans otherwise.
Clean Business Key for Hash Keys (other Business Keys are not touched)
[themify_box icon=“highlight“]Has Hash Columns[/themify_box]
There are many options out there. I decided to go with the following:
Data Type | Cleaning Rule |
---|---|
CHAR Data Types | ISNULL(NULLIF(UPPER(LTRIM(RTRIM(source_column))),''),'-1') |
INT Data Types | ISNULL(source_column,-1) |
DATE Data Types | ISNULL(source_column,'0001-01-01') |
If a CHAR column is empty, I set it to NULL, and every NULL column is set to '-1'. Before that, any leading and trailing spaces are removed and the value is set to upper case.
If an INT column is NULL, I set it to -1. 0 (zero) is a valid value, so there is no need to treat it as NULL.
Compute Hash Keys and Change Hash
[themify_box icon=“highlight“]Has Hash Columns[/themify_box]
Hash Column | Function |
---|---|
Hash Key | HASHBYTES('md5', … |
Change Hash | HASHBYTES('md5', … |
Casting everything to NVARCHAR(MAX) is the easiest approach. NVARCHAR has the best implicit conversion from any data type to string. Don't ever cast to VARCHAR(MAX) only: as long as there is no true UTF-8 character set (with variable byte length) for VARCHAR, stick with UTF-16 in NVARCHAR. The results look fine; only DATETIME and one other type (which I can't remember) look ugly. But as I use only DATETIME2, I don't come across that.
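Putting the cleaning rules and the NVARCHAR(MAX) casting together, the hash computation could look roughly like this (a sketch with the hypothetical columns from above; the concatenation and delimiter handling in the real templates may differ):

```sql
SELECT
    c.*,
    -- Hash key over the cleaned business key
    HASHBYTES('md5',
        CAST(ISNULL(NULLIF(UPPER(LTRIM(RTRIM(c.customer_code))), ''), '-1') AS nvarchar(max))
    ) AS hub_customer_key,
    -- Change hash over the descriptive attributes, everything cast to NVARCHAR(MAX);
    -- CONCAT treats NULLs as empty strings
    HASHBYTES('md5', CONCAT(
        CAST(c.customer_name  AS nvarchar(max)), N'||',
        CAST(c.country_code   AS nvarchar(max)), N'||',
        CAST(c.dss_event_date AS nvarchar(max))
    )) AS dss_change_hash
FROM cast_layer AS c;
```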
Dimension Lookup
[themify_box icon=“highlight“]Has Dimension Columns[/themify_box]
Just looking up the surrogate key.
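A sketch of that lookup, assuming a hypothetical dim_customer with surrogate key dim_customer_key and a 0 row for unknown members:

```sql
SELECT
    h.*,
    ISNULL(d.dim_customer_key, 0) AS dim_customer_key   -- 0 = unknown member
FROM hash_layer AS h
LEFT JOIN dim_customer AS d
    ON d.customer_code = h.customer_code;
```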
Row Condensing
[themify_box icon=“highlight“]Business Key is defined[/themify_box]
Removing true duplicates. Be careful here. REALLY! Wait until you actually hit a duplicate and talk to the owner of your source system first! Only implement it if they can't fix it.
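If the source really can't be fixed, the condensing itself is straightforward. A sketch that keeps only the last record per business key, continuing the hypothetical names from above:

```sql
SELECT *
FROM (
    SELECT
        l.*,
        ROW_NUMBER() OVER (
            PARTITION BY l.customer_code                 -- business key
            ORDER BY l.dss_load_datetime DESC,
                     l.dss_event_date DESC
        ) AS dss_row_no
    FROM lookup_layer AS l
) AS condensed
WHERE dss_row_no = 1;
```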
Load
Preparing everything from the previous steps and sorting the columns into the right order for the INSERT statement.
Post-Process Update Dimension Key
[themify_box icon=“highlight“]Has Dimension Columns[/themify_box]
If a business key in the stage table for a fact object doesn't exist in the dimension, we can add it, if that is the desired outcome.
After INSERTing the missing business key into the dimension, we need to update the stage table.
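A sketch of that post-process, assuming a hypothetical fact stage table stage_sales and the dim_customer from above (with an identity surrogate key): first insert the missing business keys as placeholder rows, then update the stage table with the newly generated surrogate keys.

```sql
-- 1) Insert business keys that are missing in the dimension (placeholder rows).
INSERT INTO dim_customer (customer_code, customer_name, dss_load_datetime)
SELECT st.customer_code, 'Unknown', MAX(st.dss_load_datetime)
FROM stage_sales AS st
WHERE NOT EXISTS (
    SELECT 1
    FROM dim_customer AS d
    WHERE d.customer_code = st.customer_code
)
GROUP BY st.customer_code;

-- 2) Update the stage table with the surrogate keys created above.
UPDATE st
SET st.dim_customer_key = d.dim_customer_key
FROM stage_sales AS st
JOIN dim_customer AS d
  ON d.customer_code = st.customer_code
WHERE st.dim_customer_key = 0;   -- only rows that pointed to the unknown member
```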
Target Loading Process
The target loading can be as simple as a short INSERT statement.
What are the differences between the target object types?
- Persist the timeline calculation? (e.g. having an EndDate in the table or calculating it through views)
- Update existing rows or insert as new rows? (e.g. Dimension vs. Satellite)
- Is it allowed to insert „in-between“ data? (e.g. Satellite)
Row Condensing
This is where the real magic happens, and where every target type is different. Now is the right time to figure out whether we need to persist all the rows (see the sketch after the table below).
Target Type | Rule |
---|---|
Stage | We need only the last record by Business Key (if it is specified) |
Data Store | We need only changes in rows sorted by LoadDate |
Hub | We need only the first record by Hub Hash Key |
Link | We need only the first record by Link Hash Key |
Satellite | We need only changes in rows sorted by LoadDate and sorted by EventDate. |
Dimension | Dimensions without SCD should be already unique in stage. No condensing. |
Fact | It should be already unique by Business Key in stage. No condensing. |
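As an example of the satellite rule (we only need changes, sorted by LoadDate), a sketch using LAG to drop rows whose change hash equals that of the previous row per hash key, again with the hypothetical names from the stage examples:

```sql
SELECT *
FROM (
    SELECT
        s.*,
        LAG(s.dss_change_hash) OVER (
            PARTITION BY s.hub_customer_key
            ORDER BY s.dss_load_datetime
        ) AS previous_change_hash
    FROM stage_customer AS s
) AS x
WHERE x.previous_change_hash IS NULL                  -- first row per key
   OR x.previous_change_hash <> x.dss_change_hash;    -- or a real change
```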
Compare last saved row with first new row
[themify_box icon=“highlight“]Has StartDate Column[/themify_box]
An additional step is to compare the last saved row with the first new row. This happens only in tables with a timeline column, usually satellites or history tables.
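A sketch of that comparison, continuing the hypothetical satellite sat_customer and the condensed stage set from above: the oldest incoming row per key is dropped if the newest persisted row already carries the same change hash.

```sql
SELECT s.*
FROM condensed_stage AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_customer AS t
    WHERE t.hub_customer_key = s.hub_customer_key
      AND t.dss_change_hash  = s.dss_change_hash
      -- the newest row already saved for this key ...
      AND t.dss_load_datetime = (SELECT MAX(t2.dss_load_datetime)
                                 FROM sat_customer AS t2
                                 WHERE t2.hub_customer_key = s.hub_customer_key)
      -- ... compared against the oldest incoming row for this key
      AND s.dss_load_datetime = (SELECT MIN(s2.dss_load_datetime)
                                 FROM condensed_stage AS s2
                                 WHERE s2.hub_customer_key = s.hub_customer_key)
);
```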
The result after this step is that the data set is now clean, useful and, from a data point of view, ready to be inserted into the target.
History Columns
[themify_box icon=“highlight“]Has History Columns[/themify_box]
The history columns can be deferred to a view. Then we have an INSERT-only system, which can be a huge performance gain. The question is whether we can wait that long for the timeline calculation in the view to finish. If not, we should persist the history columns.
Column names for the first timeline (the calculation is based on the LoadDate):
- dss_start_datetime
- dss_end_datetime
- dss_version
- dss_current_flag
Column names for the second timeline (the calculation is based on the EventDate):
- dss_effectivity_datetime
- dss_expiry_datetime
For templates, not all columns are needed. Just add those you need.
I know this process fails if we are talking about bi-temporal objects. I first need to learn how to build and use them.
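For the simple case, a sketch of how the first timeline could be calculated in a view over an INSERT-only satellite (column names as listed above, table names hypothetical):

```sql
CREATE VIEW sat_customer_history AS
SELECT
    s.*,
    s.dss_load_datetime AS dss_start_datetime,
    -- a row is valid until the next load of the same key (open-ended for the newest row)
    ISNULL(LEAD(s.dss_load_datetime) OVER (
               PARTITION BY s.hub_customer_key
               ORDER BY s.dss_load_datetime),
           CAST('9999-12-31' AS datetime2(7)))           AS dss_end_datetime,
    ROW_NUMBER() OVER (PARTITION BY s.hub_customer_key
                       ORDER BY s.dss_load_datetime)     AS dss_version,
    CASE WHEN LEAD(s.dss_load_datetime) OVER (
                  PARTITION BY s.hub_customer_key
                  ORDER BY s.dss_load_datetime) IS NULL
         THEN 'Y' ELSE 'N' END                           AS dss_current_flag
FROM sat_customer AS s;
```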
Load
This step brings all the data from the previous steps together and adds the columns that were missing in the extract layer.
Where (Not) Exists
This step was the most difficult one. Not because of detecting what exists and what doesn't, but because of where to place it.
If I put it right after the extract layer, it would be impossible to do a full load: all existing records would be removed from the data set and the history calculation would therefore be rubbish. There are also problems in delta load. What about loading data in between LoadDates? The whole row condensing step would produce wrong results.
What about placing it after the row condensing layer? That seems to me the best place. I could save time calculating the history layer, as I'm not going to update an existing row anyway. But I would need to UPDATE the timeline after the INSERT, and any single UPDATE costs me 10 times more than a proper INSERT.
And where do we have history calculations anyway? Only with satellites and the data store. Therefore, it can be placed as late as possible.
What is new and what is existing? These are the rules for detecting it (a sketch for a satellite follows the table):
Target Type | Definition |
---|---|
Data Store | Business Key, Load Date and Sequence No |
Hub | Hub Hash Key and Business Key (the unique constraint will tell if there is a hash collision) |
Link | Link Hash Keys, Hub Hash Keys and peg-legged Business Keys (the unique constraint will tell if there is a hash collision) |
Satellite | Hub Hash Key or Link Hash Key and LoadDate |
Dimension | Business Key |
Fact | Business Key |
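For a satellite, the rule from the table translates into something like this (a sketch, continuing the hypothetical names from above):

```sql
SELECT s.*
FROM condensed_stage AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_customer AS t
    WHERE t.hub_customer_key  = s.hub_customer_key    -- same key ...
      AND t.dss_load_datetime = s.dss_load_datetime   -- ... already loaded at this point in time
);
```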
Which objects could have an UPDATE statement? Maybe not only in delta load but also in full load?
Target Type | Definition |
---|---|
Data Store | No, there can't be any changes in the data. Otherwise the LoadDate is used wrongly. |
Hub | No. If it is of interest when a business key became available, we can update the LoadDate if an earlier one shows up. |
Link | No. If it is of interest when a business key became available, we can update the LoadDate if an earlier one shows up. |
Satellite | No. If by accident there are different Change Hashes for a given record which already exists, the whole calculation in stage must have changed. Then the whole satellite should be reloaded instead. |
Dimension | Yes! |
Fact | Yes! |
For facts, DELETEing the record could be a faster option than UPDATEing it.
Everything else is going to be INSERTed.
Post-Process Update Timeline
[themify_box icon=“highlight“]Has History Columns for a timeline (StartDate only is not a timeline)[/themify_box]
In this step we update the timelines of our target. By joining to the stage table, we only need to update the changed timelines per Business Key/Driving Key.
At first we also had a Pre-Process Update step in mind, in which we would expire the old data. This works if we run the INSERT batch by batch. But as we don't know whether there will be any inserts at all after the „Row Condensing“ layer, we must omit that pre-processing step.
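A sketch of such a timeline update, again with the hypothetical satellite and stage table from above: the end date and current flag are recomputed only for the keys touched by this load.

```sql
UPDATE t
SET t.dss_end_datetime = ISNULL(x.next_load_datetime, CAST('9999-12-31' AS datetime2(7))),
    t.dss_current_flag = CASE WHEN x.next_load_datetime IS NULL THEN 'Y' ELSE 'N' END
FROM sat_customer AS t
JOIN (
    -- recompute the timeline only for keys that appear in the stage table
    SELECT
        s.hub_customer_key,
        s.dss_load_datetime,
        LEAD(s.dss_load_datetime) OVER (
            PARTITION BY s.hub_customer_key
            ORDER BY s.dss_load_datetime) AS next_load_datetime
    FROM sat_customer AS s
    WHERE EXISTS (SELECT 1 FROM stage_customer AS st
                  WHERE st.hub_customer_key = s.hub_customer_key)
) AS x
  ON x.hub_customer_key  = t.hub_customer_key
 AND x.dss_load_datetime = t.dss_load_datetime;
```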
Conclusion
Above you see two images, one for stage and one for target loading. Actually, in my data flow they are not separated; it is just one huge SELECT statement incorporating every rule. Of course, empty layers are not shown.
Now I can just throw any data for any target type at it. By designing it in layers, I'm flexible in adding new rules as I discover them. I'm also able to move each layer into a separate temporary table to speed up processing, if I figure out that this would be faster.
Writing down my thinking gave clarity about the desired outcome. If you want to discuss this model or want to share your experience with me, please get in touch.
Jun 2 2018
Love-hate relationship with WhereScape
[Update see below]
Well, I don't know if this is the right way to do it. But I want to share my experience with others.
First, I want to say that WhereScape enabled me to reach my goals faster. At least now it does. It took a year to get up to full speed, because I first had to discover how it works and what is important to me. The first sentence from the sales consultants was: „We can do whatever you want!“.
You have to take that quite literally! Don't underestimate it. It is a blessing that you can shape this tool to your liking. It is a curse that it takes time and comes with no blueprint attached. Well, WhereScape has its own templates, which are filled with options accumulated over the years from user requests. Either I click through the whole process over and over again, or I develop my own templates to remove the manual work.
Object creation
It is a dream to design transfers. Create the first load object and then drag the objects or the columns from one step to the next, e.g. Load → Stage → Data Store → Stage → Satellite → Stage → Dimension → Stage → Export. So easy with drag & drop, from one or multiple source tables. And then just build the right JOIN statements and an optional WHERE and GROUP BY clause. The objects are created with the right data types with a mouse click. Stored procedures are created without further input. Indexes are added automatically if needed. I just have to worry about implementing the business logic.
But getting there took us a long time. And it has some limitations.
32 bit vs 64 bit
The application still runs in 32 bit. For a metadata management platform this is okay. It gets nasty when it comes to execution.
My loading works with SSIS, ODBC or PowerShell. With ODBC I have to use a 32-bit ODBC driver. Calculate for yourself how much memory that leaves for a data transfer from a database. Depending on row size, crashes happen at around 50 to 100 million records.
PowerShell is another beast. Windows has 3 versions of it: the 32-bit version in the 64-bit Windows directory and vice versa, and then there is a third version as well. Figuring out which one works in combination with WhereScape for a given job is tiresome. And if you need to extend PowerShell with more functionality, most modules are 64 bit. So I have to start a 32-bit PowerShell, switch to a 64-bit version and only then start executing.
Object name length
This limitation is really bad! RED has a limit of 64 characters for any object like tables, columns, indexes, stored procedures, etc. Only „parameters“ (a kind of global variable) have a length of 256 characters, but the system can address only 64 of them?!?
The 64-character limit applies across all objects. So if I create a table name with 64 characters, I still get errors because other object names are built by appending to that string:
update_{object name}
{object name}_idx_BK
With this calculation I have only 57 characters left.
In 3D the limit is 1024 characters. Enough space to evolve.
I hate abbreviated object names. Nobody except me is going to understand them. If you think of something like stage_{satellite}_{source system}_{source table}_{calculation step}, you can see that I might run out of space easily. And when I need to compose a multi-source mapping and add src_{hub}_ to the front, it runs out even faster.
3D to RED deployment
As described, there is a discrepancy of 1024 vs. 64 characters for object names. When bringing objects back to RED, object names are cut off at 64 characters. Custom indexes are already cut off at 40 characters. I don't know why.
So I need to take care about object naming inside 3D. 3D is a nice program to design in. But if I have to watch object name lengths for later use in RED, it gets cumbersome.
My current aim is to get a rough idea in 3D and then build the transfers directly in RED. With my customized templates this is as fast as working in 3D.
Object defaults
In 3D, I can set default column names with default data types. This is not possible in RED. E.g. I improved performance by switching from CHAR(32) to BINARY(16). Or, all of my date system columns should be datetime2(7). Not possible to set. Unfortunately, until recently I had to modify them manually or use a QA query that regularly modifies the metadata in the database. When I started writing this blog entry, I thought: why not use triggers for that?!? Well, after such a long time without it, I now have that in place.
In my opinion, users shouldn't interfere with the application database directly. This is really, really bad behavior, but unfortunately necessary.
Pebble
Pebble is the templating language WhereScape introduced. The intention was right: give the user the flexibility to develop their own transfers. It has taken me a year to develop fail-proof transfers that map industry standards. It is a trade-off, swapping manual time for development time. But it is satisfying when I gain more insight into how things should work.
With successful usage, the appetite grows to incorporate more manual work into the templates. That's when the problems start to appear. Pebble doesn't scale with complexity. Macros suddenly stop running, variables get forgotten while compiling. If you don't check each and every creation, suddenly a bad stored procedure is there. Very unstable. Then time is wasted finding alternative ways to accomplish the same output. Another developer would question my bad design. But it has to work and I can't wait!
For small and easy templates without logic, Pebble runs well.
Object Types
I like the split into every type of table I need: Stage, Data Store, Hub, Link, Satellite, Dimension, Fact, Export, etc. I love the concept of having a load object for getting outside data into the data warehouse. This is far more logical than having a stage table directly and loading that content to the target.
Stage is my workhorse. It is there to prepare the data from target to target. Load is also a target.
7 object types = 7 templates to rule them all. But bringing logic into them was not easy. A long and steep learning curve.
Stage Table and Stage View
See the Key Features graphic on their website: https://www.wherescape.com/solutions/automation-software/wherescape-red/
You can see which options there are. A Stage View doesn't exist!
All target objects like Stage, Data Store, Hub and Satellite are tables. And there is one additional object type called View; this is the only view.
But when developing a transfer, I don't want to recreate the object as a view just to find out whether it runs faster. If I have a stage table, I want to be able to switch between a view and a table.
Objects are grouped by target type, so switching between the stage menu node and the view menu node is cumbersome.
There is an option to create a DDL statement, and I developed a template for that. But once an object is a view, it stays a view. I can't switch back to a table without manually going to the database and dropping the view.
Data Vault
WhereScape is proud of their Data Vault skills, but it has taken them a very long time to get it right. Adding columns to a HashKey or ChangeHash is a dream!
The defaults run okay. But they forgot that hubs are loaded from multiple stage tables, and until just recently I had to work around that by creating fake objects and using fancy templates. And in my opinion, links are not built correctly: the unique constraint over the Hub Hash Keys is not enforced.
Advanced forms don't work.
Meta Data
The application collects all the metadata necessary to implement an object. It is really an open application with nearly total freedom. You are absolutely free to implement bugs and stuff that doesn't work.
At first, every team member evolved their own development style. Through discussions and documentation we agreed on some principles and a layout for our objects. But things get forgotten or don't get checked.
The need arose to write some QA queries to check for issues. This proved to be very unreliable, as there are too many bits and pieces to configure for each object.
This is the downside of having total freedom. Other programs, which guide you through the development steps, might be easier and more consistent to work with.
Jobs
When developing, I create tables and stored procedures. Through lineage I see which objects depend on others. This is really a great feature.
When adding a job based on my work, I click the needed objects together. That should be the end of it. But… I have to group them by execution layer and set the dependencies manually?!? By default it maps EVERY object from one layer to EVERY object in the next layer, and then I need to delete the obsolete mappings. I mean, the system knows the dependencies of each object. Why the hell do I need to teach the program that again?!? I see the need for a manual override, but the first draft should be generated automatically.
Another issue: I can make a job depend on another job. This is great and runs well. But sometimes, especially when developing, I want to force-start that job. If the job it depends on didn't run, this job will not run either. Either I delete the dependency or I add another copy of the same job without a dependency. The latter is what we currently do.
Support
The support is based in New Zealand. If I have an issue that I want to show somebody, I have to schedule a slot here in Europe between the evening and midnight.
Usually this is not necessary; the help provided is fast. For bugs and feature requests you receive a number, and then the issue is forgotten. Sometimes you can tell from the low number that it has been known for years. The development focus is on delivering fancy new features rather than ironing out old bugs.
Every release brings newly introduced bugs in things that worked before. Then I'm told they will fix it in the next release. Release cycles are roughly 9 to 12 months.
Support for Pebble doesn't exist. That's somewhat understandable, as this is an area that would attract a lot of user questions and tie up resources. And I believe they have already experienced themselves that Pebble is unpredictable and don't want to waste time on it. Or maybe we are too small and they would rather keep the big players happy. I don't know.
Implementation
I envy those companies that are able to implement WhereScape in one week. The success stories of WhereScape are full of them. Maybe they sent in a full team of success engineers to get them up to full speed, or they did it themselves.
Our success engineer was great, helping us here and there with the application's default functionality. Maybe our „We can do whatever you want“ was too complicated. I can't imagine how others do it. I mean, I have only 7 object types. Loading e.g. a Fact or a Data Store should be the same everywhere. My aim was to get standards implemented accurately, automatically and without too much manual work. And that has taken us a lot of time!
Final Words
I love WhereScape for its flexibility. I hate it because I have to (re)invent everything; its defaults are too „free“ and only basic. I love the drag & drop of columns. I love it for creating tables, views and stored procedures instantly. I can't do without it anymore. I hate it for its limitations and bugs. I hate/love the job engine. I love it for the speed of deploying objects to production. I hate Pebble.
I love/hate WhereScape!
[Update 09.12.2019]
I see that this blog post sometimes sits at the top of Google searches. In the meantime I have to revise my view of WhereScape.
Some of the limitations above are still true. But overall I love WhereScape. With my perfectly tuned templates we are terribly fast at developing data products. The effect is that we don't maintain any release cycles anymore. Sure, we could do that, but my users love the rather prompt help I can offer: describe the issue, show a draft, finish it up.
I can't imagine working with any box-moving application like SSIS anymore. Rather write the query, select the template and it is done. And then move on to the next issue.