In my previous post I gave a rough overview of a data warehouse and some data modeling styles. In this post I want to discuss how I came to my conclusion about how to ingest my data and serve it to my users.
So what we know for sure is that we have some sources and that the data products we produce are data marts. Which steps in between make sense?
In a Kimball-style data warehouse, you would have the following steps in between:
- Staging Area (multi-layered to conform data)
- Data Mart
Well, that was my first approach, and it was somewhat successful. I reached my target quickly and was able to present data immediately. The problems started when I had to consolidate data from many different source systems: customer 1 in system A was not the same as customer 1 in system B. I had to redesign my loading procedures over and over again.
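To make the problem concrete, here is a minimal sketch in Python (the system names, keys and customer records are made up) of why the same key from two systems cannot simply be merged:

```python
# Minimal sketch of the consolidation problem (all names and keys are made up).
# "Customer 1" exists in both systems but refers to two different real-world customers.
system_a = {1: {"name": "Acme GmbH", "city": "Berlin"}}
system_b = {1: {"name": "Windmill Ltd.", "city": "London"}}

# A naive merge by source key silently collapses them into one record:
naive = {**system_a, **system_b}
print(naive[1]["name"])  # only "Windmill Ltd." survives, "Acme GmbH" is gone

# A conformed dimension needs a cross-reference that maps
# (source system, source key) to one warehouse-wide key instead:
xref = {
    ("A", 1): 1001,  # Acme GmbH
    ("B", 1): 1002,  # Windmill Ltd.
}
print(xref[("A", 1)], xref[("B", 1)])  # two different customers, as it should be
```

Maintaining exactly this kind of mapping inside the loading procedures is what I had to redesign over and over again.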
Dan Linstedt came to the rescue with Data Vault. His steps are:
- Staging Area
- Raw Vault
- Business Vault
- Data Mart
The advantage is that the whole system is auditable, as it focuses on preserving all data (unmodified) over time and translating it into whatever the business rules require. For that you need to know which business objects (hubs) the business knows and wants to talk about. An IT-focused BI developer will often translate business objects straight into their source systems, which is not the same thing: that would be a Source Vault, which has no business benefit.
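To illustrate what that preservation looks like, here is a minimal, simplified sketch of a Customer hub and satellite. The table and column names, and the MD5 hash key, are my own assumptions for illustration, not an official Data Vault template:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


def hash_key(*business_keys: str) -> str:
    """Hash of the business key(s); MD5 is just an example choice."""
    return hashlib.md5("|".join(business_keys).upper().encode()).hexdigest()


@dataclass(frozen=True)
class HubCustomer:
    hub_customer_hk: str   # hash of the business key
    customer_number: str   # the business key the business talks about
    load_dts: datetime     # when the key was first seen
    record_source: str     # which system delivered it


@dataclass(frozen=True)
class SatCustomerDetails:
    hub_customer_hk: str   # reference to the hub
    load_dts: datetime     # every change becomes a new row, nothing is overwritten
    record_source: str
    name: str
    city: str


hk = hash_key("1001")
hub = HubCustomer(hk, "1001", datetime.now(timezone.utc), "CRM")
sat = SatCustomerDetails(hk, datetime.now(timezone.utc), "CRM", "Acme GmbH", "Berlin")
```

The hub only knows the business key; all descriptive attributes live in satellites that are inserted, never updated, which is what keeps the history auditable.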
I don’t have any consultants at hand, and I haven’t built ten Data Vaults yet. I’m still learning how to build one, and I don’t want to rely on my modeling techniques (yet). I want the freedom to rebuild my Raw Vault and Business Vault if I make a mistake.
This leads to Roelant Vos, a very experienced Data Vault modeler and crazy data developer. He came up with the concept of a Virtual Data Warehouse, which has the following steps:
- Persistent Staging Area
- Raw Vault
- Business Vault
- Data Mart
His main base is the Persistent Staging Area. On top of it he uses metadata to model virtualized data vaults and data marts. That gives him the freedom to run many similar data vaults in parallel to discover which model performs best, with data marts on top that are also fully virtualized. For that you need hardware fast enough to handle it; of course, he can persist some of the heavy views to tables.
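To get a feel for what modeling with metadata means, here is a minimal sketch of the idea (my own simplified metadata format and SQL Server-style SQL, not Roelant Vos’ actual tooling): one metadata entry is rendered into a view that virtualizes a hub directly on top of the PSA.

```python
# Sketch of metadata-driven virtualization: one metadata entry -> one hub view.
# Table, column and function names (HASHBYTES is SQL Server) are assumptions.
hub_metadata = {
    "target": "rv.HUB_CUSTOMER",
    "source": "psa.CRM_CUSTOMER",
    "business_key": "CUSTOMER_NUMBER",
    "record_source": "CRM",
}


def render_hub_view(meta: dict) -> str:
    return f"""
CREATE VIEW {meta['target']} AS
SELECT
    HASHBYTES('MD5', UPPER({meta['business_key']})) AS HUB_CUSTOMER_HK,
    {meta['business_key']},
    MIN(LOAD_DTS)                                   AS LOAD_DTS,
    '{meta['record_source']}'                       AS RECORD_SOURCE
FROM {meta['source']}
GROUP BY {meta['business_key']};
""".strip()


print(render_hub_view(hub_metadata))
```

Change the metadata, re-run the generator, and you have a different vault on the same PSA; that is what makes running several candidate models in parallel cheap.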
For me this also has the advantage that I can start loading any data I might need later, before the modeling and the transformations are clear. And I can start from scratch if I find something that runs faster. Of course, I don’t need to virtualize as well. I don’t have machines powerful enough to calculate hash keys for hundreds of millions of rows ad hoc, so I persist everything to tables. But my home base is the Persistent Staging Area, where everything starts.
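Since I persist instead of virtualize, the same kind of metadata can just as well render an incremental load statement instead of a view. Again a sketch with assumed table and column names:

```python
# Sketch: persisting the hub as a table instead of a view (assumed names, SQL Server-style).
# Only business keys not yet in the hub are inserted, so the load can be re-run safely.
def render_hub_insert(target: str, source: str, business_key: str, record_source: str) -> str:
    return f"""
INSERT INTO {target} (HUB_CUSTOMER_HK, {business_key}, LOAD_DTS, RECORD_SOURCE)
SELECT
    HASHBYTES('MD5', UPPER(src.{business_key})),
    src.{business_key},
    MIN(src.LOAD_DTS),
    '{record_source}'
FROM {source} AS src
WHERE NOT EXISTS (
    SELECT 1 FROM {target} AS tgt
    WHERE tgt.{business_key} = src.{business_key}
)
GROUP BY src.{business_key};
""".strip()


print(render_hub_insert("rv.HUB_CUSTOMER", "psa.CRM_CUSTOMER", "CUSTOMER_NUMBER", "CRM"))
```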
So, this is my data pipeline: sources → Persistent Staging Area → Raw Vault → Business Vault → Data Marts.
Just a little hint: maybe you have heard of ETL vs. ELT. In my model I load data into the PSA unmodified: no transformations, nothing, just extract and load. From there to the Raw Vault it is the same. Only when loading data into the Business Vault do I apply the transformations the business requests.
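A minimal sketch of what such an extract-and-load step into the PSA could look like; the connection strings, source query and the audit columns LOAD_DTS / RECORD_SOURCE are my own assumptions:

```python
# Minimal Extract & Load sketch: copy a source table into the PSA unmodified,
# adding only audit columns. Connection strings, table and column names are assumptions.
from datetime import datetime, timezone

import pandas as pd
import sqlalchemy as sa

source_engine = sa.create_engine("mssql+pyodbc://user:password@CRM_DSN")  # hypothetical source
psa_engine = sa.create_engine("mssql+pyodbc://user:password@DWH_DSN")     # hypothetical target

# Extract: take the rows exactly as they are, no transformations.
df = pd.read_sql("SELECT * FROM dbo.CUSTOMER", source_engine)

# Add audit columns so every snapshot stays traceable in the PSA.
df["LOAD_DTS"] = datetime.now(timezone.utc)
df["RECORD_SOURCE"] = "CRM"

# Load: append the snapshot to the persistent staging table, keeping all history.
df.to_sql("CRM_CUSTOMER", psa_engine, schema="psa", if_exists="append", index=False)
```

Everything downstream of the PSA can then be rebuilt at any time from these unmodified snapshots.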