Aug 10 2019
As the number of objects in my data warehouse grows, structuring them has become an important topic. The goal is that new employees, too, can grasp the content immediately.
There are different naming conventions on the internet, but they usually only cover prefixing objects. That is probably the easy part. How about structuring hubs with hundreds of satellites?
There is no general rule. I believe that most developers like to develop their own style. And when you look into a data warehouse, you can maybe identify who worked on it, much like people who can tell which code came from which hacker or virus programmer.
So this is my style!
In WhereScape I work with projects. A project is just a collection of objects. And these projects can be grouped.
My project numbering follows the value flow. It starts with the load from the source and ends with the export to destinations.
I understand that there is also a different, more holistic approach: combining all steps into one project and deploying that as a whole. This didn’t work for me. If we were a huge team, I would reconsider that.
|1||Load||Landing Area. Import all external data into this area|
|2||Data Store||Persistent Staging Area. Historize all data ever received from the source.|
|3||Data Vault||Build Data Vault Objects like Hubs, Links and Satellites. I don't separate Raw & Business Vault. The lineage will tell what is raw and what is calculated.|
|4||Data Mart||Build Data Mart Objects like Dimensions, Facts and Aggregations.|
|5||Reports||Reporting Area managed by BI|
|6||User generated objects||Reporting Area managed by the User|
|7||System objects||Preparing the data for any platforms we manage, like SQL Server Analysis Services or Elasticsearch. Just Views.|
|8||Exports||Exporting Data to any foreign system. Historize all data sent.|
|9||Meta||Meta data, maintenance, script collection, etc.|
The list above shows just the main nodes. The next 1-n digits order the projects according to the execution steps.
Consequently, I use the same naming convention for my databases.
Usually I have a main database without a suffix. If I see that a source or an area will get quite big, I separate it.
Here is a list of how I name my main objects:
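To illustrate how such a numbering sorts, here is a minimal Python sketch. The project names and the `<number> <label>` format are hypothetical and only serve the example; the main nodes are taken from the table above. Sorting the numbers as text (not as integers) keeps every project under its main node and orders it by the following digits.

```python
# Main nodes from the table above, keyed by the first digit.
MAIN_NODES = {
    "1": "Load",
    "2": "Data Store",
    "3": "Data Vault",
    "4": "Data Mart",
    "5": "Reports",
    "6": "User generated objects",
    "7": "System objects",
    "8": "Exports",
    "9": "Meta",
}


def split_project(name):
    """Split a hypothetical project name like '31 Hubs' into ('31', 'Hubs')."""
    number, _, label = name.partition(" ")
    if not number.isdigit():
        raise ValueError(f"project name must start with digits: {name!r}")
    return number, label


def main_node(name):
    """Return the main node a project belongs to, based on its first digit."""
    number, _ = split_project(name)
    return MAIN_NODES[number[0]]


def execution_order(names):
    """Order projects by their digit prefix, compared as text.

    Text comparison gives '1' < '11' < '2' < '31' < '321', so sub-projects
    stay grouped below their main node.
    """
    return sorted(names, key=lambda name: split_project(name)[0])


if __name__ == "__main__":
    projects = ["31 Hubs", "2 Data Store", "11 Load CRM", "321 Satellites Customer", "1 Load"]
    for project in execution_order(projects):
        print(f"{project:<28} -> {main_node(project)}")
```

The design choice here is the text comparison: a numeric sort would place project 2 before project 11 and break the grouping by main node.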
|Object Type||Naming Scheme||Description|
For file loading I add a version to it. See my blog
For loading data in parallel from multiple same looking databases, I use "group" to separate them.
|Data Store||Import oriented history|
|Export oriented history|
|Hub||Depending on how you store business keys (concatenated or composite), you might end up with multiple hubs for the same concept but with different business keys. Separate them by business key space (it is not the source system!)|
|Link||Hubs are sorted by size and importance|
|Satellite||Hub Satellite. I don't separate them by raw or business vault. Content rules! If I want to know what is raw and what was calculated, I follow the lineage.|
|Link Satellite for effectivity|
|Dimension||If there are several similar dimensions, separate them by area|
|Fact||If there are several similar facts, separate them by area|
|View||Views for specific purpose|
|Shows the current active record of a business key (e.g. Data Store)|
|Shows the first value of a business key|
|Shows the last known value of a business key (e.g. Data Store, Satellite). In contrast to current, the record could be marked as deleted.|
|Shows the timeline with start date and end date based on dss_load_datetime or dss_event_datetime|
|Similar to timeline but the first start date is 0001-01-01|
|Specific transformation for cubes|
|Error views to automatically QA the data inside the object or to compare it against former objects for completeness of data. They are checked daily.|
|Stage||Temporary tables and views to organize the calculation steps towards the target.
Next time I will rather name that
All steps report results to the upper level. Sub-steps are kind of sub-routines to report back results to the main steps
|Source Mapping||WhereScape specific object to map multiple sources to one target.|
|Procedures||Default unmodified procedure created by automation solution. Updatable.|
|Modified procedure created by developer|
|User-defined procedures for shared functionality|
|Meta data procedures|
A double underscore “__” marks an “interpretation” or a “step”. The main object name uses only dashes to separate the words.
And here is the list of all the columns and system columns I use. Quite a lot.
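The timeline views described in the table above can be sketched in a few lines of Python. This is only an illustration under assumptions: the dictionary keys `business_key`, `start` and `end` are hypothetical names, and the high date 9999-12-31 for the open-ended last record is an assumed convention. Only `dss_load_datetime` and the low date 0001-01-01 come from the text.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter

LOW_DATE = datetime(1, 1, 1)         # 0001-01-01, as used by the second timeline variant
HIGH_DATE = datetime(9999, 12, 31)   # assumption: end date of the latest record per key


def timeline(rows, infinite_start=False):
    """Return the rows with 'start' and 'end' added per business key.

    Each record is valid from its dss_load_datetime until the next record
    of the same business key was loaded. With infinite_start=True the
    first record of every key starts at LOW_DATE instead.
    """
    ordered = sorted(rows, key=itemgetter("business_key", "dss_load_datetime"))
    result = []
    for _, group in groupby(ordered, key=itemgetter("business_key")):
        records = list(group)
        for index, row in enumerate(records):
            start = row["dss_load_datetime"]
            if index == 0 and infinite_start:
                start = LOW_DATE
            if index + 1 < len(records):
                end = records[index + 1]["dss_load_datetime"]
            else:
                end = HIGH_DATE
            result.append({**row, "start": start, "end": end})
    return result


if __name__ == "__main__":
    rows = [
        {"business_key": "A", "dss_load_datetime": datetime(2019, 2, 1), "city": "Bern"},
        {"business_key": "A", "dss_load_datetime": datetime(2019, 1, 1), "city": "Basel"},
        {"business_key": "B", "dss_load_datetime": datetime(2019, 1, 15), "city": "Chur"},
    ]
    for row in timeline(rows, infinite_start=True):
        print(row["business_key"], row["city"], row["start"], row["end"])
```

The same logic would work on `dss_event_datetime` by swapping the key that is sorted and compared.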
|Link Hash Key|
|Hub Hash Key|
|Business Keys. No rule.|
|Data columns. No rule.|
|Used if we need to persist all events (even those with no change in data). It is included in the dss_change_hash, so the event is persisted.|
|Used as a data column in effectivity satellites. A satellite without content doesn't exist in my system.|
|Helps to process FULL, DELTA and CDC loads into Data Stores|
|The event datetime can also be used in the data area. Here it is used to automate the timeline calculation|
|Calculate the start date based on dss_event_datetime|
|Calculate the end date based on dss_event_datetime|
|The first time the record appeared in my data warehouse. See my blog.|
|Numbering of a loading batch, assigned either on the way in or calculated after landing the batch. Persisted in Data Stores to form a primary key, added as a sub-sequence key in satellites (= multi-active) or, in edge cases, used to manipulate the dss_load_datetime in satellites.|
|Where the data comes from.|
|Change hash of all data columns. To detect a change, it is easier to compare a single column with the target than to compare all columns.|
|Calculate the start date based on dss_load_datetime|
|Calculate the end date based on dss_load_datetime|
|Flag a record as current or old in Data Stores|
|Flag a record as active or deleted in Data Stores|
|Datetime when a record got inserted into the table|
|Datetime when a record got changed in the table|
The list above is also my sort order. The system automatically sorts the columns accordingly.
And for those who work with templates and automation, here is my list of naming conventions:
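The idea behind the change hash from the list above can be sketched as follows. This is a minimal illustration, not the exact implementation used in my warehouse: the delimiter, the treatment of NULL as an empty string, the choice of MD5 and the function names are all assumptions made for the example.

```python
import hashlib

DELIMITER = "|"  # assumption: separator placed between the column values


def change_hash(row, data_columns):
    """Hash all data columns of a row into a single comparable value.

    The column order must be fixed, otherwise identical rows would
    produce different hashes.
    """
    parts = []
    for column in data_columns:
        value = row.get(column)
        # Assumption: NULL (None) is treated like an empty string.
        parts.append("" if value is None else str(value))
    payload = DELIMITER.join(parts).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def has_changed(incoming_row, target_hash, data_columns):
    """Compare one hash with the target instead of every single column."""
    return change_hash(incoming_row, data_columns) != target_hash


if __name__ == "__main__":
    columns = ["name", "city", "segment"]
    stored = {"name": "Miller", "city": "Basel", "segment": None}
    incoming = {"name": "Miller", "city": "Bern", "segment": None}
    stored_hash = change_hash(stored, columns)
    print(has_changed(stored, stored_hash, columns))
    print(has_changed(incoming, stored_hash, columns))
```

The delimiter matters: without it, the values `("ab", "c")` and `("a", "bc")` would concatenate to the same string and look unchanged.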
|DDL||A DDL generator is always technology dependent. It creates views or tables. Objects are from above and descriptions are all those different views, such as
|Procedure||Creates loading procedures to load data into objects.
Stored procedures are technology dependent.
|Script||Creates loading procedures to load data into objects.
Scripts are usually technology dependent. They might be written in PowerShell, Python or other languages, but they load data to a specific environment. Although the environments might all talk SQL, they behave differently and can be optimized differently.
|Block||Reusable patterns to build DDLs and procedures. Commands are SELECT, INSERT, UPDATE or DELETE.|
|Utility||Library to support all above. I have configuration, variables, elements, snippets,
metadata, ddl, procedure, etc.
Both technology agnostic and technology dependent.
I hope somebody finds this useful. If you have more to add or follow another approach, please get in touch. I’m interested in learning about it.