Data Warehouse Naming Conventions

As the number of objects in my data warehouse keeps growing, structuring them has become an important topic. The goal is that even new employees can grasp the content immediately.

There are various naming conventions on the internet, but they usually only cover prefixing objects. That is probably the easy part. What about structuring hubs with hundreds of satellites?

There is no general rule. I believe most developers like to develop their own style, and when you look into a data warehouse you can perhaps identify who worked on it, much like people who can tell which code came from which hacker or virus programmer.

So this is my style!

Projects

In WhereScape I work with projects. A project is just a collection of objects. And these projects can be grouped.

My project numbering follows the value flow. It starts with the load from the source and ends with the export to destinations.

I understand there is also a more holistic approach that combines all steps into one project and deploys it as a whole. That didn't work for me. If we were a huge team, I would reconsider it.

| Step | Project | Description |
| --- | --- | --- |
| 1 | Load | Landing area. Import all external data into this area. |
| 2 | Data Store | Persistent Staging Area. Historize all data ever received from the source. |
| 3 | Data Vault | Build Data Vault objects like Hubs, Links and Satellites. I don't separate Raw & Business Vault; the lineage will tell what is raw and what is calculated. |
| 4 | Data Mart | Build Data Mart objects like Dimensions, Facts and Aggregations. |
| 5 | Reports | Reporting area managed by BI. |
| 6 | User generated objects | Reporting area managed by the user. |
| 7 | System objects | Preparing the data for any platforms we manage, like SQL Server Analysis Services or Elasticsearch. Just views. |
| 8 | Exports | Exporting data to any foreign system. Historize all data sent. |
| 9 | Meta | Metadata, maintenance, script collection, etc. |

The list above contains just the main nodes. The next 1-n digits are used to order the projects according to the execution steps.

Databases

Consequently, I also use this naming convention for my databases.

| Database Name |
| --- |
| bi_10_load_{partner/system} |
| bi_20_data_store_{partner/system} |
| bi_30_data_vault |
| bi_40_data_mart_{area} |
| bi_50_report_{area} |
| bi_60_user_{department/user} |
| bi_70_system_{name} |
| bi_80_export_{partner/system} |

Usually I have a main database without a suffix. If I see that a source or an area will get quite big, I separate it.
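For a hypothetical CRM source and a sales data mart, this results in databases such as bi_10_load_crm, bi_20_data_store_crm and bi_40_data_mart_sales, while the Data Vault stays in the single bi_30_data_vault database; the concrete names are only examples of the pattern above.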

Objects

Here is a list of how I name my main objects:

| Object Type | Naming Scheme | Description |
| --- | --- | --- |
| Load | load_{partner/system}_{table}_{version}__{group} | Landing area. For file loading I add a version to it (see my blog). For loading data in parallel from multiple identically structured databases, I use the group to separate them. |
| Data Store | ds_{partner/system}_{table} | Import-oriented history |
| | ds_exp_{partner/system}_{table} | Export-oriented history |
| Hub | hub_{business_concept}_{bk_space} | Depending on how you store business keys (concatenated or composite), you may end up with several versions of the same hub with different business keys. Separate them by business key space (which is not the source system!) |
| Link | link_{hub}_{hub}..._{bk_space} | The hubs are sorted by size and importance |
| Satellite | sat_{hub}_{content} | Hub satellite. I don't separate them into Raw and Business Vault; content rules! If I want to know what is raw and what is calculated, I follow the lineage. |
| | lsat_{link}_{content} | Link satellite |
| | lsat_{link}_eff | Link satellite for effectivity |
| PIT | pit_{hub/link} | Point-in-time object |
| Bridge | bridge_{content} | Bridge object |
| Dimension | dim_{dimension}_{area} | If there are several similar dimensions, separate them by area |
| Fact | fact_{fact}_{area} | If there are several similar facts, separate them by area |
| View | {object}__{description} | Views for a specific purpose |
| | {object}__current | Shows the current active record of a business key (e.g. Data Store) |
| | {object}__first | Shows the first value of a business key |
| | {object}__last | Shows the last known value of a business key (e.g. Data Store, Satellite). In contrast to current, the record could be marked as deleted. |
| | {object}__timeline | Shows the timeline with start and end date based on dss_load_datetime or dss_event_datetime |
| | {object}__infinity | Like timeline, but the first start date is 0001-01-01 |
| | {object}__cube | Specific transformation for cubes |
| | error_{object}__{rule} | Error views that automatically QA the data inside the object or compare it against preceding objects for completeness. They are checked daily. |
| Stage | stage_{object}__{step_id}_{step_name} or stage_{object}__{step_id}_{step_name}__{sub_step_id}_{sub_step_name} | Temporary tables and views to organize the calculation steps towards the target. Next time I would rather name them temp instead of stage. All steps report their results to the upper level; sub-steps are a kind of subroutine that reports its results back to the main step. |
| Source Mapping | {object}__{source_object} | WhereScape-specific object to map multiple sources to one target. |
| Procedures | update_{object} | Default, unmodified procedure created by the automation solution. Updatable. |
| | custom_{object} | Procedure modified by a developer |
| | user_{} | User-defined procedures for shared functionality |
| | meta_{} | Metadata procedures |

A double underscore "__" marks an "interpretation" or a "step", while the main object name itself uses only single underscores to separate the words. The sketch below shows what such interpretation views can look like.
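Here is a minimal sketch of such double-underscore views in T-SQL. The data store ds_crm_customer and its columns customer_code and customer_name are made up for illustration; only the view names and the dss_* columns follow the conventions above.

```sql
-- Hypothetical data store: ds_crm_customer (partner "crm", table "customer")

-- __current: the currently valid, not deleted record per business key
CREATE VIEW ds_crm_customer__current AS
SELECT *
FROM ds_crm_customer
WHERE dss_is_current = 1
  AND dss_is_active = 1;
GO

-- __timeline: start and end date per record, derived from dss_load_datetime
CREATE VIEW ds_crm_customer__timeline AS
SELECT customer_code,                                  -- hypothetical business key
       customer_name,                                  -- hypothetical data column
       dss_load_datetime AS dss_start_datetime,
       LEAD(dss_load_datetime, 1, '9999-12-31')
           OVER (PARTITION BY customer_code
                 ORDER BY dss_load_datetime) AS dss_end_datetime
FROM ds_crm_customer;
GO

-- error_{object}__{rule}: daily QA check, e.g. business keys with more than one current record
CREATE VIEW error_ds_crm_customer__duplicate_current AS
SELECT customer_code,
       COUNT(*) AS current_records
FROM ds_crm_customer
WHERE dss_is_current = 1
GROUP BY customer_code
HAVING COUNT(*) > 1;
GO
```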

Columns

And here is the list of all the columns and system columns I use. Quite a lot.

| Column Name | Description |
| --- | --- |
| link_{link}_key | Link hash key |
| hub_{hub}_key | Hub hash key |
| dim_{dimension}_key | Dimension key |
| business key columns | Business keys. No rule. |
| data columns | Data columns. No rule. |
| event_datetime | Used if we need to persist all events (even those without changes in the data). As a data column it goes into dss_change_hash and is therefore persisted. |
| is_active | Used as a data column in effectivity satellites. A satellite without content doesn't exist in my system. |
| dss_data_set_type | Helps to process FULL, DELTA and CDC loads into Data Stores |
| dss_event_datetime | The event datetime can also be used in the data area. Here it is used to automate the timeline calculation. |
| dss_effectivity_datetime | Start date calculated from dss_event_datetime |
| dss_expiry_datetime | End date calculated from dss_event_datetime |
| dss_load_datetime | The first time the record appeared in my data warehouse. See my blog. |
| dss_sequence_no | Numbering within a loading batch, assigned either on the way in or calculated after landing the batch. Persisted in Data Stores to form a primary key, added as a sub-sequence key in satellites (= multi-active), or alternatively used in edge cases to manipulate dss_load_datetime in satellites. |
| dss_record_source | Where the data comes from: {partner}_{database}_{schema}_{table} or {file path} |
| dss_change_hash | Change hash over all data columns. It is easier to compare a single column with the target than to compare every data column. |
| dss_start_datetime | Start date calculated from dss_load_datetime |
| dss_end_datetime | End date calculated from dss_load_datetime |
| dss_is_current | Flags a record as current or old in Data Stores |
| dss_is_active | Flags a record as active or deleted in Data Stores |
| dss_create_datetime | Datetime when the record was inserted into the table |
| dss_update_datetime | Datetime when the record was changed in the table |

The list above is also my sort order; the system sorts the columns automatically accordingly. The sketch below shows how these columns come together in a satellite and in a hash-based delta load.
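This is a minimal sketch in T-SQL. The names sat_customer_details, hub_customer_key and stage_sat_customer_details__10_calc are made up; only the dss_* columns and their order follow the list above.

```sql
-- Hypothetical hub satellite with the system columns in the documented sort order
CREATE TABLE sat_customer_details (
    hub_customer_key    char(32)      NOT NULL,  -- hub hash key
    customer_name       nvarchar(200) NULL,      -- data columns: no rule
    customer_segment    nvarchar(50)  NULL,
    dss_load_datetime   datetime2     NOT NULL,  -- first appearance of the record in the DWH
    dss_record_source   nvarchar(200) NOT NULL,  -- e.g. crm_salesdb_dbo_customer
    dss_change_hash     char(32)      NOT NULL,  -- hash over all data columns
    dss_create_datetime datetime2     NOT NULL,
    dss_update_datetime datetime2     NOT NULL,
    CONSTRAINT pk_sat_customer_details
        PRIMARY KEY (hub_customer_key, dss_load_datetime)
);

-- Delta detection via dss_change_hash: compare one column instead of every data column
INSERT INTO sat_customer_details (
    hub_customer_key, customer_name, customer_segment,
    dss_load_datetime, dss_record_source, dss_change_hash,
    dss_create_datetime, dss_update_datetime)
SELECT stg.hub_customer_key, stg.customer_name, stg.customer_segment,
       stg.dss_load_datetime, stg.dss_record_source, stg.dss_change_hash,
       SYSDATETIME(), SYSDATETIME()
FROM stage_sat_customer_details__10_calc AS stg         -- hypothetical stage object
LEFT JOIN (
    SELECT hub_customer_key, dss_change_hash,
           ROW_NUMBER() OVER (PARTITION BY hub_customer_key
                              ORDER BY dss_load_datetime DESC) AS rn
    FROM sat_customer_details
) AS cur
  ON  cur.hub_customer_key = stg.hub_customer_key
  AND cur.rn = 1
WHERE cur.hub_customer_key IS NULL                       -- new business key
   OR cur.dss_change_hash <> stg.dss_change_hash;        -- data has changed
```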

Templates

And for those who work with templates and automation, here is my list of naming conventions:

| Template Type | Naming Scheme | Description |
| --- | --- | --- |
| DDL | {company}_{technology}_ddl_{table/view}_{object}__{description} | A DDL generator is always technology dependent. It creates views or tables. The objects are the ones from above, and the descriptions are all the different views, such as last or current. |
| Procedure | {company}_{technology}_procedure_{object}_{special_case} | Creates loading procedures to load data into objects. Stored procedures are technology dependent. |
| Script | {company}_{technology}_script_{object}_{special_case} | Creates loading scripts to load data into objects. Scripts are usually technology dependent. They might be written in PowerShell, Python or other languages, but they load data into a specific environment; although they might all talk SQL, they behave differently and can be optimized differently. |
| Block | {company}_{technology}_block_{command}_{topic} | Reusable patterns to build DDLs and procedures. Commands are SELECT, INSERT, UPDATE or DELETE. |
| Utility | {company}_utility_{topic} or {company}_{technology}_utility_{topic} | A library to support all of the above: configuration, variables, elements, snippets, metadata, ddl, procedure, etc. Both technology agnostic and technology dependent. |
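As an illustration with a made-up company prefix acme on SQL Server, this scheme produces template names such as acme_sqlserver_ddl_view_datastore__current for the DDL of a current view, acme_sqlserver_procedure_datastore for its loading procedure, and acme_sqlserver_block_insert_datastore for a reusable insert pattern; the concrete names are hypothetical, only the structure follows the table above.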

I hope somebody finds this useful. If you have more to add or follow a different approach, please get in touch. I'm interested in learning about your approach.