Nov 10 2018
How many times did you hear in the last few years the words “Data Lake”? And how many times have you asked yourself how it should be organized? And how often did you ask Google for more information and didn’t get any valuable results and only funky buzzwords? I did a lot and found not enough answers.
Over the time, I figured out myself an optimal design for organizing the very first layer of a data lake or in old school language a ‘file repository’. And recently I completed it by asking the experts at a DDVUG meetup, where Michael Olschimke showed us their solution. Thanks Michael.
So, what layers have I found:
This layer represents the data provider. If it is from an external partner, I name it after him. If it comes from an internal system, I name it after that.
It could be also the technology, which can be the master of all data following. E.g. mysql, kafka, etc.
There are also tons of files which are produced by my company and don’t fall into any category. E.g. Excel-Files, manually maintained data, etc. I set the structure above anything, so we as a company are a source too.
If there is a chance of having multiple connections for the same source, add a connection layer. If there is the slightest chance of having another set of credentials, separate it.
Usually I start collecting data and process them. But some when in the future I start to pump data in the opposite direction. So where to put these files? Could be a completely different file repository. Or it can be combined in this repository.
This is the place to specify the kind of operation we do. I have:
- Update (log files of operation on the source system I do)
This layer could be used to group the source objects into a structure. E.g.
Could be also a directory structure like database_name/schema_name if data from SQL systems are to be saved.
This is the data object. In data warehousing this is the table name. Another way to describe it, is to think about all other following directories and files below summed up at this level. If a file doesn’t match the object definition or object meaning, create another one.
I had difficulties to find a proper word for that layer. Maybe content is more likely. Choose your favorite.
Examples are ‘customer’, ‘stock’ or something like ‘marketing_cost’. You get the point.
/data set type
What happens if we have different data sets for the very same object?
I have seen the following data sets so far:
Full means that we get the files every day in full. E.g. a customer table export from an application. Or an Excel-File which is modified by the users.
Delta means that we only get changed records. E.g. log files, transaction files.
CDC (Change Data Capture) is a very detailed transaction log file of a database having all
DELETEs of the source system. Usually it is ingested directly from a database to another database.
If the rows in a file are just data, why do we need to separate Full and Delta at all? The only reason for separating them is, if we need to detect deleted records in the source system. Then I’m able to create a status tracking satellites or in non-data vault systems to delete the record or mark it as deleted.
The difference between Delta and CDC is, that CDC contains information about deleted records.
Version? What is a version for? This principle I derived from APIs. E.g. Google API evolved from v3 to v4. What happens, if the source suddenly decides to modify the file structure? E.g. translating all column names from German to English? Or modifying from csv to json? Adding columns is not a new version. The loading procedure should cope with that.
The main reason for a new version is, if the existing loading procedure can’t process the file. Then add a new version and write a new loading procedure.
My default version is ‘1’. I have only a few objects were a ‘2’ had to be created. But I want to be prepared! Modifying an existing layer after a need occurs, is not future-proof. Tweaking loading procedures to understand were something ends and something new starts is bad.
If you are going to model your data lake or file repository to build data vault objects, this layer is a must have. You’ll want to preserve, when the data first appeared in your data lake. The definition of the load_date can be found here.
All others can probably omit that.
Sometimes I need to ingest data by different characteristics. E.g. per company, per shop, per account, per machine, per anything. Per day is not a valid group (see below in event date).
Usually the differentiator is not part of the file name but part of the directory structure of the source system. Here is the place to preserve and save it.
The semantics of a group and connection could be kind of equal. Sometimes it makes sense to use it as a connection above or as a group here.
Note to myself 1: Why not placing group before data_set_type? Because we need a loading procedure for each data_set_type. Groups and the rest can get looped through.
Note to myself 2: Why not placing group before load_date? Could be, but a load_date should be a single entity. It serves better for incremental loading instead of having it inside each group. A group is data and data comes at the end.
Note to myself 3: Why not placing group after event_date? Sure, feel free.
First question: Does there exists an event date as content inside the file? If yes, all good! No need for an extra layer. Nevertheless it could be created in a Hive environment for speed purposes to partition the files.
Second question: Should the file have an event date? If yes and we don’t have it, we need to add this somehow to the path.
Maybe the file has a string in the filename like ‘YYYY-MM-DD’ or an integer value representing a timestamp. Then add this information as an event_date.
If there is nothing available, use as a last resort the create datetime of the file.
Structure it as /YYYY-MM-DD or /year/month/day. As you like.
Do we still miss something? If a key information is missing inside the file, add it as a layer. Manipulating files to add a missing column is a no-go!
The only way possible to add more information is to create them outside the file content. Maybe as another layer.
Actually group and event_date from above are just examples of missing data.
Finally, our source file.
If you ever need the create datetime of the file, preserve it by renaming the file. It doesn’t survive the copy process to any cloud platform and is lost.
And don’t give too much meaning to a filename. To extract data from the filename is difficult and needs precise instructions, where to look for what. No automation.
A file represented by a filename is just data!
All these components can be used as directories or if your system supports it, as meta data to the files. If you use meta data, you will probably find and use even more tags to describe your files.
Choose which elements from above are important for you. In a directory structure, the full string will look like this:
A very very long path and people might get lost. But computers will like it, as many pieces can help to automate the loading procedures. Some systems are even suggesting to use key=value as a directory name, e.g. year=2018 or shop_code=de. This makes it even easier to automate.
The absolute minimum structure would be in my opinion:
If you believe that a table with 2 versions could be automated too, then the 2nd version is not of importance ;-).
For ingesting data, we need a procedure for each combination of
If you believe that there are way too many directories and you don’t need all these nifty things, don’t use load_date and modify the filename to include the omitted layers:
Please tell me your solution, if you believe I missed a case.