“All the data, all the time” – Data Vault modeling
In a classical data warehouse, data are denormalized and stored in a so-called star schema model, because the main aim of such a solution is to serve users with data as fast as possible, and speed needs redundancy and hence denormalization. However, a certain level of normalization is sometimes necessary to make operations more manageable and automatable. To resolve the contradiction between these two methodologies, Dan Linstedt created a hybrid data-warehouse model: the Data Vault. The model tries to bring together the best of both worlds: the classical normalized 3NF approach and the star schema.
Advantages:
- easily extendable, scalable and easy to develop further
- shorter development cycles
- no redundant data storage
- easily parallelizable loading process
- strong rule system
- very flexible; modifications can be handled easily
Disadvantages:
- a lot more tables than in a traditional data-warehouse; therefore:
- a lot more joins
- complex algorithms and calculations
- very few experts
Luckily, some of these few experts are members of our team! Here is a short summary of our latest project:
Unlike the classical data-warehouse approach, the Data Vault model has four layers. The following layers have been created in this project:
- EXTRACT: this is the source layer, into which the extracts are imported through external tables
- STAGE: checked, cleaned and formatted data are loaded here, also with the use of external tables
- WORK: a technical layer where the delta between the incoming extracts and the already imported data is calculated (see the sketch after this list); it also contains calculated records and IDs generated with a hash algorithm
- STORE: the time-series tables of the Data Vault can be found here. Data Vault defines three different table types: HUB, LINK and SATELLITE, and besides these it also uses several types of technical auxiliary tables, of which the PIT table is applied in the current project.
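To illustrate the delta logic of the WORK layer, here is a minimal sketch in Python, assuming that each record is hashed over its attribute values and compared with the hash of the already loaded version. The function and column names (`row_hash`, `detect_delta`, `customer_id`) are illustrative, not the project's actual code.

```python
import hashlib

def row_hash(record: dict, key_columns: list[str]) -> str:
    """Hash of all non-key attribute values, used to detect changed records."""
    payload = "|".join(str(record[c]) for c in sorted(record) if c not in key_columns)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def detect_delta(incoming: list[dict], loaded: dict[str, str], key_columns: list[str]) -> list[dict]:
    """Return only the new or changed records of an incoming extract.

    `loaded` maps each business key already in the warehouse to the row hash
    stored with it.
    """
    delta = []
    for rec in incoming:
        business_key = "|".join(str(rec[c]) for c in key_columns)
        if loaded.get(business_key) != row_hash(rec, key_columns):
            delta.append(rec)
    return delta

# Example: nothing has been loaded yet, so the single incoming row is a delta
new_rows = detect_delta(
    incoming=[{"customer_id": "C1", "name": "Alice"}],
    loaded={},
    key_columns=["customer_id"],
)
```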
Some explanation of the Data Vault table types:
- HUB: it contains the unique business IDs (business keys)
- LINK (LNK): it contains the connections between HUB tables. Its key is a compound key created from the business keys of the connected HUB tables
- SATELLITE (SAT): it contains the business attributes, organized by period of validity. A SAT table contains data related to one particular aspect
- PIT: an auxiliary table of Data Vault that connects the SAT tables with the HUB table they belong to (a sketch of all four table types follows below)
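To make the table types more concrete, here is a minimal sketch of how such structures could look, expressed as Python dataclasses. The entity and column names (Customer, Order, customer_id, and so on) are purely illustrative assumptions and are not taken from the project.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class HubCustomer:             # HUB: one row per unique business key
    customer_hash_key: str     # hash-based ID derived from the business key
    customer_id: str           # the natural business key itself
    load_date: datetime
    record_source: str

@dataclass
class LinkCustomerOrder:       # LINK: connects two (or more) HUBs
    link_hash_key: str         # hash of the combined business keys
    customer_hash_key: str
    order_hash_key: str
    load_date: datetime
    record_source: str

@dataclass
class SatCustomerDetails:      # SAT: descriptive attributes with validity period
    customer_hash_key: str     # points back to the HUB
    load_date: datetime        # start of the validity period
    load_end_date: datetime | None
    name: str
    email: str
    record_source: str

@dataclass
class PitCustomer:             # PIT: helper that joins the HUB with its SAT versions
    customer_hash_key: str
    snapshot_date: datetime
    sat_customer_details_load_date: datetime  # which SAT version was valid then
```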
A FRAMEWORK schema has also been configured to control the load processes through its programs and META tables. It is also important to mention that in the STORE layer the system avoids the traditional sequence-generated IDs. Instead, the IDs are created with hash algorithms applied to the natural keys, which makes the data more easily transferable between different environments.
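A minimal sketch of how such a hash-based key could be derived from a natural key: MD5 over the normalized, pipe-separated business key parts is a common Data Vault convention, but the exact algorithm and formatting used in the project are assumptions here.

```python
import hashlib

def hash_key(*business_key_parts: str) -> str:
    """Derive a deterministic surrogate key from one or more natural key parts.

    Because the result depends only on the business key, the same record gets
    the same ID in every environment, unlike a sequence-generated ID.
    """
    normalized = "|".join(part.strip().upper() for part in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Example: a customer key and a compound customer-order link key
print(hash_key("CUST-0042"))
print(hash_key("CUST-0042", "ORD-2023-0001"))
```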
This project is ongoing, so the results, conclusions and lessons learned will be presented in a later post.