Data Vault 2.0 and Big Data

In the previous blog post you discovered the first version of Data Vault methodology. But since the very first iteration, the specification evolved and a few years ago a version 2 was proposed. More adapted to the Big Data world, with several deprecation notes, and more examples adapted to the constantly evolving data world.

New ebook 🔥

Learn 84 ways to solve common data engineering problems with cloud services.

👉 I want my Early Access edition

In this post, you will discover more details about the Data Vault methodology. In the first part, you will see the hubs, links, satellites, and tables as the main building blocks for modeling the approach. In the second part, you will see how the keys are managed. In the 3rd part, you will see how this new specification addresses modern data problems, whereas, in the last part, you will discover an example of the architecture involving Data Vault.

The first part, hubs, are a quite clear concept. ,hey are a list of business keys. An important thing to highlight, and you will see it in the specification, is that the hub doesn't require surrogate keys, in any (hash or sequence) format.

The links are there to represent the relationships, but what it's maybe not clear is that the relationship can be between 2 or more business keys, so presumably between multiple hubs. They never contain validity dates. Different types of such links exist:

The next type of components are satellites. They contain descriptive data for the hubs or links. The data they store may change over time. That's why they contain an effectivity column indicating the time when the given record is valid. There are 3 types of satellites:

In addition to these 3 main building blocks, in Data Vault you'll retrieve the tables, all for different purposes:

Data Vault 2 added an extra component to handle Big Data scenarios, but you will find it in the 3rd section.

Hashes are everywhere

In the first specification of Data Vault, the sequence surrogate keys were allowed to overcome performance issues due to the joins on multiple columns. However, it introduced some constraints during the loading, and with the modern data technologies, it doesn't fit everywhere.

That's the reason why it changed, and the recommended way to deal with it is to generate a hash key from all columns identifying the business key. When can it apply? An example you will find very often is one of the flights. A flight has the same number but occurs on different days. So the business key is not only the number but also the date. Both can be then used to compute the hash to be used as an optional key. And this optional key has few benefits that I will discuss in the next section.

Another place where the hash function can be used is an optional hash difference column. You can add it to detect changes in the satellites for a given business key instead of performing the comparison column by column.

Why Big Data-compliant?

In 2 previous sections, you already saw a few changes that help Data Vault adapt to the modern data architectures. The first adaptation is the extra component from the first section of the blog post called landing zone files. They're the files written into an environment characterized by parallel access, partitioning, sharding, and possibly sub-partitioning (bucketing?). Sounds familiar? Yes, it's data lake-compatible.

Regarding the hashes, they have a great benefit for the Big Data workloads. The loading pattern for Data Vault 1 required to load the hubs first, and later, when the links and satellites were loaded, make the lookups on the hubs table to retrieve the surrogate keys. It changes in the Data Vault 2 because the hash key is deterministic, ie. it can be resolved from already existing data, making the load parallel for all components:

Of course, this approach makes only sense where there are no referential integrity constraints. If it's enforced, you will have to load hubs, links and satellites in order.

Another Big Data-related change is about making some fields optional. An example of that is the load date time field representing when a given record arrived in the system. Not all platforms support it (in Spark you can use current_timestamp method) and was made optional in the new specification. Same, but not for the same reason, is valid for the update user or update timestamp fields representing who and when updated a given line for the last time in the landing zone. Despite the recent effort with ACID-compatible file formats (DeltaLake, Apache Hudi, Apache Iceberg), in-place changes are still costly for big volumes of data. That's why these 2 fields were made optional.

Finally, the already mentioned surrogate sequence keys are deprecated because they don't scale and aren't deterministic (eg. multi-region dataset split).

Architecture examples

All this is still related to the modeling part of Data Vault. Let's now see an example of the architecture. As for every data system, on the leftmost part of the schema, you can find the data sources and the ingestion process that, thanks to the landing zone (aka staging area) abstraction, can be performed in real-time.

An important point to notice is that at this first stage, you can only apply hard rules like reformatting the data. You can't apply the soft rules, which are the data-interpretation business rules. In consequence, the data must remain as close to the source format as possible.

The data from the landing zone will be loaded into the data warehouse layer, where you will apply all Data Vault modeling rules. But not only. This layer can contain 3 optional vaults. The first of them is Business Vault representing the raw vault with business rules applied. It's useful for the advanced users of the organizations who can do some smart things with the data on their own, without necessarily accessing it from BI reporting tools or the data exposition layer.

The second optional part is called Metrics Vault. That's the place where different statistics about ETL execution are recorded. The last part is about Operational Vault, and it's there to ensure faster access for the external systems that, for example, may need it to replicate the master dataset or reference tables definition to the vault.

Once all these vaults are ready, the data is processed with soft rules and exposed to the marts like classical data marts or error marts. The latter stores the ETL errors occurred during this exposition stage. There is also a meta mart that will contain the metadata about the raw vault (the one not processed with soft rules), parameters for business rules or backup logs.

To sum-up this architecture part, nothing better than a picture:

Just a quick summary, because sometimes you can find 2 incoming arrows. In the design proposals I found (links in Further reading section), some different approaches were proposed. Regarding the loading part, the staging area is optional, and if it's missing, the data is loaded directly to the raw data vault. The same flexibility applies to the data marts. You can either build them from a business vault or directly from the raw vault with a virtual data mart, which is nothing more than a view on top of the raw vault.

After the first post about Data Vault, you could think that it's only a modeling approach. It's not, and you can see it in this article. Despite a more detailed explanation for the Data Vault components, you can see that it's also an architectural approach that fits the continually evolving data world.