Snapshots
So far in our data modeling journey, we’ve looked at the schema of the source tables and created a staging layer to clean and rename the columns. We shall now begin working on creating the core layer of a data warehouse that consists of facts and dimensions.
Facts and dimensions are the building blocks of a Kimball-style data warehouse.
- Facts are measurements, metrics or actions that represent business processes. They are typically numeric values that can be aggregated. Examples of facts include sales revenue, quantity sold, and the number of orders placed.
- Dimensions provide context to facts. They are descriptive attributes that help to understand the facts. Examples of dimensions include time, location, and product.
This warehouse architecture is also known as a star schema. You can find detailed information about the start schema on the Kimball Group website.
We will first create the dimension tables and then move on to the fact tables. In order to create our dimension tables,
we need to keep track of changes to our data over time. This is where the snapshot
functionality in dbt comes in handy.
What is a Snapshot?
The snapshot functionality in dbt compares the current state of a table with its previous state and creates a new row
every time it detects a change. This allows us to track changes to our data over time and create a history of our data.
dbt offers two strategies to compare the current state of a table with its previous state: timestamp
and check_cols
.
- Timestamp: This strategy compares the current state of a table with its previous state based on a
timestamp column - usually an
updated_at
column. If the timestamp column is updated, dbt creates a new row in the snapshot table. - Check_cols: This strategy compares the current state of a table with its previous state based on the values
of specific columns. If any of the columns specified in the
check_cols
configuration are updated, dbt creates a new row in the snapshot table.
dbt recommends the timestamp
strategy for most use cases. The check_cols
strategy can be used when you don’t have a
timestamp column in your table, however it is less performant than the timestamp
strategy.
Example
The follow example was taken from dbt documentation.
Imagine you have an orders
table with the following columns:
id | status | updated_at |
---|---|---|
1 | pending | 2019-01-01 |
2 | pending | 2019-01-01 |
Now, imagine that the order goes from “pending” to “shipped”. That same record will now look like:
id | status | updated_at |
---|---|---|
1 | shipped | 2019-01-02 |
2 | pending | 2019-01-01 |
This order is now in the “shipped” state, but we’ve lost the information about when the order was last in the “pending” state. This makes it difficult (or impossible) to analyze how long it took for an order to ship. dbt can “snapshot” these changes to help you understand how values in a row change over time.
Here’s an example of a snapshot table for the previous example:
id | status | updated_at | valid_from | valid_to |
---|---|---|---|---|
1 | pending | 2019-01-01 | 2019-01-01 | 2019-01-02 |
1 | shipped | 2019-01-02 | 2019-01-02 | null |
2 | pending | 2019-01-01 | 2019-01-01 | null |
Creating Snapshots
Let us now create snapshots of our staging tables. Snapshot tables are named with the prefix snap_
followed by the name of the model being snapshotted.
For example, the snapshot table for the stg_users
table would be named snap_stg_users
.
You can find all the code for creating these snapshots in the snapshots/
directory
of the example-dbt-project.
Users
Merchants
Items
Promotions
Next Steps
In this section, we created snapshot tables for our staging tables. These snapshot tables will help us track changes to our data and serve as the basis for our dimension tables.
In the next section, we will use these snapshots to create the dimension tables for our warehouse.
Was this page helpful?