Staging Layer
In the previous section, we listed the raw tables that we have in our database. Every data modeling journey begins with similar raw tables.
These tables contain the data that is needed for your application to function. Such tables often contain a lot of irregularities and inconsistencies in the way data is stored or represented. This irregularities are a result of the way data is collected and stored in the source systems.
This is where the staging layer comes in. The staging layer is used to clean and rename our columns so that they are consistent and easy to work with. This layer doesn’t contain any transformations or business logic, it’s just a copy of the raw tables with some cleaning and renaming.
Let us work through the raw tables in our database and create a staging layer for each of them.
We prefix our table names with stg_
to indicate that they are staging tables.
You can find all the code for creating these staging tables in the models/staging/product
directory
of the example-dbt-project.
Create Staging Tables
stg_users
The columns of our users
table need the following changes:
user_id
name
- Should be renamed touser_name
email_address
- Should be renamed touser_email
lat
- Should be renamed tolatitude
long
- Should be renamed tolongitude
createdAt
- Should be renamed tocreated_at
updatedAt
- Should be renamed toupdated_at
The following dbt model creates a staging table for the users
table.
stg_merchants
The columns of our merchants
table need the following changes:
merchant_id
merchant_name
geo_loc
- This is a custom datatype that stores “latitude/longitude” as a single value. We should split this into two columnslatitude
andlongitude
createdAt
- Should be renamed tocreated_at
updatedAt
- Should be renamed toupdated_at
The following dbt model creates a staging table for the merchants
table.
stg_items
The columns of our items
table need the following changes:
item_id
merchant_id
name
- Should be renamed toitem_name
because explicit is better than implicitcategory
- Should be renamed toitem_category
because explicit is better than implicitprice
- Should be renamed tobase_price_usd
so that it’s clear that the price is in USDcreated_at
updated_at
The following dbt model creates a staging table for the items
table.
stg_promotions
The columns of our promotions
table need the following changes:
promotion_id
product_id
- The use of the word ‘product’ is because of legacy reasons. We should rename this toitem_id
to match theitems
tablediscount
- By default we store the discount percentage as an integer e.g. If we offer 20% discount, this value is 20. We will convert this to a float by dividing by 100 to make it easier to work with. We will also rename it todiscount_percentage_as_float
created_at
updated_at
The following dbt model creates a staging table for the promotions
table.
stg_inventory
The columns of our inventory
table need the following changes:
inventory_id
merchant_id
item_id
qty
- Should be renamed toquantity
for claritydate_acquired
- Should be renamed toacquisition_date
because we keep the units as suffixescost
- Should be renamed tounit_cost_usd
so that it’s clear that the price is in USD and we’re talking about the cost of a single unitcreated_at
updated_at
We’ll also add a column total_cost_usd
which is the product of quantity
and unit_cost_usd
since this is a common calculation that we’ll need to do.
The following dbt model creates a staging table for the inventory
table.
stg_orders
The columns of our orders
table need the following changes:
order_id
customer_id
- Should be renamed touser_id
to match theusers
tablestore_id
- Should be renamed tomerchant_id
to match themerchants
tabledate
- Should be renamed toorder_placed_at
because it’s a timestamp and not a date, and we want to clarify that it’s the time when the order was placedcreated_at
updated_at
The following dbt model creates a staging table for the orders
table.
stg_order_items
The columns of our order_items
table need the following changes:
order_item_id
order_id
product_id
- Should be renamed toitem_id
to match theitems
tableqty
- Should be renamed toquantity
for claritycreated_at
updated_at
The following dbt model creates a staging table for the order_items
table.
Next Steps
In this section, we have created staging tables for each of the raw tables in our database. These staging tables are used to clean and rename our columns so that they are consistent and easy to work with.
The core of a data warehouse is made up of tables known as facts and dimensions. Facts represent actions or events, while dimensions represent the context in which these actions or events occur.
In the next section, we will use the snapshot
functionality in dbt to create snapshots of our staging tables.
These snapshots will allow us to track changes to our data over time and create a history of our data.
Such a table that contains historical records of your data is called a slowly changing dimension (SCD) table.
Once we have our SCD tables, we will create our fact tables.
Was this page helpful?