Good warehouse layers are the foundation of a well-designed data warehouse. They provide a clear separation of concerns between the different stages of your ETL process and make it easier to manage and maintain your data warehouse. Separating your data into layers also allows you to control who gets to see what data and improve data discoverability. As a general principle, you should at least have the following layers in your data warehouse. You should separate your staging, intermediate and consumption layer data tables into separate schemas.Documentation Index
Fetch the complete documentation index at: https://handbook.pragunbhutani.com/llms.txt
Use this file to discover all available pages before exploring further.
Source
Source layers can be of two types - Raw Data (with no transformation) or Staging (with minimal data cleaning and homogenisation).Raw Data
Contains replicas of data as it is at source, without any transformation or changes whatsoever. Table and column names should match those in the original source without any modifications. Create a separate namespace for each data source. Each production database or cluster should be considered a separate source. This layer should only be visible to builders of ETL pipelines and not to end users.Staging
Contains views on top of your raw data layer that contain the following transformations:- Combining partial data together into a single table. e.g. Combining active and deleted records.
- Rename columns to follow a consistent naming scheme across all your tables.
- Type cast columns for consistency or ease of use. e.g. converting all timestamps to UTC, all boolean columns to TRUE/FALSE instead of 1/0 etc.