Skip to content
Weißes virtual7-Logo auf grünem Hintergrund mit stilisierter Sieben im Hintergrund.

Development & Consulting

DWH

Finance

Persistant Staging Area

Björn Möllers

Copy Link

Link copied

In this article, I would like to introduce the Persistant Staging Area – PSA for short – which Roelant Vos describes in his blog. This is used successfully in projects that we support.

DWH Base

A data warehouse (DWH) is often divided into three or four basic layers. The staging area is the first layer and takes the data 1:1 from the source systems. In the optional cleansing area, data is filtered, cleansed and checked before it is transferred to the third layer, the core. The core is responsible for the integration and historization of data within a DWH. In the fourth layer, the data marts, parts of the data are usually made available for specific user groups, e.g. sales, logistics, etc.

DWH Base Architecture

Figure 1: Basic data warehouse architecture

Stage and PSA

The staging layer can in turn be divided into a “landing area” (original staging area, or stage for short) and a “persistent area”.

The landing area takes over the stage’s original task of forming the interface to the source systems and extracting data from them via delta load.

The persistent staging area, on the other hand, stores the raw data from the source system for the long term and only receives data via the upstream landing area. The table structure of the PSA does not change compared to the landing area.

DWH with Persistant Staging Area

Figure 2: Basic data warehouse architecture with division of the stage into landing area and PSA

Why a PSA?

The historization of the raw data from the source system creates an easily manageable archive of the source tables. If necessary, the entire core can be recreated or reloaded from this archive without having to fall back on the source systems. Historization in the core can also be restored 1:1 by the PSA.

The PSA also enables DWH developers and administrators to precisely track the history of individual data records. The processing of records from the PSA to the core can be tracked more quickly. This option has often proven to be very helpful when working with customers.

Finally, the PSA can also serve as an Operational Data Store (ODS) for data analytics methods.

On the other hand, additional ETL routes and an increased storage volume for the DWH must be accepted.

Implementation

The table structure of the PSA, the stage and the source system are identical, except for the attributes to be added in the DWH. The source data is therefore available 1:1 in the PSA. As usual, foreign key relationships are removed in the entire stage layer.

Historization is carried out using the “Slowly Changing Dimension Type 2” (SCD2) method. To make this possible, the following attributes must be added for each PSA table:

  • Primary key – consisting of ID and loading date
  • Unique ETL process ID – e.g. DWH_Lade_ID, for reconciliation between stage and PSA
Table structure: Source, Stage, PSA

Figure 3: Table structure of source, stage and PSA: Example “Customer” table

In this example, the DWH_LADE_ID and DWH_GUELTIG_VON fields have been added to the stage. In the PSA, DWH_GUELTIG_BIS and DWH_AKTIV have been added in addition to the additional attributes of the stage. Although these are not absolutely necessary for an SCD2-ETL load, they are useful for comprehensibility and evaluability by the developers.

The delta load takes place from source to stage with the help of the ANALAGE TIME and MODIFICATION TIME fields.

From stage to PSA via SCD2 Delta Load, using DWH_GUELTIG_VON and CUSTOMER ID from the stage table. A distinction is made between the following cases:

  1. New data record –> insert
  2. Existing data record and changes –> insert and “close” old record
  3. Existing data record, no changes –> do nothing

Advantage: Automation

The simple structure enables the automatic generation of landing and PSA tables based on the metadata from the sources. The ETL process from source to PSA can also be generated fully automatically. This compensates for the disadvantage of additional ETL routes.

Sources

http://roelantvos.com/blog/

https://www.kimballgroup.com/


cPersistant Staging Area
Data Warehouse
Data Warehouse Architeture
Staging Area

Warning: file_exists(): open_basedir restriction in effect. File(action-scheduler-en_US.mo) is not within the allowed path(s): (/var/www/vhosts/digitalewege.de/:/tmp/) in /var/www/vhosts/digitalewege.de/v7-stage.digitalewege.de/virtual7de_alphaalpaka_2024/web/app/plugins/wpml-string-translation/classes/MO/Hooks/LoadTranslationFile.php on line 82

Warning: file_exists(): open_basedir restriction in effect. File(action-scheduler-en_US.l10n.php) is not within the allowed path(s): (/var/www/vhosts/digitalewege.de/:/tmp/) in /var/www/vhosts/digitalewege.de/v7-stage.digitalewege.de/virtual7de_alphaalpaka_2024/web/app/plugins/wpml-string-translation/classes/MO/Hooks/LoadTranslationFile.php on line 85

Warning: realpath(): open_basedir restriction in effect. File(/) is not within the allowed path(s): (/var/www/vhosts/digitalewege.de/:/tmp/) in /var/www/vhosts/digitalewege.de/v7-stage.digitalewege.de/virtual7de_alphaalpaka_2024/web/wp/wp-includes/l10n/class-wp-translation-controller.php on line 106