Codementor Events

How to take only the unique records from exports and how to handle a slowly changing dimensions in Qlik

Published Dec 19, 2019

How to take only the unique records from exports and how to handle a slowly changing dimensions in Qlik

Recently we had a project where we were supposed to use an extracted data from a system and to handle changes of records over time which might sounds seriously at the time of reading the requirements. However it is not and I will try to explain our approach handling it.

Starting from the exports (on a monthly basis) - those contained new records, corrected records, old records.
Each time we receive a new export file we were not able just to add (concatenate) the data but we have to check what we already have loaded and add only the new and changed records.
We need to store the date of the records as well.
E.g. if we have:
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/01/2020

  • if we have a changed record to
    Dealer ID = 1, Dealer Category = B, [Date(H) Dealer] = 01/05/2020
    where [Date(H) Dealer] is the date of each export file.
    In our case we need to store both records with the changed dates – here we have a slowly changing dimensions.

But how we can do it, how we can compare what is new, old, changed, … as we do not have an ID coming from the exports for each row?
The only option is to check all the records, to check if we have the combinations of all records across all the data we have extracted, e.g. we need to have a Row ID for each row.

As an example from the monthly exports we have:
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/01/2020
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/02/2020
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/03/2020
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/04/2020
Dealer ID = 1, Dealer Category = B, [Date(H) Dealer] = 01/05/2020
Dealer ID = 1, Dealer Category = B, [Date(H) Dealer] = 01/06/2020

In May we have the category of Dealer ID 1 changed from A to B.

So overall in our Qlik table we should have only the unique records in order not to store duplicated records but only those with the minimum date the records which were present.
The result should look like:
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/01/2020
Dealer ID = 1, Dealer Category = B, [Date(H) Dealer] = 01/05/2020

The script below shows what is the technique to do that.

Dealer_tmp:
LOAD
AutoNumberHash128("Dealer ID",
"Dealer Name",
"Dealer VAT Nmbr",
"Dealer Region",
"Dealer Category") as [%ID Row],
date(date#(right(FileBaseName(),8), 'YYYYMMDD')) as [Date(H) Dealer],
text("Dealer ID") as "Dealer ID",
text("Dealer Name") as "Dealer Name",
text("Dealer VAT Nmbr") as "Dealer VAT Nmbr",
text("Dealer Region") as "Dealer Region",
text("Dealer Category") as "Dealer Category"
FROM $(vG.ImportPath)\Dealer*.txt (txt, codepage is 28591, embedded labels, delimiter is '\t', msq);

Dealer_Distinct:
NoConcatenate
LOAD Distinct
[%ID Row],
"Dealer ID",
"Dealer Name",
"Dealer VAT Nmbr",
"Dealer Region",
"Dealer Category"
Resident Dealer_tmp;

Left Join (Dealer_Distinct)
Load
[%ID Row],
min([Date(H) Dealer]) as [Date(H) Dealer]
Resident Dealer_tmp
Group by [%ID Row];
Drop Table Dealer_tmp;
Store Dealer_Distinct Into $(vG.QVDPath)Dealer.qvd (qvd);
Drop Tables Dealer_Distinct;

In the first interaction we load the data from the exports as is (+formatting the fields – text or num).
By the following script
AutoNumberHash128("Dealer ID",
"Dealer Name",
"Dealer VAT Nmbr",
"Dealer Region",
"Dealer Category") as [%ID Row]
we load all the fields into one field which we call [%ID Row]
The function AutoNumberHash takes the field values and provides a unique numeric ID per each field combination – this is the unique record ID per each row.
In the first interaction we also take the date from the file name by using the following clause - date(date#(right(FileBaseName(),8), 'YYYYMMDD')) as [Date(H) Dealer]

For the second interaction we load only the distinct rows without the date of the record.
Dealer_Distinct:
NoConcatenate
LOAD Distinct
[%ID Row],
"Dealer ID",
"Dealer Name",
"Dealer VAT Nmbr",
"Dealer Region",
"Dealer Category"
Resident Dealer_tmp;
Drop Table Dealer_tmp;
Store Dealer_Distinct Into $(vG.QVDPath) Dealer.qvd (qvd);
Drop Tables Dealer_Distinct;

On the third interaction we join to Dealer_Distinct table on the [%ID Row] the minimum [Date(H) Dealer] date. We group on the [%ID Row] which is the ID of each row.
Here is the script:
Left Join (Dealer_Distinct)
Load
[%ID Row],
min([Date(H) Dealer]) as [Date(H) Dealer],
min([Date(H) Dealer]) as FromDate
Resident Dealer_tmp
Group by [%ID Row];

This way we have only the first row in time where we have this data combination:
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/01/2020
And the first row in time where we have the next change.
Dealer ID = 1, Dealer Category = B, [Date(H) Dealer] = 01/05/2020

Why we need to have it this way?
First we store only the unique records and second we can easily build our table in qlik to serve the slowly changing dimension technique.
As we already have the dates when the record occurs now we have to construct a table where we should have intervals.
E.g we need the following:
Dealer ID = 1, Dealer Category = A, [Date(H) Dealer] = 01/01/2020, From Date = 01/01/2020, To Date 30/04/2020
Dealer ID = 1, Dealer Category = B, [Date(H) Dealer] = 01/05/2020, From Date = 01/05/2019, To Date Today

The intervals will help us playing with the slowly changing dimensions.
Actually what are slowly changing dimensions? In our example it is the Dealer Category over time, it was A then B. Point in time reporting shows what was the situation back in time. So in order to represent it the easiest way is to have a bridge table between our master calendar and our dealer data in our case.
The bridge table will contain the single date from our calendar and the time periods where each date belongs to.

For example:

From Date To Date Date
01/01/2020 30/04/2020 Jan 2020
01/01/2020 30/04/2020 Feb 2020
01/01/2020 30/04/2020 Mar 2020
01/01/2020 30/04/2020 Apr 2020
01/05/2020 Today May 2020
01/05/2020 Today June 2020

Below script creates the bridge table:
Let vEndTime = Num(Now());
Let vEpsilon = Pow(2,-27);

Dealer:
NoConcatenate
Load
*,
Date(If( "Dealer ID"=Peek("Dealer ID"), Peek(FromDate) - $(#vEpsilon), (#vEndTime) )) as ToDate From [(vG.QVDPath) Dealer.qvd] (qvd);
Order By "Dealer ID", FromDate Desc;

Bridge:
IntervalMatch ( [% Date] )
LOAD FromDate, ToDate
Resident Dealer;

Using the function IntervalMatch we can link the [%Date] field from our calendar table to the corresponding intervals from our Dealer table.

In this case we have a “big” bridge table but a “small” dimensional or transactional data. E.g it saves RAM usage and creates quite simplified method to handle the slowly changing dimensions case.

As a conclusion – Qlik loves simplicity! With simple tricks you can handle quite challenging requirements and cases, this is one of the main advantage points of Qlik compared to the rest of the BI tools on the market.

Discover and read more posts from Ivan Gadzhonov
get started