Optimized Load

In large applications, it is important to optimize data loading to keep load times short. One of the tools available for this is the optimized load, which reads QVD files in a very efficient way.



It is very common to store information from various data sources in intermediate QVD files. A QVD file can contain a table with multiple columns and as many rows as needed. QVD files serve two purposes: they decouple the data source from repeated reads of the same information by several different apps, and they make it possible to apply transformation logic in several steps.


When we then read data from a QVD file into an app, it can be read in two ways: optimized and non-optimized. An optimized load is considerably faster but puts strict demands on how the read is done. For a load to be optimized, we may only:

  • Filter data with a single exists() function

  • Rename columns

  • Read a subset of the columns

  • Keep/Join against another table

  • Read a field several times

  • Read data distinct

Thus we can NOT:

  • Calculate a new column (e.g. ColumnA + 1 as Column3)

  • Use where clauses other than a single exists() function

  • Use ApplyMap()
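
As a sketch of the rules above, the following load stays optimized: it only takes a subset of columns, renames one of them, and filters with a single exists(). The file and field names here are hypothetical, and the commented-out lines show operations that would each break optimization:

// Stays optimized: column subset, rename, single exists() filter
// (MyData.qvd and its field names are made up for illustration)
Data:
LOAD CustomerID as Customer,
     OrderDate,
     Amount
FROM [MyData.qvd] (qvd)
WHERE exists(CustomerID);

// Each of these would force a non-optimized load:
//   Amount * 1.25 as AmountInclVAT         calculated column
//   WHERE Amount > 0                       where clause other than exists()
//   ApplyMap('CustomerMap', CustomerID)    mapping function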

As an example, we will test-read a QVD file with 131 million rows and 13 columns (roughly 4 GB of data) describing taxi journeys in New York. We are loading the data into QlikView Desktop April 2020 SR1.


To load data as optimized, we use:

Data:
LOAD VendorID, 
     tpep_pickup_datetime, 
     tpep_dropoff_datetime, 
     passenger_count, 
     trip_distance, 
     pickup_longitude, 
     pickup_latitude, 
     dropoff_longitude, 
     dropoff_latitude, 
     payment_type, 
     fare_amount, 
     tip_amount, 
     trip_time
FROM [F:\Temp\NY_ShortFull.qvd] (qvd);

You can see that the load is optimized from the load log:

Data << NY_ShortFull (row-based qvd optimized) 131,165,043 Lines fetched

To force the load to be non-optimized without adding any real complexity (which would in itself impair performance), we add a trivial where clause:

Data:
LOAD VendorID, 
     tpep_pickup_datetime, 
     tpep_dropoff_datetime, 
     passenger_count, 
     trip_distance, 
     pickup_longitude, 
     pickup_latitude, 
     dropoff_longitude, 
     dropoff_latitude, 
     payment_type, 
     fare_amount, 
     tip_amount, 
     trip_time
FROM [F:\Temp\NY_ShortFull.qvd] (qvd) where 1=1;

For the non-optimized load, nothing special appears in the log:

Data << NY_ShortFull 131,165,043 Lines fetched

So how much difference does it make to the load time? The optimized load took 27 seconds on my desktop client machine, while the non-optimized load took 62 seconds on the same machine: more than twice as long.


Filtering values

So what do you do if you want to filter on certain values? Suppose we want to load data from the QVD file above, but only the rows where VendorID is 1 or 3. We cannot add a where clause like the one below, as it would make the read non-optimized:

<LOAD statement>
where VendorID=1 or VendorID=3;

However, we are allowed to use the exists() function. Exists() compares values with the already loaded values of a field. We can use this by first loading VendorID values 1 and 3 into a field and then adding a where clause with exists():

Load * INLINE [
    VendorID
    1
    3
];

<LOAD statement>
where exists(VendorID);

In the first non-optimized case, the data load of 61 million lines takes 61 seconds and in the second case with the optimized load, it takes 20 seconds.
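
Two caveats are worth noting with this pattern (based on commonly documented Qlik behavior, not measured here). First, the helper table created by the inline load remains in the data model and should normally be dropped after the main load. Second, only the single-argument form of exists() preserves optimization; the two-argument form exists(Field, Expression) makes the load non-optimized. A sketch of the cleanup:

// Name the helper table so it can be dropped afterwards
VendorFilter:
Load * INLINE [
    VendorID
    1
    3
];

<LOAD statement>
where exists(VendorID);

// Remove the helper table so it does not linger in the data model
DROP TABLE VendorFilter;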


Saved values

Sometimes we save data in snapshots at a certain frequency, which are then merged in an application to see how something changes over time. Examples include Work in Progress (WIP), stock balances and the like.


If we take snapshots of stock balances as an example, we can consider how best to calculate a valuation of the stock. A common, simple way to calculate stock value is to multiply the balance by the standard cost of each item:

StockBalance*StandardCost as StockValue

Now think about what we have learned above. Where should we calculate StockValue? Should we do it in the QVD creator that saves a snapshot every month, or in the application that loads all the snapshots? Compare below:


StockValue is calculated when snapshots are stored:

QVD creator:

Let vToday = text(date(today(),'YYYYMMDD'));
Stock:
Load
    ArticleNo,
    StockBalance,
    StandardCost,
    StockBalance*StandardCost as StockValue
from Stock.qvd (qvd);
store Stock into Snapshot_Stock_$(vToday).qvd (qvd);

Finished application:

Load * from Snapshot_Stock*.qvd (qvd);

StockValue is calculated when snapshots are loaded into the application:

QVD creator:

Let vToday = text(date(today(),'YYYYMMDD'));
Stock:
Load
    ArticleNo,
    StockBalance,
    StandardCost
from Stock.qvd (qvd);
store Stock into Snapshot_Stock_$(vToday).qvd (qvd);

Finished application:

Load
    *,
    StockBalance*StandardCost as StockValue
from Snapshot_Stock*.qvd (qvd);

Which of the two cases above is faster? The answer is the first one, where StockValue is saved in the snapshot. Since the field already exists in the QVD, the file can be read optimized. In the second case we perform a calculation while loading the data, which makes the read non-optimized.


Should everything be optimized?

So, does this mean that best practice is to load everything optimized? No. Assess what you have to gain from optimizing the load. It is probably not worth spending a few hours optimizing something where you can only save a minute, unless you reload the app very frequently.


Written by: Morgan Kejerhag

Morgan Kejerhag has worked with the Qlik platform since 2005 and is one of Sweden's most experienced consultants. Over the years, Morgan has worked with several multinational companies, where he has led the work of building both small and large Qlik environments. Contact information on LinkedIn.