Optimizing DISTINCTCOUNT with reduced memory consumption in Power BI (VertiPaq)

High-cardinality columns are often the most problematic part of data modeling in most environments, and Power BI is no exception.

Since the Power BI data model resides in a columnar database, high-cardinality columns affect both the performance and the size of the model. This is especially true for columns with more than 2 million distinct values: such columns can occupy almost half of the data model's size, and it is hard to write fast DAX code against them.


There are several techniques for decreasing column cardinality. The best known are:

  • Rounding (numeric columns),
  • Splitting a column into a low and a high ID part (numeric columns; see the sketch after this list),
  • Splitting based on character length (text columns).
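For illustration, here is a minimal sketch of the "low/high split" written as DAX calculated columns. The divisor and the column names LowID/HighID are our own choices; in practice the split is usually done in Power Query or at the source, so that the original column never reaches the model at all.

LowID = MOD ( table[InvoiceID], 10000 )      -- low part: values 0 to 9,999
HighID = INT ( table[InvoiceID] / 10000 )    -- high part: the remaining digits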

All these techniques work well when the resulting measure computes additive functions, but they all fall short for non-additive ones such as DISTINCTCOUNT.

To compute the distinct count of such a column, we basically have two options:

  • Leave the high-cardinality column as is, accepting a large increase in the data model and .pbix size, which degrades overall model performance;
  • Use one of the above-mentioned techniques to decrease the column's cardinality. This will certainly reduce the size of the data model and improve its overall performance, but it will drastically decrease the performance of DISTINCTCOUNT over the split columns. Why? Because to count the distinct values of the split columns, we must combine them back into one column at query time, which causes large materialization every time we interact with the report. Queries with large materialization have a high impact on query speed, which results in slow calculations. The formula needed to accomplish this is not hard to write, but its performance on our model was critically slow.
DistinctCountSlow =
COUNTROWS ( SUMMARIZE ( table, table[LowID], table[HighID] ) )

It executed about four times slower than

DistinctCount = DISTINCTCOUNT ( table[InvoiceID] )

In our test model, with 22 million rows in the fact table, the SUMMARIZE version took over 12 seconds to execute. We estimate that our full model will have around 80 million rows, which would push the execution time to around 45 seconds, which was unacceptable. On the other hand, if we kept the high-cardinality column as is, without decreasing its cardinality (in the test model it had 7.5 million distinct values, so we estimated around 30 million in the full model), we faced the upload limit for Power BI datasets, along with an overall degradation of model performance. DISTINCTCOUNT(table[InvoiceID]) is an important metric for our report, so removing it completely was not an option either. We were basically at a dead end, until we tried the following technique.


OPTIMIZATION TECHNIQUE

Defining the problem

We were dealing with a high number of invoice IDs, one generated for each new purchase. We were receiving around 100,000 invoices per day, with multiple items per invoice, each item in a separate row (meaning we couldn't simply use the COUNTROWS function over the fact table, as illustrated below).
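To make that last point concrete, here is a minimal illustration (the measure names are ours) of why a plain row count does not work when each invoice spans several item rows:

InvoiceLineCount = COUNTROWS ( fact )                  -- counts every item row, so an invoice with three line items is counted three times
InvoiceCount = DISTINCTCOUNT ( fact[InvoiceID] )       -- counts each invoice exactly once, regardless of how many rows it spans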

Solution

The approach was simple. We needed to decrease the cardinality of the ID column while avoiding the expensive SUMMARIZE ( table, table[LowID], table[HighID] ) over the fact table.

The time granularity of the fact table was one day, and we knew for sure that the same InvoiceID could occur only within a one-day timeframe. So instead of letting each new purchase produce a new ID (which led to more than 30 million distinct values), we reset the ID number each day, starting from 0. This decreased the number of distinct values in the ID column from 7.5 million to around 100,000 (the maximum number of purchases in one day). The reduction of distinct values in the ID column shrank the whole model by more than 30%!

Repeating ID technique
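How the daily-resetting ID is generated will depend on your data pipeline. Purely as an illustration, here is one possible sketch as a DAX calculated column, using a per-day dense rank (our own formulation; in a real model this step is better done in Power Query or at the source, so the original high-cardinality InvoiceID never has to be loaded at all):

NewInvoiceID =
RANKX (
    FILTER ( fact, fact[date] = EARLIER ( fact[date] ) ),  -- all rows of the same day
    fact[InvoiceID],
    ,
    ASC,
    DENSE    -- rows of the same invoice get the same per-day ID, starting from 1
)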


How to use this technique to get the right number of distinct values?

DISTINCTCOUNT over the new ID column would produce a wrong result, since the same IDs repeat every day. But if we iterate over each day, calculate a distinct count for each day separately, and then sum those values at the end, we get the right number.

What we need is a SUMX over the dates, with a DISTINCTCOUNT for each date.

The calculation is as follows:

DistinctCountNew =
SUMX (
    VALUES ( fact[date] ),
    CALCULATE ( DISTINCTCOUNT ( fact[NewInvoiceID] ) )
)

We expected the performance of this calculation to be slow, but that was not the case. There is an iteration over the date column, but with 2 years of historical data, that is only about 730 iterations. For VertiPaq (where around 95% of the calculation time is spent), that is nothing. Within each iteration, on the other hand, we are computing a distinct count over at most 100,000 values, which is considerably smaller than the 30 million values in the case of DISTINCTCOUNT ( fact[InvoiceID] ).

We tested both techniques, and the results on our model were astonishing. We managed to reduce the .pbix file size from 219 MB to 122 MB, while reducing the data model size (measured with VertiPaq Analyzer) by 32%. We didn't have to sacrifice any DISTINCTCOUNT performance, and we even got better performance on trended data, while overall model performance improved by around 18%.

Comparison of speed, model size, and overall performance between the three techniques used


Requirements for this technique

  1. The ID column can only be used for the DISTINCTCOUNT calculation. If you need it as a key column, or you wish to create calculations other than the distinct count on it, this approach will not work.
  2. You need a column against which the ID values can be reset, that is, a column for which a given original ID can occur only within a single one of its values. For our purpose, the TimeID column was perfect.
  3. You will need to test this approach with different combinations of columns to get optimal performance under different sets of filters; a simple sanity-check measure is sketched below this list.
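One simple way to perform that testing (assuming a development copy of the model that still contains the original InvoiceID column) is a sanity-check measure that compares the old and the new calculation; it should return 0 under every filter combination you care about. The measure name is ours:

DistinctCountCheck =
DISTINCTCOUNT ( fact[InvoiceID] )
    - SUMX (
        VALUES ( fact[date] ),
        CALCULATE ( DISTINCTCOUNT ( fact[NewInvoiceID] ) )
    )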

We hope this technique helps everyone struggling with a high number of distinct values in their fact tables. In our model, it resulted in a large decrease in model size and an increase in the speed of the model (and of the DISTINCTCOUNT calculation). We cannot guarantee that this technique suits every data model; you will have to test it to see whether and how it fits yours.
