
Optimizing DISTINCTCOUNT with reduced memory consumption in PowerBI (VertiPaq)

Krešimir Ledinski

Microsoft certified expert in the field of Business Intelligence. His biggest passions are DAX, M, and data modeling.

High-cardinality columns are often a problematic part of data modeling in most environments, including PowerBI.

Since the PowerBI data model is stored in a columnar database, high-cardinality columns affect both the performance and the size of the model. This is especially true for columns with more than 2 million distinct values. Such columns can occupy almost half of the data model size, and it is hard to write fast DAX code against them.


There are several techniques to decrease column cardinality. The best known are:

  • Rounding (numeric columns),
  • Splitting a column into a low and a high ID part (numeric columns; a sketch of this one follows below),
  • Splitting based on character length (text columns).
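
To illustrate the splitting technique, a numeric ID can be divided into two low-cardinality parts. The column names and the divisor below are only an example and not part of the original model; in practice the split is usually done in Power Query or at the source, and the original high-cardinality column is removed afterwards.

-- Hypothetical split of a numeric InvoiceID into two parts
highID = INT ( fact[InvoiceID] / 10000 )    -- the "high" part of the ID
smallID = MOD ( fact[InvoiceID], 10000 )    -- the "low" part, at most 10,000 distinct values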

All these techniques work well when the resulting measure needs to compute additive functions. But they all fall short with non-additive ones, such as DISTINCTCOUNT.

To compute the distinct count of such a column, we basically have two options:

  • Leave the high-cardinality column as is (accepting a large increase in the data model and .pbix size, which leads to an overall degradation of model performance),
  • Use one of the above-mentioned techniques to decrease the column's cardinality. This will certainly reduce the size of the data model and improve its overall performance, but it will also drastically decrease the performance of the DISTINCTCOUNT function over the split columns. Why? Because to count the distinct values of the split columns, we must combine them back into one column at query time, which leads to a large materialization every time we interact with the report. Queries with large materialization have a high impact on query speed, which results in slow calculations. The formula needed to accomplish that is not hard to write (a sketch follows below), but its performance on our model was critically slow: roughly 4 times slower than a plain DISTINCTCOUNT over the original, unsplit column.
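
A sketch of that kind of formula, assuming the ID was split into fact[smallID] and fact[highID] as above: the distinct count of the original column equals the number of distinct combinations of the two split columns.

-- Counts distinct (smallID, highID) combinations, i.e. distinct original IDs
Distinct Invoices Split :=
COUNTROWS (
    SUMMARIZE ( fact, fact[smallID], fact[highID] )
)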

In our test model with 22 million rows in the fact table, the SUMMARIZE version took over 12 seconds to execute. We estimated that our full model would be around 80 million rows, which would push the execution time to around 45 seconds, and that was unacceptable. On the other hand, if we kept the high-cardinality column as is, without decreasing its cardinality (in the test model it had 7.5 million distinct values, so we estimated the full model would have around 30 million), we faced the upload limit of the PowerBI dataset, along with an overall degradation of model performance. DISTINCTCOUNT(table[InvoiceID]) is an important metric for our report, so removing it completely was not an option either. We were basically at a dead end. Until we tried the following technique.


OPTIMIZATION TECHNIQUE

Defining the problem

We were dealing with a high number of InvoiceIDs generated for each new purchase. We were receiving around 100,000 invoices per day, with multiple items per invoice, each in a separate row (meaning we couldn't simply use the COUNTROWS function over the fact table).

Solution

The approach was simple. We needed to decrease the cardinality of the ID column, but at the same time avoid the expensive SUMMARIZE(table, table[smallID], table[highID]) function over the fact table.

The time granularity of the fact table was one day, and we knew for sure that the same InvoiceID could only occur within a one-day timeframe. So instead of letting each new purchase produce a new ID (which led to >30 million distinct values), we set the ID number to reset each day, starting from 0. This decreased the number of distinct values in the ID column from 7.5 million to around 100k (the maximum number of purchases in one day). The reduction of distinct values in the ID column decreased the whole model size by more than 30%!
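
Purely to illustrate the mapping, here is a hypothetical DAX calculated column that produces such a daily-resetting ID. This is not how we built it; in our case the repeating ID was generated upstream at the source and the original InvoiceID column was removed, otherwise there would be no size benefit.

-- For each row, rank its InvoiceID among the invoices of the same day (dense),
-- so every distinct invoice within a day gets a small consecutive number
DailyID =
VAR CurrentDay = fact[TimeID]
RETURN
    RANKX (
        FILTER ( fact, fact[TimeID] = CurrentDay ),
        fact[InvoiceID],
        ,
        ASC,
        DENSE
    )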

Repeating ID technique


How to use this technique to get the right number of distinct values?

A DISTINCTCOUNT over the new ID column would produce a wrong result, since the same IDs repeat every day. But if we iterate over each day, calculate the distinct count for each day separately, and then sum those values at the end, we get the right number.

What we need is a SUMX over the dates, with a DISTINCTCOUNT for each date.

The calculation is as follows:
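
A minimal sketch of such a measure, assuming a date table related to the fact table and a daily-resetting fact[DailyID] column (the names are ours):

-- Sum of per-day distinct counts; correct because an InvoiceID never spans two days
Distinct Invoices :=
SUMX (
    VALUES ( 'Date'[Date] ),
    CALCULATE ( DISTINCTCOUNT ( fact[DailyID] ) )
)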

We expected the performance of the calculation to be slow, but that was not the case. There is an iteration over the date column, but with 2 years of historical data, that is only 730 iterations. For VertiPaq (where 95% of the calculation time is spent), that is nothing. On the other hand, in each iteration we compute a distinct count over at most 100,000 values, which is considerably smaller than the 30 million values in the case of DISTINCTCOUNT(fact[InvoiceID]).

We tested both techniques, and the results on our model were astonishing. We managed to reduce the .pbix file size from 219 MB to 122 MB, while reducing the data model size (measured with VertiPaq Analyzer) by 32%. We didn't have to sacrifice any DISTINCTCOUNT performance (we even got better performance on trended data), and overall model performance increased by around 18%.

Comparison of speed, model size and overall performance between the three techniques used


Requirements for this technique

  1. The ID column can only be used for the DISTINCTCOUNT calculation. If you need this column as a key column, or you wish to create other calculations besides the distinct count, that will not be possible with this approach.
  2. You will need a column on which to reset the ID values (a column for which, for each of its values, a given ID can occur only once). For our purpose the TimeID column was perfect.
  3. You will need to test this approach on different combinations of columns to get optimal performance under different sets of filters.

We hope this technique will help all of you struggling with a high number of distinct values in your fact tables. In our model, it resulted in a large decrease in model size and an increase in the speed of the model (and of the DISTINCTCOUNT calculation). I cannot guarantee that this technique is suitable for every data model; you will have to test it to see how, and if, it fits your data model.
