Optimizing DISTINCTCOUNT with reduced memory consumption in Power BI (VertiPaq)

High-cardinality columns are often a problematic part of data modeling in most environments, and Power BI is no exception.

Since the Power BI data model resides in a columnar database, high-cardinality columns impact both the performance and the size of the model. This is especially true for columns with more than 2 million distinct values. Such columns can occupy almost half of the data model size, and it is hard to write fast DAX code against them.


There are several techniques for decreasing column cardinality. The best known are:

  • Rounding (numeric columns),
  • Splitting a column into a low and a high ID number (numeric columns) – a short sketch follows this list,
  • Splitting based on character length (text columns).
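
To illustrate the splitting technique, here is a minimal sketch using two DAX calculated columns. The divisor of 10,000 and the column names are assumptions for this example; in practice the split is usually done in Power Query or at the source, so that the original wide column never reaches the model:

HighID = INT ( table[InvoiceID] / 10000 )    -- upper part of the ID, few distinct values
LowID = MOD ( table[InvoiceID], 10000 )    -- lower part of the ID, at most 10,000 distinct values

The original value can be reconstructed as HighID * 10000 + LowID, and it is exactly this recombination at query time that becomes expensive for non-additive calculations.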

All these techniques work well when the resulting measure computes additive functions, but they all fall short for non-additive ones such as DISTINCTCOUNT.

To compute the distinct count of the column, we basically have two options:

  • Leave the high-cardinality column as is (meaning a large increase in the data model and .pbix size, which leads to an overall degradation of model performance),
  • Use one of the above-mentioned techniques for decreasing column cardinality. This will certainly reduce the size of the data model and increase its overall performance, but at the same time it will drastically decrease the performance of the DISTINCTCOUNT function over the split columns. Why? Because to count the distinct values of the split columns, we must combine them into one column at query time, which leads to large materialization every time we interact with the report. Queries with large materialization have a high impact on query speed, which results in slow calculations. The formula needed to accomplish this is not hard to write, but its performance on our model was critically slow:
DistinctCountSlow =
COUNTROWS ( SUMMARIZE ( table, table[LowID], table[HighID] ) )

It resulted in an execution speed four times slower than

DistinctCount = DISTINCTCOUNT ( table[InvoiceID] )

In our test model, with 22 million rows in the fact table, the SUMMARIZE version took over 12 seconds to execute. We estimate our full model will have around 80 million rows, meaning the execution time should grow to around 45 seconds, which was unacceptable. On the other hand, if we kept the high-cardinality column as is, without decreasing its cardinality (in the test model it had 7.5 million distinct values, so we estimated the full model would have around 30 million), we faced the upload limit of the Power BI dataset, along with an overall degradation of model performance. DISTINCTCOUNT(table[InvoiceID]) is an important metric for our report, so removing it completely was not an option either. We were basically at a dead end, until we tried the following technique.


OPTIMIZATION TECHNIQUE

Defining the problem

We were dealing with a high number of invoice IDs generated for each new purchase. We were receiving around 100,000 invoices per day, with multiple items per invoice, each in a separate row (meaning we couldn't simply use the COUNTROWS function over the fact table).
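
To make the row-versus-invoice distinction concrete, here is a minimal sketch (measure names are illustrative): with several line items per invoice, COUNTROWS counts every item row, while only a distinct count over the ID returns the number of invoices:

InvoiceLineCount = COUNTROWS ( fact )    -- counts every line item row
InvoiceCount = DISTINCTCOUNT ( fact[InvoiceID] )    -- counts each invoice only once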

Solution

The approach was simple. We needed to decrease the cardinality of the ID column, but at the same time avoid the expensive SUMMARIZE ( table, table[LowID], table[HighID] ) pattern over the fact table.

The time granularity of the fact table was one day, and we knew for sure that the same InvoiceID could only occur within a single day. So, instead of letting each new purchase produce a new ID (which would have led to around 30 million distinct values in the full model), we set the ID number to reset each day, starting from 0. This decreased the number of distinct values in the ID column from 7.5 million to around 100,000 (the maximum number of purchases in one day). The reduction of distinct values in the ID column decreased the whole model size by more than 30%!

Repeating ID technique
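
The re-numbering is best done before the data reaches the model (in the source query or in Power Query), so that the original wide InvoiceID is never loaded. Purely to illustrate the logic, here is a hedged DAX calculated column sketch that assigns each invoice a dense rank within its day (column names follow the examples above and are assumptions about your model):

NewInvoiceID =
RANKX (
    FILTER ( fact, fact[date] = EARLIER ( fact[date] ) ),    -- all rows of the same day
    fact[InvoiceID],
    ,
    ASC,
    DENSE    -- every row of the same invoice gets the same number, restarting for each day
)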


How to use this technique to get the right number of distinct values?

DISTINCTCOUNT over the new ID column alone will produce a wrong result, since we are repeating the same IDs each day. But if we iterate over each day, calculate the distinct count for each day separately, and then sum those values at the end, we will get the right number.

What we need is a SUMX function over the date column, with a DISTINCTCOUNT for each date.

The calculation is as follows:

DistinctCountNew =
SUMX (
    VALUES ( fact[date] ),
    CALCULATE ( DISTINCTCOUNT ( fact[NewInvoiceID] ) )
)

We thought that the performance of this calculation would be slow, but that was not the case. There is an iteration over the date column, but with 2 years of historical data, that is only around 730 iterations. For VertiPaq (where 95% of the calculation time is spent), that is nothing. On the other hand, in each iteration we are computing a distinct count over at most 100,000 values, which is considerably smaller than the 30 million values in the case of DISTINCTCOUNT(fact[InvoiceID]).

We tested both techniques, and the results on our model were astonishing. We managed to reduce the .pbix file size from 219 MB to 122 MB, while shrinking the data model (measured with VertiPaq Analyzer) by 32%. We didn't have to sacrifice any performance for DISTINCTCOUNT, even got better performance on trended data, and increased overall model performance by around 18%.

Comparison of the speed, model size and overall performance between the three techniques used


Requirements for this technique

  1. The ID column can only be used for DISTINCTCOUNT calculations. If you need this column as a key column, or you wish to create calculations over it other than the distinct count, that will not be possible with this approach.
  2. You will need a column on which to reset the ID values (a column for which each ID value can occur within only one of its values) – see the validation sketch after this list. For our purpose the TimeID column was perfect.
  3. You will need to test this approach on different combinations of columns to get optimal performance under different sets of filters.
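
For requirement 2, a quick sanity check can confirm that the chosen reset column really isolates each ID. The sketch below (column names follow the examples above and are assumptions about your model) returns 1 when every original InvoiceID occurs on a single date, i.e. when the date column is a valid reset column; run it against the source or a test model that still contains the original column:

InvoiceSpansMaxDates =
MAXX (
    VALUES ( fact[InvoiceID] ),
    CALCULATE ( DISTINCTCOUNT ( fact[date] ) )    -- number of different dates this invoice appears on
)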

Hope this technique helps everyone struggling with a high number of distinct values in their fact tables. In our model it resulted in a large decrease in model size and an increase in the speed of the model (and of the DISTINCTCOUNT calculation). I cannot guarantee that this technique suits every data model; you will have to test it to see how, and whether, it fits yours.
