Optimizing distinctcount with reduced memory consumption in PowerBI (VertiPaq)

Krešimir Ledinski

Krešimir Ledinski

Microsoft certified expert in the field of Business Intelligence. His biggest passions are DAX, M, and data modeling.

High cardinality columns are often the problematic part of the data modeling in most environments, including PowerBI.

Since the PowerBI data model resides on a columnar database, having high cardinality columns impact both performance and the size of the model. This is especially the case with columns having more than 2 millions of distinct values. These columns can occupy almost half of the data model size and are hard to compute fast DAX code against them.

There are certain techniques to decrease the column cardinality. The most known are:

  • Rounding (numeric columns),
  • Splitting columns in low and high ID number (numeric columns),
  • Splitting based on char length (text columns).

All these techniques work well when the resulting measure needs to compute additive functions. But at the same time, they all fall short in case of non-additive ones, like the DISTINCTCOUNT.

In order to compute the distinct count of the column, basically, we have 2 options.

  • Leave the high cardinality column as is (meaning having a high increase in the data model and .pbix size, which leads to an overall degradation of the model performance),
  • Use one of the above-mentioned techniques for decreasing columns cardinality – this will for sure reduce the size of the data model and increase its overall performance, but at the same time, it will drastically decrease the performance of the DISTINCTCOUNT function over the split columns. Why? Because if we want to count the distinct values of the split columns, we must combine them in one column at the query time, which leads to large materialization happening every time we interact with a report. Queries with large materialization have a high impact on query speed which results in slow calculations. The formula needed to accomplish that is not hard to write.
    But its performance on our model was critically slow. It resulted in 4 times slower execution speed compared to

In our test model with 22 mil rows in fact table, this led to the execution time of SUMMARIZE version to over 12sec. We estimate our full model will be around 80 mil rows, meaning execution time should increase to around 45 sec, which was unacceptable. On the other hand, if we were to go forward with keeping the high distinct values column as is w/o decreasing its cardinality (in the test model it had 7.5mil distinct values, so we estimated that the full model will have around 30mil), we were faced with the issue of the upload limit of the PowerBI dataset, alongside with the whole degradation of the model performance. DISTINCTCOUNT(table[InvoiceID]) is an important metric for our report, so removing it completely was also not an option. We were basically in the dead street. Until we tried the following technique.


Defining the problem

We were dealing with a high number of invoiceIDs generated for each new purchase. We were receiving around 100.000 invoices per day, with multiple items per invoice, each in a separate row (meaning we couldn’t simply use COUNTROWS function over the fact table).


The approach was simple. We needed to decrease the cardinality of the ID column, but at the same time avoid the expensive SUMMARIZE(table, table[smallID],table[highID]) function over the fact table.

The time granularity of the fact table was set on a day. And we knew for sure that the same InvoiceID can only occur in a one-day timeframe. What we did was, instead of letting each new purchase produce new ID (which led to >30 million of distinct values), we set the ID number to reset each day, starting from 0. This led to a decrease in the number of distinct values in ID column from 7.5 million to around 100k (max number of purchases in one day). The reduction of distinct values in the ID column resulted in a decrease of the whole model size by more than 30%!

Repeating ID technique

How to use this technique to get the right number of distinct values?

DISTINCTCOUNT of the new ID column will produce a wrong result since we are repeating the same IDs for each day. But if we create an iteration over each day, calculating a distinct for each day separately, and then summing that values at the end, we will get the right number.

What we need is a SUMX function over date, with DISTINCTCOUNT for each date.

The calculation is as follows:

We thought that the performance of the calculation would be slow, but that was not the case. There is an iteration over the date’s column, but in the case of 2 years of historical data, that is only 730 iterations.  For VertiPaq(95% of the calculation time is done in it), that is nothing. On the other hand, in each iteration, we are doing a distinct of maximum 100.000 values, which is considerably smaller than 30 million values in the case of DISTINCTCOUNT(fact[InvoiceID]).

We tested both techniques, and the results on our model were astonishing. We managed to reduce the .pbix file size from 219mb to 122 MB, at the same time reducing the data model (based on VERTIPAQ analyzer) by 32%. We didn’t have to sacrifice any performance for DISTINCTCOUNT and even got better performance on trended data, while we increased overall model performance by around 18%.

Comparison of the speed, model size and overall performance between three used techniques

Requirements for this technique

  1. ID column can only be used for DISTINCTCOUNT calculation. If you need this column to be a key column, or you wish to create other calculations besides distinct, that will not be possible with this approach.
  2. You will need a column on which you will be able to reset ID values for ID column (column needs to be the one for which, for each value in the column, only unique ID can occur). For our purpose the TimeID column was perfect.
  3. You will need to test this approach on a different combination of columns to get optimal performance under a different set of filters.

Hope this technique will help you all struggling with the high number of the distinct values in the fact tables. In our model, it resulted in a high decrease in model size and increase in the speed of the model(and also DISTINCT calculation). I cannot testify that this technique is suitable for all data models, you will have to test it to see how and if it fits in your data model.

Table of Content

Table of Content


Improve your PowerQuery skills with
Exceed Academy


Notify of
Inline Feedbacks
View all comments


Receive the latest updates on all business analyst news across all platforms.

By subscribing you are agreeing to our Privacy Policy.

Related blog posts

Dynamic Coalesce in PowerQuery

We can define coalesce in PowerQuery as the last value existing in a single row of multiple columns positioned from left to right. In our example that would be the last value from the month columns. In this article, we will show you how to create both static and dynamic...

Read more

Dynamic bulk insert to Power Apps collection

In one of our latest blog posts, we described a technique used to copy tables from other programs (e.g. Excel) into PowerApps collections. This solution had a drawback. Within the code, the column order was referenced to determine which column to collect. This means that, for the first column in...

Read more

Bulk insert to Power Apps collection

In this blog, we will show you the technique used to collect tables for your app by using the bulk insert. We will show you how to make the same table structure in Power Apps as Excel tables. We will use Excel to show you how to copy-paste tabular data...

Read more

Why should you start using Power Apps?

In this article, we will take a look at what is Power Apps, why should we use it, and when organizations can benefit from custom apps created in the Power Apps platform.Many organizations are not yet familiarized with Power Apps capabilities, and the ways it can improve their agility, modernize and automate manual processes.   What is Power...

Read more
Would love your thoughts, please comment.x
Scroll to Top