The APPROX_COUNT_DISTINCT Function – A Test Case

The aim of this post is not to explain how the APPROX_COUNT_DISTINCT function works (you find basic information in the documentation and in this post written by Luca Canali), but to show you the results of a test case I run to assess how well it works.

Here’s what I did…

I created a table with several numerical columns (the name of the column shows how many distinct values it contains), loaded 100 million rows into it (the size of the segment is 12.7 GB), and gathered the object statistics.

SQL> CREATE TABLE t
  2  AS
  3  WITH
  4    t1000 AS (SELECT /*+ materialize */ rownum AS n
  5              FROM dual
  6              CONNECT BY level <= 1E3)
  7  SELECT rownum AS id,
  8         mod(rownum,2) AS n_2,
  9         mod(rownum,4) AS n_4,
 10         mod(rownum,8) AS n_8,
 11         mod(rownum,16) AS n_16,
 12         mod(rownum,32) AS n_32,
 13         mod(rownum,64) AS n_64,
 14         mod(rownum,128) AS n_128,
 15         mod(rownum,256) AS n_256,
 16         mod(rownum,512) AS n_512,
 17         mod(rownum,1024) AS n_1024,
 18         mod(rownum,2048) AS n_2048,
 19         mod(rownum,4096) AS n_4096,
 20         mod(rownum,8192) AS n_8192,
 21         mod(rownum,16384) AS n_16384,
 22         mod(rownum,32768) AS n_32768,
 23         mod(rownum,65536) AS n_65536,
 24         mod(rownum,131072) AS n_131072,
 25         mod(rownum,262144) AS n_262144,
 26         mod(rownum,524288) AS n_524288,
 27         mod(rownum,1048576) AS n_1048576,
 28         mod(rownum,2097152) AS n_2097152,
 29         mod(rownum,4194304) AS n_4194304,
 30         mod(rownum,8388608) AS n_8388608,
 31         mod(rownum,16777216) AS n_16777216
 32  FROM t1000, t1000, t1000
 33  WHERE rownum <= 1E8;

SQL> execute dbms_stats.gather_table_stats(user,'T')

Then, for every column, I ran two queries and measured the elapsed time, the maximum amount of PGA used by the query, and the precision of the result. Note that the test case was designed to avoid the usage of a temporary segment. In other words, all data required for the aggregation was stored into the PGA. As a result, the processing was CPU bound.

SELECT /*+ no_parallel */ count(DISTINCT n_2) FROM t
SELECT /*+ no_parallel */ approx_count_distinct(n_2) FROM t

Let’s have a look to three charts summarizing how well the APPROX_COUNT_DISTINCT function works:

  • The first chart shows that for both functions the elapsed time depends on the number of distinct values. That said, while for the APPROX_COUNT_DISTINCT function the impact of the number of distinct values is minimal (the elapsed time goes from 5.8 up to 7.1 seconds), for the COUNT function it’s much higher (the elapsed time goes from 10.5 up to 35.2 seconds).

Elapsed Time

  • The second chart (be careful that the y-axis scale is logarithmic) shows that only for the COUNT function the maximum amount of PGA depends on the number of distinct values (from 10MB up to 807MB). In fact, for the APPROX_COUNT_DISTINCT function, it was exactly the same (10MB) in all situations.

Elapsed Time

  • The third chart shows that the precision of the APPROX_COUNT_DISTINCT function stays between -4% and +4%.

Elapsed Time

According to this test case, in my opinion, the performance and accuracy of the APPROX_COUNT_DISTINCT function with numerical values (I still have to test other data types) are good. Hence, I see no reason for not using it when an estimate of the number of distinct values is enough.

5 Comments

  1. October 23, 2014    

    Wow! Very interesting! Thanks for publishing this, Chris.

  2. Koltai Róbert Koltai Róbert
    October 27, 2014    

    Hi Chris!

    Thanks for the interesting article!
    One question: How did you make sure (“design the test case”) that no temp was used?
    How did you check it?

    Thanks,
    Rob.

    • October 27, 2014    

      Hi Rob

      I ran the queries and then checked, with a query like the following one, whether the foreground process experienced temp-related waits (e.g. “direct path write temp” and “direct path read temp”):

      SELECT event, time_waited_micro
      FROM v$session_event
      WHERE sid=sys_context(‘userenv’,’sid’)

      Cheers,
      Chris

  3. Praveen Praveen
    April 30, 2015    

    How does APPROX_COUNT_DISTINCT take less time for processing than normal count distinct? What is the processing happening behind APPROX_COUNT_DISTINCT which makes it more efficient?

No Pings Yet

  1. King Training Resources on May 24, 2015 at 03:21
  2. APPROX_COUNT_DISTINCT | Emrah METE on August 10, 2015 at 11:43

Leave a Reply

Your email address will not be published. Required fields are marked *


Time limit is exhausted. Please reload CAPTCHA.