Striving for Optimal Performance
  • Blog
    • Archive
    • Categories
    • Search
  • Troubleshooting Oracle Performance
    • Description
    • Structure
    • Table of Contents
    • Forewords
    • Reviews
    • Downloadable Files
    • Addenda and Errata
    • First Edition
  • Public Appearances
    • Past Public Appearances
  • Contact
  • About

The APPROX_COUNT_DISTINCT Function – A Test Case

23 October 2014 9 Comments Written by Christian Antognini

The aim of this post is not to explain how the APPROX_COUNT_DISTINCT function works (you find basic information in the documentation and in this post written by Luca Canali), but to show you the results of a test case I run to assess how well it works.

Here’s what I did…

I created a table with several numerical columns (the name of the column shows how many distinct values it contains), loaded 100 million rows into it (the size of the segment is 12.7 GB), and gathered the object statistics.

SQL> CREATE TABLE t
  2  AS
  3  WITH
  4    t1000 AS (SELECT /*+ materialize */ rownum AS n
  5              FROM dual
  6              CONNECT BY level <= 1E3)
  7  SELECT rownum AS id,
  8         mod(rownum,2) AS n_2,
  9         mod(rownum,4) AS n_4,
 10         mod(rownum,8) AS n_8,
 11         mod(rownum,16) AS n_16,
 12         mod(rownum,32) AS n_32,
 13         mod(rownum,64) AS n_64,
 14         mod(rownum,128) AS n_128,
 15         mod(rownum,256) AS n_256,
 16         mod(rownum,512) AS n_512,
 17         mod(rownum,1024) AS n_1024,
 18         mod(rownum,2048) AS n_2048,
 19         mod(rownum,4096) AS n_4096,
 20         mod(rownum,8192) AS n_8192,
 21         mod(rownum,16384) AS n_16384,
 22         mod(rownum,32768) AS n_32768,
 23         mod(rownum,65536) AS n_65536,
 24         mod(rownum,131072) AS n_131072,
 25         mod(rownum,262144) AS n_262144,
 26         mod(rownum,524288) AS n_524288,
 27         mod(rownum,1048576) AS n_1048576,
 28         mod(rownum,2097152) AS n_2097152,
 29         mod(rownum,4194304) AS n_4194304,
 30         mod(rownum,8388608) AS n_8388608,
 31         mod(rownum,16777216) AS n_16777216
 32  FROM t1000, t1000, t1000
 33  WHERE rownum <= 1E8;

SQL> execute dbms_stats.gather_table_stats(user,'T')

Then, for every column, I ran two queries and measured the elapsed time, the maximum amount of PGA used by the query, and the precision of the result. Note that the test case was designed to avoid the usage of a temporary segment. In other words, all data required for the aggregation was stored into the PGA. As a result, the processing was CPU bound.

SELECT /*+ no_parallel */ count(DISTINCT n_2) FROM t
SELECT /*+ no_parallel */ approx_count_distinct(n_2) FROM t

Let’s have a look to three charts summarizing how well the APPROX_COUNT_DISTINCT function works:

  • The first chart shows that for both functions the elapsed time depends on the number of distinct values. That said, while for the APPROX_COUNT_DISTINCT function the impact of the number of distinct values is minimal (the elapsed time goes from 5.8 up to 7.1 seconds), for the COUNT function it’s much higher (the elapsed time goes from 10.5 up to 35.2 seconds).

Elapsed Time

  • The second chart (be careful that the y-axis scale is logarithmic) shows that only for the COUNT function the maximum amount of PGA depends on the number of distinct values (from 10MB up to 807MB). In fact, for the APPROX_COUNT_DISTINCT function, it was exactly the same (10MB) in all situations.

Elapsed Time

  • The third chart shows that the precision of the APPROX_COUNT_DISTINCT function stays between -4% and +4%.

Elapsed Time

According to this test case, in my opinion, the performance and accuracy of the APPROX_COUNT_DISTINCT function with numerical values (I still have to test other data types) are good. Hence, I see no reason for not using it when an estimate of the number of distinct values is enough.

12.1.0.2, Approximate Query Processing
RESULT_CACHE Hint in 12.1.0.2
About Oracle’s Code of Ethics, MOS Confidentiality Clause, and Content Stolen from My Website

9 Comments

2 Pings/Trackbacks

  1. Noons Noons
    23 October 2014    

    Wow! Very interesting! Thanks for publishing this, Chris.

    Reply
  2. Koltai Róbert Koltai Róbert
    27 October 2014    

    Hi Chris!

    Thanks for the interesting article!
    One question: How did you make sure (“design the test case”) that no temp was used?
    How did you check it?

    Thanks,
    Rob.

    Reply
    • Christian Antognini Christian Antognini
      27 October 2014    

      Hi Rob

      I ran the queries and then checked, with a query like the following one, whether the foreground process experienced temp-related waits (e.g. “direct path write temp” and “direct path read temp”):

      SELECT event, time_waited_micro
      FROM v$session_event
      WHERE sid=sys_context(‘userenv’,’sid’)

      Cheers,
      Chris

      Reply
  3. Praveen Praveen
    30 April 2015    

    How does APPROX_COUNT_DISTINCT take less time for processing than normal count distinct? What is the processing happening behind APPROX_COUNT_DISTINCT which makes it more efficient?

    Reply
    • Christian Antognini Christian Antognini
      1 May 2015    

      Hi Praveen

      It implements HyperLogLog. For detailed information have a look to the paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm.

      Best,
      Chris

      Reply
  4. King Training Resources King Training Resources
    24 May 2015    

    […] provides an approximate value without actually processing all of the database rows (Oak Table Network member and Oracle Ace Director Christian Antognini has documented consistent accuracy of plus-or-minus 4% with considerable performance […]

  5. APPROX_COUNT_DISTINCT | Emrah METE APPROX_COUNT_DISTINCT | Emrah METE
    10 August 2015    

    […] http://antognini.ch/2014/10/the-approx_count_distinct-function-a-test-case/ […]

  6. Vikram Rathour Vikram Rathour
    20 December 2019    

    I get the following error with APPROX_COUNT_DISTINCT(). This doe snot occur with COUNT(DISTINCT …)

    ORA-04036: PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT
    04036. 00000 – “PGA memory used by the instance exceeds PGA_AGGREGATE_LIMIT”
    *Cause: Private memory across the instance exceeded the limit specified
    in the PGA_AGGREGATE_LIMIT initialization parameter. The largest
    sessions using Program Global Area (PGA) memory were interrupted
    to get under the limit.
    *Action: Increase the PGA_AGGREGATE_LIMIT initialization parameter or reduce
    memory usage.

    Reply
    • Christian Antognini Christian Antognini
      20 December 2019    

      Looks buggy… What’s the PGA consumption for both cases?

      Reply
  1. King Training Resources on 24 May 2015 at 03:21
  2. APPROX_COUNT_DISTINCT | Emrah METE on 10 August 2015 at 11:43

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.