Jan 26 2010

Does the Query Optimizer Cost PX Distribution Methods?

Tag: 10gR1, 10gR2, 11gR1, 11gR2, 9iR2, Parallel Processing, Query OptimizerChristian Antognini @ 12:55 pm

The short answer to this question is “yes”, it does. Unfortunately, the distribution costs are not externalized through the execution plans and, as a result, this limitation (yes, it is really a limitation in the current implementation, not a bug) confuses everyone that carefully look at the information provided in an execution plan of a SQL statement executed in parallel. Hence, let’s remove some confusion…

To illustrate what the problem is, let’s have a look to a simple query that joins two tables:

SELECT * FROM master m JOIN detail d ON (m.id = d.id)

Now, let’s have a look at two parallel executions. If the two tables are equipartitioned, the following execution plan (which takes advantage of partition-wise join) is probably the most effective for such a query. Note that thanks to the partition-wise join not only there is a single set of parallel slaves (Q1,00), but, in addition, the parallel slaves do not communicate with each other (they only communicate with the query coordinator). As a result, the communication costs are equal to zero (this is because the query optimizer does not compute the costs of the communication towards the query coordinator).

----------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Bytes | Cost (%CPU)|    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |    16G| 162524  (1)|        |      |            |
|   1 |  PX COORDINATOR         |          |       |            |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10000 |    16G| 162524  (1)|  Q1,00 | P->S | QC (RAND)  |
|   3 |    PX PARTITION HASH ALL|          |    16G| 162524  (1)|  Q1,00 | PCWC |            |
|   4 |     HASH JOIN           |          |    16G| 162524  (1)|  Q1,00 | PCWP |            |
|   5 |      TABLE ACCESS FULL  | MASTER   |   125M|   1422  (1)|  Q1,00 | PCWP |            |
|   6 |      TABLE ACCESS FULL  | DETAIL   |    15G| 161052  (1)|  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------

If the two tables are not equipartitioned, the following execution plan might be chosen by the query optimizer. Since it does not take advantage of a partition-wise join, several set of parallel slaves are used. The first one (Q1,00) scans the MASTER table, the second one (Q1,01) scans the DETAIL table, and both of them send the data to the third one (Q1,02) that performs the join of the two tables and sends the data to the query coordinator. Since all data (about 15GB; yes, the estimations are good) is sent through the PX channels, the cost should not be zero. However, as you can see, the cost is exactly the same as the one of the previous execution plan.

----------------------------------------------------------------------------------------------
| Id  | Operation               | Name     | Bytes | Cost (%CPU)|    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT        |          |    16G| 162524  (1)|        |      |            |
|   1 |  PX COORDINATOR         |          |       |            |        |      |            |
|   2 |   PX SEND QC (RANDOM)   | :TQ10002 |    16G| 162524  (1)|  Q1,02 | P->S | QC (RAND)  |
|   3 |    HASH JOIN BUFFERED   |          |    16G| 162524  (1)|  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE          |          |   125M|   1422  (1)|  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH       | :TQ10000 |   125M|   1422  (1)|  Q1,00 | P->P | HASH       |
|   6 |       PX BLOCK ITERATOR |          |   125M|   1422  (1)|  Q1,00 | PCWC |            |
|   7 |        TABLE ACCESS FULL| MASTER   |   125M|   1422  (1)|  Q1,00 | PCWP |            |
|   8 |     PX RECEIVE          |          |    15G| 161052  (1)|  Q1,02 | PCWP |            |
|   9 |      PX SEND HASH       | :TQ10001 |    15G| 161052  (1)|  Q1,01 | P->P | HASH       |
|  10 |       PX BLOCK ITERATOR |          |    15G| 161052  (1)|  Q1,01 | PCWC |            |
|  11 |        TABLE ACCESS FULL| DETAIL   |    15G| 161052  (1)|  Q1,01 | PCWP |            |
----------------------------------------------------------------------------------------------

For completeness, let’s compare the cost of several distribution methods (“none-none” is the one of the first execution plan above, “hash-hash” of the second one). As you can see the cost is always the same!

SQL> EXPLAIN PLAN SET STATEMENT_ID 'none-none' FOR SELECT /*+ pq_distribute(d none none) */ * FROM master m JOIN detail d ON (m.id = d.id);
SQL> EXPLAIN PLAN SET STATEMENT_ID 'hash-hash' FOR SELECT /*+ pq_distribute(d hash hash) */ * FROM master m JOIN detail d ON (m.id = d.id);
SQL> EXPLAIN PLAN SET STATEMENT_ID 'broadcast-none' FOR SELECT /*+ pq_distribute(d broadcast none) */ * FROM master m JOIN detail d ON (m.id = d.id);
SQL> EXPLAIN PLAN SET STATEMENT_ID 'none-broadcast' FOR SELECT /*+ pq_distribute(d none broadcast) */ * FROM master m JOIN detail d ON (m.id = d.id);
SQL> EXPLAIN PLAN SET STATEMENT_ID 'partition-none' FOR SELECT /*+ pq_distribute(d partition none) */ * FROM master m JOIN detail d ON (m.id = d.id);
SQL> EXPLAIN PLAN SET STATEMENT_ID 'none-partition' FOR SELECT /*+ pq_distribute(d none partition) */ * FROM master m JOIN detail d ON (m.id = d.id);
SQL> SELECT statement_id, cost FROM plan_table WHERE id = 0;

STATEMENT_ID                         COST
------------------------------ ----------
none-none                          162524
hash-hash                          162524
broadcast-none                     162524
none-broadcast                     162524
partition-none                     162524
none-partition                     162524

As I wrote before, the problem is not that the costs are not computed. The problem is that they are not externalized. In fact, by giving a look to a trace file generated through the event 10053 the costs are available. Here’s the relevant part (the lines starting with “---- cost” contain the most important information). As you can see there are two costs associated with every distribution method: one with the distribution costs (w/ dist) and one without them (w/o dist).

Enumerating distribution method for join between M[MASTER] and D[DETAIL]
-- Using join method #Hash Join:
---- cost NONE = 0.00
  Outer table:  MASTER  Alias: M
    resc: 5120.11  card 4118000.00  bytes: 32  deg: 4  resp: 1422.25
  Inner table:  DETAIL  Alias: D
    resc: 579787.63  card: 31954000.00  bytes: 526  deg: 4  resp: 161052.12
    using dmeth: 513  #groups: 1
    Cost per ptn: 49.68  #ptns: 4
    hash_area: 16384 (max=16384) buildfrag: 5530  probefrag: 524636  ppasses: 1
      buildfrag: 5530  probefrag: 524636  passes: 1
  Hash join: Resc: 585106.45  Resp: 162524.05  [multiMatchCost=0.00]
---- cost(Hash Join) = 162524.05 (w/o dist), 162524.05 (w/ dist)
---- cost VALUE = 278.52
---- cost with slave mapping  =   Outer table:  MASTER  Alias: M
    resc: 5120.11  card 4118000.00  bytes: 32  deg: 4  resp: 1422.25
  Inner table:  DETAIL  Alias: D
    resc: 579787.63  card: 31954000.00  bytes: 526  deg: 4  resp: 161052.12
    using dmeth: 2  #groups: 1
    Cost per ptn: 49.68  #ptns: 4
    hash_area: 16384 (max=16384) buildfrag: 5530  probefrag: 524636  ppasses: 1
      buildfrag: 5530  probefrag: 524636  passes: 1
  Hash join: Resc: 585106.45  Resp: 162524.05  [multiMatchCost=0.00]
---- cost(Hash Join) = 162524.05 (w/o dist), 162802.57 (w/ dist)
---- cost PARTITION-RIGHT = 271.40
  Outer table:  MASTER  Alias: M
    resc: 5120.11  card 4118000.00  bytes: 32  deg: 4  resp: 1422.25
  Inner table:  DETAIL  Alias: D
    resc: 579787.63  card: 31954000.00  bytes: 526  deg: 4  resp: 161052.12
    using dmeth: 576  #groups: 1
    Cost per ptn: 49.68  #ptns: 4
    hash_area: 16384 (max=16384) buildfrag: 5530  probefrag: 524636  ppasses: 1
      buildfrag: 5530  probefrag: 524636  passes: 1
  Hash join: Resc: 585106.45  Resp: 162524.05  [multiMatchCost=0.00]
---- cost(Hash Join) = 162524.05 (w/o dist), 162795.46 (w/ dist)
---- cost PARTITION-LEFT = 7.12
  Outer table:  MASTER  Alias: M
    resc: 5120.11  card 4118000.00  bytes: 32  deg: 4  resp: 1422.25
  Inner table:  DETAIL  Alias: D
    resc: 579787.63  card: 31954000.00  bytes: 526  deg: 4  resp: 161052.12
    using dmeth: 544  #groups: 1
    Cost per ptn: 49.68  #ptns: 4
    hash_area: 16384 (max=16384) buildfrag: 5530  probefrag: 524636  ppasses: 1
      buildfrag: 5530  probefrag: 524636  passes: 1
  Hash join: Resc: 585106.45  Resp: 162524.05  [multiMatchCost=0.00]
---- cost(Hash Join) = 162524.05 (w/o dist), 162531.17 (w/ dist)
---- cost BROADCAST-RIGHT = 920.78
---- cost with slave mapping  =   Outer table:  MASTER  Alias: M
    resc: 5120.11  card 4118000.00  bytes: 32  deg: 4  resp: 1422.25
  Inner table:  DETAIL  Alias: D
    resc: 579787.63  card: 31954000.00  bytes: 526  deg: 4  resp: 161052.12
    using dmeth: 8  #groups: 4
    Cost per ptn: 49.68  #ptns: 4
    hash_area: 16384 (max=16384) buildfrag: 5530  probefrag: 524636  ppasses: 1
      buildfrag: 5530  probefrag: 524636  passes: 1
  Hash join: Resc: 585106.45  Resp: 162524.05  [multiMatchCost=0.00]
---- cost(Hash Join) = 162524.05 (w/o dist), 162755.25 (w/ dist)
---- cost BROADCAST-LEFT = 7.22
---- cost with slave mapping  =   Outer table:  MASTER  Alias: M
    resc: 5120.11  card 4118000.00  bytes: 32  deg: 4  resp: 1422.25
  Inner table:  DETAIL  Alias: D
    resc: 579787.63  card: 31954000.00  bytes: 526  deg: 4  resp: 161052.12
    using dmeth: 16  #groups: 4
    Cost per ptn: 49.68  #ptns: 4
    hash_area: 16384 (max=16384) buildfrag: 5530  probefrag: 524636  ppasses: 1
      buildfrag: 5530  probefrag: 524636  passes: 1
  Hash join: Resc: 585106.45  Resp: 162524.05  [multiMatchCost=0.00]
---- cost(Hash Join) = 162524.05 (w/o dist), 162526.86 (w/ dist)

Since the cost are (correctly) computed, the query optimizer is able to choose the optimal plan. However, it would be nice to have the actual costs in the execution plans.


Aug 12 2009

System Managed Extent Size – 11g Improvements

Tag: 11gR1, Parallel ProcessingChristian Antognini @ 2:20 pm

Oracle Database provides two extent management options: locally managed and dictionary managed. Today, when creating a new database, most DBAs use the former for all tablespaces. This is a good thing. What is less obvious is the choice between uniform extent size and system managed extent size. Both of these options have pros and cons. The tendency is to use uniform extent size only when the size of the segments to be stored in a tablespace is (well) known. Otherwise, it is better to delegate the choice of an “optimal” extent size to the database engine. In fact, the aim of system managed extent size is to adjust the size of the extents to the size of the segments to which they are associated. In other words, a large segment should have large extents, a small segment should have small extents. In this way it should be possible to avoid both fragmentation and very high number of extents. To do so, the size of the extents is normalized to a limited number of sizes: 64KB, 1MB, 8MB or 64MB.

Now, let’s discuss two enhancements that have been (silently?) introduced in 11.1.0.7… Yes, in 11.1.0.6 they are not available!

But, before that, let’s create a tablespace and a view that will be used for the tests I’ll show you…

SQL> CREATE TABLESPACE test
  2  DATAFILE '/u01/oradata/DBM11106/test01.dbf' SIZE 1G
  3  BLOCKSIZE 8K
  4  EXTENT MANAGEMENT LOCAL AUTOALLOCATE
  5  SEGMENT SPACE MANAGEMENT AUTO;

SQL> CREATE OR REPLACE VIEW v AS
  2  SELECT bytes, extent_id-lag(extent_id,1,-1) OVER (ORDER BY extent_id) AS extents
  3  FROM (
  4    SELECT bytes, extent_id, decode(bytes,lead(bytes) OVER (ORDER BY extent_id),0,1) AS last
  5    FROM user_extents
  6    WHERE segment_name = 'T'
  7  )
  8  WHERE last = 1
  9  ORDER BY extent_id;

Improvement #1

Up to 11.1.0.6, INITIAL (the attribute that can be specified in the storage clause) is only partially considered when creating a new segment. In fact, as the following example shows, even if a segment of 1MB is created, sixteen extents of 64KB are associated to it.

SQL> CREATE TABLE t (id NUMBER, pad VARCHAR2(1000))
  2  TABLESPACE test
  3  STORAGE (INITIAL 1M);

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
     65536         16

One might think that INITIAL is not considered at all. However, it is easy to show that it’s actually used. To do so it’s sufficient to create a segment of 2MB. As the following example shows, two extents of 1MB are associated to it. In other words, even if the value of INITIAL is considered, the size of the first extent is not directly derived from it.

SQL> CREATE TABLE t (id NUMBER, pad VARCHAR2(1000))
  2  TABLESPACE test
  3  STORAGE (INITIAL 2M);

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
   1048576          2

As of 11.1.0.7, however, the size of the first extent is directly derived from INITIAL. The only and obvious limitation is that the extent size is normalized to either 64KB, 1MB, 8MB or 64MB. To do so the database engine uses the extent size which is equal or smaller to INITIAL. Therefore, when the same SQL statements as before are executed, for a segment of 1MB only one extent of 1MB is associated to it.

SQL> CREATE TABLE t (id NUMBER, pad VARCHAR2(1000))
  2  TABLESPACE test
  3  STORAGE (INITIAL 1M);

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
   1048576          1

A curios side effect of this new behavior is that it’s much common to see segments for which the first extent is larger than the second one. Such situations are not new… but, they are much more common than before. The following example illustrates this. Notice that, for a 10MB segment, the first extent is 8MB and the subsequent two are only 1MB each.

SQL> CREATE TABLE t (id NUMBER, pad VARCHAR2(1000))
  2  TABLESPACE test
  3  STORAGE (INITIAL 10M);

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
   8388608          1
   1048576          2

In summary, as of 11.1.0.7, if you set INITIAL to either 64KB, 1MB, 8MB or 64MB, the first extent will have the specified size. So, if you know what the size of an segment will be, it is not bad to provide a sensible value through INITIAL.

Improvement #2

As you might guess, the second improvement is related to another attribute that can be specified in the storage clause: NEXT. The improvement, however, is a not what you might expect in first place. In fact, NEXT is not simply used to specify the size of any extent allocated after the first one. According to my observations NEXT is only used when a parallel load is performed.

Up to 11.1.0.6, when a parallel load is performed, every parallel slave allocates extents from a temporary segment and loads the data into it. In other words, the parallel slaves do not modify the target table. Then, when the transaction is committed, the temporary segment is merged to the target table. The following example illustrates this behavior. Notice three things. First, NEXT is set to 1MB. Second, the insert loads about 10MB of data. Third, I forced the utilization of two parallel slaves. Since each parallel slave allocated its own extents, every parallel slave allocates sixteen 64KB extents, four 1MB extents and one 832KB extent (for a total of about 6MB per parallel slave). Simply put, the extent allocation for the temporary segment follows the rules we are used to.

SQL> CREATE TABLE t (id NUMBER, pad VARCHAR2(1000))
  2  TABLESPACE test
  3  STORAGE (NEXT 1M);

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
     65536          1

SQL> ALTER SESSION FORCE PARALLEL DML PARALLEL 2;

SQL> INSERT INTO t
  2  SELECT rownum, rpad('*',1000,'*')
  3  FROM dual
  4  CONNECT BY level <= 10000;

SQL> SELECT bytes, extent_id-lag(extent_id,1,-1) OVER (ORDER BY extent_id) AS extents
  2  FROM (
  3    SELECT bytes, extent_id, decode(bytes,lead(bytes) OVER (ORDER BY extent_id),0,1) AS last
  4    FROM user_extents
  5    WHERE segment_type = 'TEMPORARY'
  6  )
  7  WHERE last = 1
  8  ORDER BY extent_id;

     BYTES    EXTENTS
---------- ----------
     65536         16
   1048576          4
    851968          1
     65536         16
   1048576          4
    851968          1

SQL> COMMIT;

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
     65536         17
   1048576          4
    851968          1
     65536         16
   1048576          4
    851968          1

As of 11.1.0.7, NEXT is used for sizing the extents of the temporary segment. As a result, if you know that are loading a lot of data, you can avoid the small extents (that might lead to suboptimal performance when the data will be accessed…). Let’s re-execute the same SQL statement as before to see what the difference is. Notice that the only extent of 64KB is the original one that was initially associated to the table.

SQL> CREATE TABLE t (id NUMBER, pad VARCHAR2(1000))
  2  TABLESPACE test
  3  STORAGE (NEXT 1M);

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
     65536          1

SQL> ALTER SESSION FORCE PARALLEL DML PARALLEL 2;

SQL> INSERT INTO t
  2  SELECT rownum, rpad('*',1000,'*')
  3  FROM dual
  4  CONNECT BY level <= 10000;

SQL> SELECT bytes, extent_id-lag(extent_id,1,-1) OVER (ORDER BY extent_id) AS extents
  2  FROM (
  3    SELECT bytes, extent_id, decode(bytes,lead(bytes) OVER (ORDER BY extent_id),0,1) AS last
  4    FROM user_extents
  5    WHERE segment_type = 'TEMPORARY'
  6  )
  7  WHERE last = 1
  8  ORDER BY extent_id;

     BYTES    EXTENTS
---------- ----------
   1048576          5
    786432          1
   1048576          5
    786432          1

SQL> COMMIT;

SQL> SELECT * FROM v;

     BYTES    EXTENTS
---------- ----------
     65536          1
   1048576          5
    786432          1
   1048576          5
    786432          1

In summary, as of 11.1.0.7, with NEXT you can control the sizing of the temporary extents used during parallel loads.


Sep 26 2008

Bloom Filters

Tag: 10gR2, 11gR1, Parallel Processing, Partitioning, Query OptimizerChristian Antognini @ 6:24 pm

Oracle Database, as of 10g Release 2, uses bloom filters in various situations. Unfortunately, no information about their usage is available in Oracle documentation. For this reason, I decided to write a paper to explain not only what bloom filters are, but also, and foremost, to describe how Oracle Database makes use of them. Specifically, to explain how the database engine uses bloom filters to reduce data communication between slave processes in parallel joins and to implement join-filter pruning.

Originally, I wrote this paper for the IOUG Select Journal. Even if I wrote it last June, I wanted to receive the printed copies before putting it online. Today a packet with five copies of the Q4/2008 issue and a polo shirt arrived… Hence, it is now available online as well.

If you don’t have access to Select Journal, you can download it from this page.