The short answer to this question is “yes”, it does. Unfortunately, the distribution costs are not externalized through the execution plans and, as a result, this limitation (yes, it is really a limitation in the current implementation, not a bug) confuses everyone that carefully look at the information provided in an execution plan of a SQL statement executed in parallel. Hence, let’s remove some confusion…
To illustrate what the problem is, let’s have a look to a simple query that joins two tables:
SELECT * FROM master m JOIN detail d ON (m.id = d.id)
Now, let’s have a look at two parallel executions. If the two tables are equipartitioned, the following execution plan (which takes advantage of partition-wise join) is probably the most effective for such a query. Note that thanks to the partition-wise join not only there is a single set of parallel slaves (Q1,00), but, in addition, the parallel slaves do not communicate with each other (they only communicate with the query coordinator). As a result, the communication costs are equal to zero (this is because the query optimizer does not compute the costs of the communication towards the query coordinator).
---------------------------------------------------------------------------------------------- | Id | Operation | Name | Bytes | Cost (%CPU)| TQ |IN-OUT| PQ Distrib | ---------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 16G| 162524 (1)| | | | | 1 | PX COORDINATOR | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 16G| 162524 (1)| Q1,00 | P->S | QC (RAND) | | 3 | PX PARTITION HASH ALL| | 16G| 162524 (1)| Q1,00 | PCWC | | | 4 | HASH JOIN | | 16G| 162524 (1)| Q1,00 | PCWP | | | 5 | TABLE ACCESS FULL | MASTER | 125M| 1422 (1)| Q1,00 | PCWP | | | 6 | TABLE ACCESS FULL | DETAIL | 15G| 161052 (1)| Q1,00 | PCWP | | ----------------------------------------------------------------------------------------------
If the two tables are not equipartitioned, the following execution plan might be chosen by the query optimizer. Since it does not take advantage of a partition-wise join, several set of parallel slaves are used. The first one (Q1,00) scans the MASTER table, the second one (Q1,01) scans the DETAIL table, and both of them send the data to the third one (Q1,02) that performs the join of the two tables and sends the data to the query coordinator. Since all data (about 15GB; yes, the estimations are good) is sent through the PX channels, the cost should not be zero. However, as you can see, the cost is exactly the same as the one of the previous execution plan.
---------------------------------------------------------------------------------------------- | Id | Operation | Name | Bytes | Cost (%CPU)| TQ |IN-OUT| PQ Distrib | ---------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 16G| 162524 (1)| | | | | 1 | PX COORDINATOR | | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10002 | 16G| 162524 (1)| Q1,02 | P->S | QC (RAND) | | 3 | HASH JOIN BUFFERED | | 16G| 162524 (1)| Q1,02 | PCWP | | | 4 | PX RECEIVE | | 125M| 1422 (1)| Q1,02 | PCWP | | | 5 | PX SEND HASH | :TQ10000 | 125M| 1422 (1)| Q1,00 | P->P | HASH | | 6 | PX BLOCK ITERATOR | | 125M| 1422 (1)| Q1,00 | PCWC | | | 7 | TABLE ACCESS FULL| MASTER | 125M| 1422 (1)| Q1,00 | PCWP | | | 8 | PX RECEIVE | | 15G| 161052 (1)| Q1,02 | PCWP | | | 9 | PX SEND HASH | :TQ10001 | 15G| 161052 (1)| Q1,01 | P->P | HASH | | 10 | PX BLOCK ITERATOR | | 15G| 161052 (1)| Q1,01 | PCWC | | | 11 | TABLE ACCESS FULL| DETAIL | 15G| 161052 (1)| Q1,01 | PCWP | | ----------------------------------------------------------------------------------------------
For completeness, let’s compare the cost of several distribution methods (“none-none” is the one of the first execution plan above, “hash-hash” of the second one). As you can see the cost is always the same!
SQL> EXPLAIN PLAN SET STATEMENT_ID 'none-none' FOR SELECT /*+ pq_distribute(d none none) */ * FROM master m JOIN detail d ON (m.id = d.id); SQL> EXPLAIN PLAN SET STATEMENT_ID 'hash-hash' FOR SELECT /*+ pq_distribute(d hash hash) */ * FROM master m JOIN detail d ON (m.id = d.id); SQL> EXPLAIN PLAN SET STATEMENT_ID 'broadcast-none' FOR SELECT /*+ pq_distribute(d broadcast none) */ * FROM master m JOIN detail d ON (m.id = d.id); SQL> EXPLAIN PLAN SET STATEMENT_ID 'none-broadcast' FOR SELECT /*+ pq_distribute(d none broadcast) */ * FROM master m JOIN detail d ON (m.id = d.id); SQL> EXPLAIN PLAN SET STATEMENT_ID 'partition-none' FOR SELECT /*+ pq_distribute(d partition none) */ * FROM master m JOIN detail d ON (m.id = d.id); SQL> EXPLAIN PLAN SET STATEMENT_ID 'none-partition' FOR SELECT /*+ pq_distribute(d none partition) */ * FROM master m JOIN detail d ON (m.id = d.id); SQL> SELECT statement_id, cost FROM plan_table WHERE id = 0; STATEMENT_ID COST ------------------------------ ---------- none-none 162524 hash-hash 162524 broadcast-none 162524 none-broadcast 162524 partition-none 162524 none-partition 162524
As I wrote before, the problem is not that the costs are not computed. The problem is that they are not externalized. In fact, by giving a look to a trace file generated through the event 10053 the costs are available. Here’s the relevant part (the lines starting with “---- cost
” contain the most important information). As you can see there are two costs associated with every distribution method: one with the distribution costs (w/ dist) and one without them (w/o dist).
Enumerating distribution method for join between M[MASTER] and D[DETAIL] -- Using join method #Hash Join: ---- cost NONE = 0.00 Outer table: MASTER Alias: M resc: 5120.11 card 4118000.00 bytes: 32 deg: 4 resp: 1422.25 Inner table: DETAIL Alias: D resc: 579787.63 card: 31954000.00 bytes: 526 deg: 4 resp: 161052.12 using dmeth: 513 #groups: 1 Cost per ptn: 49.68 #ptns: 4 hash_area: 16384 (max=16384) buildfrag: 5530 probefrag: 524636 ppasses: 1 buildfrag: 5530 probefrag: 524636 passes: 1 Hash join: Resc: 585106.45 Resp: 162524.05 [multiMatchCost=0.00] ---- cost(Hash Join) = 162524.05 (w/o dist), 162524.05 (w/ dist) ---- cost VALUE = 278.52 ---- cost with slave mapping = Outer table: MASTER Alias: M resc: 5120.11 card 4118000.00 bytes: 32 deg: 4 resp: 1422.25 Inner table: DETAIL Alias: D resc: 579787.63 card: 31954000.00 bytes: 526 deg: 4 resp: 161052.12 using dmeth: 2 #groups: 1 Cost per ptn: 49.68 #ptns: 4 hash_area: 16384 (max=16384) buildfrag: 5530 probefrag: 524636 ppasses: 1 buildfrag: 5530 probefrag: 524636 passes: 1 Hash join: Resc: 585106.45 Resp: 162524.05 [multiMatchCost=0.00] ---- cost(Hash Join) = 162524.05 (w/o dist), 162802.57 (w/ dist) ---- cost PARTITION-RIGHT = 271.40 Outer table: MASTER Alias: M resc: 5120.11 card 4118000.00 bytes: 32 deg: 4 resp: 1422.25 Inner table: DETAIL Alias: D resc: 579787.63 card: 31954000.00 bytes: 526 deg: 4 resp: 161052.12 using dmeth: 576 #groups: 1 Cost per ptn: 49.68 #ptns: 4 hash_area: 16384 (max=16384) buildfrag: 5530 probefrag: 524636 ppasses: 1 buildfrag: 5530 probefrag: 524636 passes: 1 Hash join: Resc: 585106.45 Resp: 162524.05 [multiMatchCost=0.00] ---- cost(Hash Join) = 162524.05 (w/o dist), 162795.46 (w/ dist) ---- cost PARTITION-LEFT = 7.12 Outer table: MASTER Alias: M resc: 5120.11 card 4118000.00 bytes: 32 deg: 4 resp: 1422.25 Inner table: DETAIL Alias: D resc: 579787.63 card: 31954000.00 bytes: 526 deg: 4 resp: 161052.12 using dmeth: 544 #groups: 1 Cost per ptn: 49.68 #ptns: 4 hash_area: 16384 (max=16384) buildfrag: 5530 probefrag: 524636 ppasses: 1 buildfrag: 5530 probefrag: 524636 passes: 1 Hash join: Resc: 585106.45 Resp: 162524.05 [multiMatchCost=0.00] ---- cost(Hash Join) = 162524.05 (w/o dist), 162531.17 (w/ dist) ---- cost BROADCAST-RIGHT = 920.78 ---- cost with slave mapping = Outer table: MASTER Alias: M resc: 5120.11 card 4118000.00 bytes: 32 deg: 4 resp: 1422.25 Inner table: DETAIL Alias: D resc: 579787.63 card: 31954000.00 bytes: 526 deg: 4 resp: 161052.12 using dmeth: 8 #groups: 4 Cost per ptn: 49.68 #ptns: 4 hash_area: 16384 (max=16384) buildfrag: 5530 probefrag: 524636 ppasses: 1 buildfrag: 5530 probefrag: 524636 passes: 1 Hash join: Resc: 585106.45 Resp: 162524.05 [multiMatchCost=0.00] ---- cost(Hash Join) = 162524.05 (w/o dist), 162755.25 (w/ dist) ---- cost BROADCAST-LEFT = 7.22 ---- cost with slave mapping = Outer table: MASTER Alias: M resc: 5120.11 card 4118000.00 bytes: 32 deg: 4 resp: 1422.25 Inner table: DETAIL Alias: D resc: 579787.63 card: 31954000.00 bytes: 526 deg: 4 resp: 161052.12 using dmeth: 16 #groups: 4 Cost per ptn: 49.68 #ptns: 4 hash_area: 16384 (max=16384) buildfrag: 5530 probefrag: 524636 ppasses: 1 buildfrag: 5530 probefrag: 524636 passes: 1 Hash join: Resc: 585106.45 Resp: 162524.05 [multiMatchCost=0.00] ---- cost(Hash Join) = 162524.05 (w/o dist), 162526.86 (w/ dist)
Since the cost are (correctly) computed, the query optimizer is able to choose the optimal plan. However, it would be nice to have the actual costs in the execution plans.
Christian, thanks for this. I did a test joining two equi-partitioned tables on the partition column and the highly efficient “none-none” distribution method was not selected without a hint. In fact the CBO used “hash-hash”. Unfortunately, I seem to have lost the test case and can no longer reproduce it to get at the 10053 trace. Is there any way that “hash-hash” could be deemed “better” than “none-none” for such a query?
Hi Tony
IIRC, when I saw similar (?) cases, the problem was “due” to partition pruning. In other words, mixing FPWJ and partition pruning is not always successful.
For a join I see no case where a “none-none” might be worse than a “hash-hash”.
Cheers,
Chris
@Tony
In all likelihood your plan was not a FPWJ even though you joined on the partition keys. There would be no row redistribution (none-none) for a FPWJ plan. The DOP influences whether or not you get a FPWJ or whether there is redistribution.
So the answer to your question is yes; if the DOP does not match the number of partitions needed to be joined for a FPWJ plan then you will see a row redistribution (hash-hash) plan.
Christian, your comment reminded me of my test case. The two tables that I joined were both range partitioned by date and subpartitioned by hash on the join column. The query I ran selected a specific date in the WHERE clause. The hint was required to get the “none none” distribution mechanism despite that being demonstrably faster.
Can you explain this or do I need to look at the 10053 trace?
Thanks again.
Hi Tony
Sorry, I have no real explanation for that. It just matches something that I already observed several times… i.e. partition pruning does not always match well with PWJ. To understand what’s going on, you should have a look to the 10053 trace.
Cheers,
Chris
Hi Christian,
Can you point my to some documentation about the px-distribution methods?
I searched for it but most performance tuning books do not go into detail about it.
regards Hans-Peter
Hi Hans-Peter
Sorry, but I’m also not aware of a detailed description of the distribution methods. In my book, at page 493, I added some remarks about them.
http://books.google.com/books?id=b3DIkYO2gBQC&lpg=PP1&dq=troubleshooting%20oracle%20performance&pg=PA493#v=onepage&q=&f=false
HTH
Chris
The only stuff I have been able to find is in the documentation for the PQ_DISTRIBUTE hint but it is still quite useful. I would buy Christian’s book anyway :-)
[…] 19-How does CBO behave with different PX Distribution Methods ? Christian Antognini-Does the Query Optimizer Cost PX Distribution Methods? […]
I’ve seen problems with combined partitions and PWJ as in Tony Hasler’s test. In my case it was caused by bug 6476257.
Another requirement was that default DOP of a table should match the number of subpartitions.
Hi Michael
Thank for the interesting information. Unfortunately the base bug, 6714199, is not public. Anyway, this is an addition evidence that there are problems in this area.
Cheers,
Chris
[…] asked Christian about this and his full reply came in this blog. Christian was able to explain why the estimated costs in the execution plan were misleading and […]
[…] The Cost of PX Distribution Method by Christian Antognini […]