07 September, 2009

Table and Partition Statistics

When gathering statistics on Partitioned Tables, it is important to know the difference between Table Level and Partition Level (and, if applicable, SubPartition Level) statistics.

The Optimizer will use Partition Level statistics if the query predicates are on the Partition Keys. However, if the query predicates are not on these columns, (OR even if on the Partition Key do not allow for pruning to a *single partition* -- see update of 08-Sep below) the Optimizer relies on Table Level Statistics.

ANALYZE would aggregate Table Level statistics from Partition Level statistics. DBMS_STATS allows gathering of both or either statistics.

Here is an example showing how the Optimizer can badly estimate cardinality on a non-partition key column in the absence of Table Level statistics.

I have used the demo SH.SALES table which is Partitioned by TIME_ID. However, CUST_ID is a very relevant *dimension* column.


SQL> desc sh.sales
Name Null? Type
------------------------------------------------------------------------ -------- -------------------------------------------------
PROD_ID NOT NULL NUMBER
CUST_ID NOT NULL NUMBER
TIME_ID NOT NULL DATE
CHANNEL_ID NOT NULL NUMBER
PROMO_ID NOT NULL NUMBER
QUANTITY_SOLD NOT NULL NUMBER(10,2)
AMOUNT_SOLD NOT NULL NUMBER(10,2)

SQL>
SQL> REM The SALES table is Range Partitioned by TIME_ID
SQL> REM CUST_ID is a Dimension
SQL>
SQL> -- Current Statistics
SQL> --- NOTE !! Warning : For simplicity of reading I have converted NULL statistics to 0 (zero) for NUM_ROWS
SQL> -- However, in reality there is a big difference between NULL NUM_ROWS (meaning no statistics) and 0 NUM_ROWS (really meaning zero rows) !
SQL> exec dbms_stats.delete_table_stats('SH','SALES');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_CHANNEL_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_PROMO_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_PROD_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_CUST_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_TIME_BIX');

PL/SQL procedure successfully completed.

SQL> select table_name, nvl(num_rows,0) from dba_tables where owner = 'SH' and table_name = 'SALES';

TABLE_NAME NVL(NUM_ROWS,0)
------------------------------ ---------------
SALES 0

SQL> select partition_name, nvl(num_rows,0) from dba_tab_partitions where table_owner = 'SH' and table_name = 'SALES' order by partition_position;

PARTITION_NAME NVL(NUM_ROWS,0)
------------------------------ ---------------
SALES_1995 0
SALES_1996 0
SALES_H1_1997 0
SALES_H2_1997 0
SALES_Q1_1998 0
SALES_Q2_1998 0
SALES_Q3_1998 0
SALES_Q4_1998 0
SALES_Q1_1999 0
SALES_Q2_1999 0
SALES_Q3_1999 0
SALES_Q4_1999 0
SALES_Q1_2000 0
SALES_Q2_2000 0
SALES_Q3_2000 0
SALES_Q4_2000 0
SALES_Q1_2001 0
SALES_Q2_2001 0
SALES_Q3_2001 0
SALES_Q4_2001 0
SALES_Q1_2002 0
SALES_Q2_2002 0
SALES_Q3_2002 0
SALES_Q4_2002 0
SALES_Q1_2003 0
SALES_Q2_2003 0
SALES_Q3_2003 0
SALES_Q4_2003 0

28 rows selected.

SQL>
SQL> REM REM ############################################################
SQL> -- Delete and re-gather statistics
SQL> -- statistics at partition level
SQL> exec dbms_stats.delete_table_stats('SH','SALES');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_CHANNEL_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_PROMO_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_PROD_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_CUST_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_TIME_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats('SH','SALES',estimate_percent=>100,granularity=>'PARTITION',cascade=>TRUE,no_invalidate=>FALSE);

PL/SQL procedure successfully completed.

SQL> select table_name, num_rows from dba_tables where owner = 'SH' and table_name = 'SALES';

TABLE_NAME NUM_ROWS
------------------------------ ----------
SALES 918843

SQL> select num_distinct, num_buckets, nvl(sample_size,0), histogram from dba_tab_col_statistics
2 where owner = 'SH' and table_name = 'SALES' and column_name = 'CUST_ID';

NUM_DISTINCT NUM_BUCKETS NVL(SAMPLE_SIZE,0) HISTOGRAM
------------ ----------- ------------------ ---------------
3203 1 0 NONE

SQL>
SQL>
SQL> explain plan for
2 select cust_id,time_id,quantity_sold
3 from sh.sales where time_id between to_date('01-JAN-2001','DD-MON-YYYY') and to_date('31-JUL-2001','DD-MON-YYYY');

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 279964487

--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 134K| 2094K| 104 (4)| 00:00:02 | | |
| 1 | PARTITION RANGE ITERATOR| | 134K| 2094K| 104 (4)| 00:00:02 | 17 | 19 |
|* 2 | TABLE ACCESS FULL | SALES | 134K| 2094K| 104 (4)| 00:00:02 | 17 | 19 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("TIME_ID"<=TO_DATE(' 2001-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) 14 rows selected. SQL>
SQL> select count(*) from sh.sales where time_id between to_date('01-JAN-2001','DD-MON-YYYY') and to_date('31-JUL-2001','DD-MON-YYYY');

COUNT(*)
----------
145760

SQL>
SQL> explain plan for
2 select cust_id,time_id,quantity_sold
3 from sh.sales where cust_id in (10,100,1001);

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2821443835

----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 861 | 13776 | 226 (0)| 00:00:03 | | |
| 1 | PARTITION RANGE ALL | | 861 | 13776 | 226 (0)| 00:00:03 | 1 | 28 |
| 2 | INLIST ITERATOR | | | | | | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| SALES | 861 | 13776 | 226 (0)| 00:00:03 | 1 | 28 |
| 4 | BITMAP CONVERSION TO ROWIDS | | | | | | | |
|* 5 | BITMAP INDEX SINGLE VALUE | SALES_CUST_BIX | | | | | 1 | 28 |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - access("CUST_ID"=10 OR "CUST_ID"=100 OR "CUST_ID"=1001)

17 rows selected.

SQL>
SQL> select count(*) from sh.sales where cust_id in (10,100,1001);

COUNT(*)
----------
252

SQL>
SQL>
SQL>
SQL> REM REM ############################################################
SQL>
SQL> -- Delete and re-gather statistics
SQL> -- statistics at table and partition level
SQL> exec dbms_stats.delete_table_stats('SH','SALES');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_CHANNEL_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_PROMO_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_PROD_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_CUST_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.delete_index_stats('SH','SALES_TIME_BIX');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats('SH','SALES',estimate_percent=>100,granularity=>'ALL',cascade=>TRUE,no_invalidate=>FALSE);

PL/SQL procedure successfully completed.

SQL> select table_name, num_rows from dba_tables where owner = 'SH' and table_name = 'SALES';

TABLE_NAME NUM_ROWS
------------------------------ ----------
SALES 918843

SQL> select num_distinct, num_buckets, nvl(sample_size,0), histogram from dba_tab_col_statistics
2 where owner = 'SH' and table_name = 'SALES' and column_name = 'CUST_ID';

NUM_DISTINCT NUM_BUCKETS NVL(SAMPLE_SIZE,0) HISTOGRAM
------------ ----------- ------------------ ---------------
7059 1 918843 NONE

SQL>
SQL> explain plan for
2 select cust_id,time_id,quantity_sold
3 from sh.sales where time_id between to_date('01-JAN-2001','DD-MON-YYYY') and to_date('31-JUL-2001','DD-MON-YYYY');

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 279964487

--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 134K| 2094K| 104 (4)| 00:00:02 | | |
| 1 | PARTITION RANGE ITERATOR| | 134K| 2094K| 104 (4)| 00:00:02 | 17 | 19 |
|* 2 | TABLE ACCESS FULL | SALES | 134K| 2094K| 104 (4)| 00:00:02 | 17 | 19 |
--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("TIME_ID"<=TO_DATE(' 2001-07-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) 14 rows selected. SQL>
SQL> select count(*) from sh.sales where time_id between to_date('01-JAN-2001','DD-MON-YYYY') and to_date('31-JUL-2001','DD-MON-YYYY');

COUNT(*)
----------
145760

SQL>
SQL> explain plan for
2 select cust_id,time_id,quantity_sold
3 from sh.sales where cust_id in (10,100,1001);

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2821443835

----------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
----------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 390 | 6240 | 158 (0)| 00:00:02 | | |
| 1 | PARTITION RANGE ALL | | 390 | 6240 | 158 (0)| 00:00:02 | 1 | 28 |
| 2 | INLIST ITERATOR | | | | | | | |
| 3 | TABLE ACCESS BY LOCAL INDEX ROWID| SALES | 390 | 6240 | 158 (0)| 00:00:02 | 1 | 28 |
| 4 | BITMAP CONVERSION TO ROWIDS | | | | | | | |
|* 5 | BITMAP INDEX SINGLE VALUE | SALES_CUST_BIX | | | | | 1 | 28 |
----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

5 - access("CUST_ID"=10 OR "CUST_ID"=100 OR "CUST_ID"=1001)

17 rows selected.

SQL>
SQL> select count(*) from sh.sales where cust_id in (10,100,1001);

COUNT(*)
----------
252

SQL>



Let's compare the two runs.

The first run of queries is with granularity=>'PARTITION'.
In this case, for the CUST_ID, Oracle estimates 3,203 distinct values and no Histogram. It actually did not gather statistics based on any sampling of CUST_IDs. !
For the query on CUST_ID IN (10,100,1001), Oracle estimates a Cardinality of 861 rows (while the actual count is 252 rows).

The second run of queries is with granularity=>'ALL'.
Here, for the CUST_ID, Oracle estimates 7,059 distinct values, based on a sampling of 918,843 rows. It hsa not yet created a Histogram.
For the query on CUST_ID IN (10,100,1001), Oracle estimates a Cardinality of 390 rows, which is closer to the actual count of 252 rows.

Thus, although gathering statistics only at the Partition Level may make sense where your query predicates are always based on the Partition Key and Partition Pruning can happen, Table Level statistics are necessary for queries where Partition Pruning cannot be done.

.
.

And I have not even started improving the quality of the column statistics with the correct FOR COLUMNS SIZE specification for METHOD_OPT !
.
.
UPDATE : 08-Sep : Randolf Geist has pointed out "I think it's important to point out that partition or subpartition level statistics only get used if the optimizer is able to prune to a single partition/subpartition. Otherwise the next level statistics (partition/global) will get used.".
This also ties in with MetaLink Note#166215.1

So if I rerun the test where the query will hit a *single* partition only, I get :

SQL> explain plan for
2 select cust_id,time_id,quantity_sold
3 from sh.sales where time_id between to_date('01-JAN-2001','DD-MON-YYYY') and to_date('28-FEB-2001','DD-MON-YYYY');

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 4096232376

------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 39925 | 623K| 35 (3)| 00:00:01 | | |
| 1 | PARTITION RANGE SINGLE| | 39925 | 623K| 35 (3)| 00:00:01 | 17 | 17 |
|* 2 | TABLE ACCESS FULL | SALES | 39925 | 623K| 35 (3)| 00:00:01 | 17 | 17 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("TIME_ID"<=TO_DATE(' 2001-02-28 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) 14 rows selected. SQL>
SQL> select count(*) from sh.sales where time_id between to_date('01-JAN-2001','DD-MON-YYYY') and to_date('28-FEB-2001','DD-MON-YYYY');

COUNT(*)
----------
39924

SQL>

Thus, when the query is against a single partition -- such that Oracle can apply partition pruning -- the statistics from that specific partition are used. The cardinality estimate is more accurate -- coming to 39,925 against the actual count of 39,924.


.
.
.

5 comments:

Randolf said...

Hemant,

I think it's important to point out that partition or subpartition level statistics only get used if the optimizer is able to prune to a single partition/subpartition. Otherwise the next level statistics (partition/global) will get used.

Randolf

Hemant K Chitale said...

Randolf,
I've updated my post.

Anand said...

Hi HEmant Sir,

You stated:-
"And I have not even started improving the quality of the column statistics with the correct FOR COLUMNS SIZE specification for METHOD_OPT !"

How do we come to know what will be the correct specification for the METHOD OPTS "FOR COLUMN SIZE xxx"


Regards,
Anand

Hemant K Chitale said...

Anand,

With knowledge of the data -- e.g. even a simple "SELECT column, count(*) FROM table GROUP BY column ORDER BY column" can tell us about data skew.

If the data is skewed and we know that our application runs queries (using literals) for the skewed values, a Histogram would be helpful. Note that Histograms can be used by the Optimizer for selecting between Join options even when the query doesn't specify filter predicates on the columns, only Joins.

Hemant

Rahul said...

Hi,

There are lots of way to gather statistics of partition tables.

Thanks and regards,
Rahul