21 November, 2009

SIZE specification for Column Histograms

When you use the method_opt parameter in DBMS_STATS.GATHER_TABLE_STATS specifying 'FOR [ALL] COLUMNS [column names] SIZE n' remember to lookup the syntax again. I made the mistake of not verifying the syntax for specifying SIZE when I name columns in the FOR ... SIZE clause.

Simply put, if you specify FOR ALL COLUMNS, the SIZE specification of 'n' does apply to every column. But if you specify individual column names and the SIZE is specified last, Oracle applies the SIZE specification only to the last named column while using it's own limit of 75 buckets where you do not specify a value for SIZE (and haven't specified SIZE AUTO). This means that if you have more than 75 (but less than 250) distinct values, Oracle creates a Height Balanced Histogram instead of the Frequency Histogram that you had desired. (Of course, if you have less than 75 distinct values, Oracle does create the right Frequency Histogram with the corresponding number of buckets to capture each distinct value).

Thus, in this demo, the first GATHER_TABLE_STATS run creates the proper histograms for all 5 columns because I specify "FOR ALL COLUMNS SIZE 250".
However, in the second run, when I name each individual column but do not attach a SIZE specification, Oracle trims columns C2_100D_NUMBERS and C4_100D_VARCHARS down to Height Balanced Histograms with 75 buckets only.
In the third run, I attache a SIZE specification to each column name and achieve the desired Histograms.


SQL> col column_name format a30
SQL>
SQL> Drop Table TEST_SIZE_SPEC purge;

Table dropped.

SQL>
SQL> Create Table TEST_SIZE_SPEC (
2 c1_10d_numbers number not null,
3 c2_100d_numbers number not null,
4 c3_10d_varchars varchar2(5) not null,
5 c4_100d_varchars varchar2(5) not null,
6 c5_dummy number not null)
7 /

Table created.

SQL>
SQL> declare
2 i number;
3 begin
4 for i in 1..10000
5 loop
6 insert into TEST_SIZE_SPEC values (
7 mod(i,10), mod(i,100), 'C3_'||mod(i,10),'C4_'||mod(i,100), i
8 );
9 end loop;
10 end;
11 /

PL/SQL procedure successfully completed.

SQL>
SQL> select count(distinct(c1_10d_numbers)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C1_10D_NUMBERS))
-------------------------------
10

SQL> select count(distinct(c2_100d_numbers)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C2_100D_NUMBERS))
--------------------------------
100

SQL> select count(distinct(c3_10d_varchars)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C3_10D_VARCHARS))
--------------------------------
10

SQL> select count(distinct(c4_100d_varchars)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C4_100D_VARCHARS))
---------------------------------
100

SQL> select count(distinct(c5_dummy)) from TEST_SIZE_SPEC;

COUNT(DISTINCT(C5_DUMMY))
-------------------------
10000

SQL>
SQL>
SQL> REM =========================================================================================================
SQL> REM =========================================================================================================
SQL> -- Gather stats with 100% sampling but SIZE 250
SQL> exec dbms_stats.gather_table_stats('','TEST_SIZE_SPEC',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE 250');

PL/SQL procedure successfully completed.

SQL>
SQL> -- Review column statistics, histograms
SQL> select column_name, sample_size, num_distinct, histogram, num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_SIZE_SPEC'
4 order by column_name
5 /

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT HISTOGRAM NUM_BUCKETS
------------------------------ ----------- ------------ --------------- -----------
C1_10D_NUMBERS 10000 10 FREQUENCY 10
C2_100D_NUMBERS 10000 100 FREQUENCY 100
C3_10D_VARCHARS 10000 10 FREQUENCY 10
C4_100D_VARCHARS 10000 100 FREQUENCY 100
C5_DUMMY 10000 10000 HEIGHT BALANCED 250

SQL>
SQL> select column_name, count(*)
2 from user_tab_histograms
3 where table_name = 'TEST_SIZE_SPEC'
4 group by column_name
5 order by column_name
6 /

COLUMN_NAME COUNT(*)
------------------------------ ----------
C1_10D_NUMBERS 10
C2_100D_NUMBERS 100
C3_10D_VARCHARS 10
C4_100D_VARCHARS 100
C5_DUMMY 251

SQL>
SQL> REM =========================================================================================================
SQL> REM =========================================================================================================
SQL> -- Now, reGather statistics but specifying column names
SQL> exec dbms_stats.gather_table_stats('','TEST_SIZE_SPEC',estimate_percent=>100,-
> method_opt=>'FOR COLUMNS C1_10D_NUMBERS, C2_100D_NUMBERS, C3_10D_VARCHARS, C4_100D_VARCHARS, C5_DUMMY SIZE 250');

PL/SQL procedure successfully completed.

SQL>
SQL> -- Review column statistics, histograms
SQL> select column_name, sample_size, num_distinct, histogram, num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_SIZE_SPEC'
4 order by column_name
5 /

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT HISTOGRAM NUM_BUCKETS
------------------------------ ----------- ------------ --------------- -----------
C1_10D_NUMBERS 10000 10 FREQUENCY 10
C2_100D_NUMBERS 10000 100 HEIGHT BALANCED 75
C3_10D_VARCHARS 10000 10 FREQUENCY 10
C4_100D_VARCHARS 10000 100 HEIGHT BALANCED 75
C5_DUMMY 10000 10000 HEIGHT BALANCED 250

SQL>
SQL> select column_name, count(*)
2 from user_tab_histograms
3 where table_name = 'TEST_SIZE_SPEC'
4 group by column_name
5 order by column_name
6 /

COLUMN_NAME COUNT(*)
------------------------------ ----------
C1_10D_NUMBERS 10
C2_100D_NUMBERS 76
C3_10D_VARCHARS 10
C4_100D_VARCHARS 76
C5_DUMMY 251

SQL>
SQL>
SQL> REM =========================================================================================================
SQL> REM =========================================================================================================
SQL> -- Now, reGather statistics but specifying column names and size each
SQL> exec dbms_stats.gather_table_stats('','TEST_SIZE_SPEC',estimate_percent=>100,-
> method_opt=>'FOR COLUMNS C1_10D_NUMBERS SIZE 250, C2_100D_NUMBERS SIZE 250, C3_10D_VARCHARS SIZE 250, C4_100D_VARCHARS SIZE 250, C5_DUMMY SIZE 250');

PL/SQL procedure successfully completed.

SQL>
SQL> -- Review column statistics, histograms
SQL> select column_name, sample_size, num_distinct, histogram, num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_SIZE_SPEC'
4 order by column_name
5 /

COLUMN_NAME SAMPLE_SIZE NUM_DISTINCT HISTOGRAM NUM_BUCKETS
------------------------------ ----------- ------------ --------------- -----------
C1_10D_NUMBERS 10000 10 FREQUENCY 10
C2_100D_NUMBERS 10000 100 FREQUENCY 100
C3_10D_VARCHARS 10000 10 FREQUENCY 10
C4_100D_VARCHARS 10000 100 FREQUENCY 100
C5_DUMMY 10000 10000 HEIGHT BALANCED 250

SQL>
SQL> select column_name, count(*)
2 from user_tab_histograms
3 where table_name = 'TEST_SIZE_SPEC'
4 group by column_name
5 order by column_name
6 /

COLUMN_NAME COUNT(*)
------------------------------ ----------
C1_10D_NUMBERS 10
C2_100D_NUMBERS 100
C3_10D_VARCHARS 10
C4_100D_VARCHARS 100
C5_DUMMY 251

SQL>



Therefore, be careful when you name individual columns in the method_opt parameter 'FOR COLUMNS ...'. Ensure that you specify the desired SIZE for each column separately (even if the SIZE is to be the same -- i.e. 250 in my case).


UPDATE : As pointed out by Randolf, an alternative command is
"FOR COLUMNS SIZE 250 C1_10D_NUMBERS, C2_100D_NUMBERS, C3_10D_VARCHARS, C4_100D_VARCHARS, C5_DUMMY"
so that the SIZE n specification does not have to be repeated after every column name.
.
.
.

6 comments:

Randolf said...

Hemant,

instead of explicitly mentioning the SIZE for every column, there is an alternative of the syntax that might be useful in this case:

FOR COLUMNS SIZE 250 C1_10D_NUMBERS, C2_100D_NUMBERS, C3_10D_VARCHARS, C4_100D_VARCHARS, C5_DUMMY

So you can effectively apply different bucket sizes to different lists of columns.

Regards,
Randolf

Oracle related stuff blog:
http://oracle-randolf.blogspot.com/

Co-author of the upcoming "OakTable Expert Oracle Practices" book:
http://www.apress.com/book/view/1430226684
http://www.amazon.de/Expert-Oracle-Practices-Database-Administration/dp/1430226684

Hemant K Chitale said...

Randolf,
Yes, that is a good alternative as well -- to provide the SIZE n value upfront.
Thanks.
Hemant

Home Audio Accessories said...

thanks for the syntax, it help me alot.

Anonymous said...

From Wolfgang Breitling's paper:
"Statistics Gathering and Histograms DOs ✔ and DON’Ts"

He wrote:
Do use “for all columns size 1”
then
✔ “for columns size 254 col1[,col2,...]”
✘ not “for columns col1[,col2,...] size 254” *
for columns requiring histograms.

and

How To Quickly Add/Remove Column Statistics (Histograms) For A Column [ID 390249.1]

•gathering statistics for multiple columns with different bucket sizes :SQL> exec dbms_stats.gather_table_stats(null, '', cascade=>false, method_opt => 'for columns size N , for columns size M ,');
•gathering statistics for multiple columns with different bucket sizes plus minimal information for the remaining columns:SQL> exec dbms_stats.gather_table_stats(null, '', cascade=>false, method_opt => 'for all columns size 1 for columns size N , for columns size M ,');

ganez said...

Hi...

Could you tell me what is the difference between FOR COLUMNS SIZE 250 and FOR COLUMNS SIZE 250 C1_10D_NUMBERS, C2_100D_NUMBERS, C3_10D_VARCHARS, C4_100D_VARCHARS, C5_DUMMY?

I meant instead of giving size in each columns, Shall we give simply like FOR COLUMNS SIZE 250. Am i right? please clarify me...

Thanks a lot...ganesh

Hemant K Chitale said...

Ganesh,

When not using FOR ALL COLUMNS, you could specify individual columns (e.g. 3 of the 5 columns here) with the FOR COLUMNS clause.