13 November, 2009

Sample Sizes : Table level and Column level

There appears to be some confusion about Sample Size that the "Auto" behaviour in DBMS_STATS determines.

The AUTO_SAMPLE_SIZE for the "ESTIMATE_PERCENT" parameter doesn't mean the size as "SIZE AUTO" for the "METHOD_OPT" parameter.
The former is for Table level statistics.
The latter is for Column Histograms. NULLs are not included in Histograms.

Here is a demonstration :

SQL> drop table Test_Stats_Size purge;

Table dropped.

SQL>
SQL> create table Test_Stats_Size
2 as select object_id col_1, owner col_2, object_name col_3, created col_4
3 from dba_objects
4 where 1=2
5 /

Table created.

SQL> alter table test_stats_size add (padding_col varchar2(80));

Table altered.

SQL> alter table test_stats_size nologging;

Table altered.

SQL>
SQL> insert /*+ APPEND */ into test_stats_size
2 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
3 union all
4 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
5 union all
6 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
7 union all
8 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
9 union all
10 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
11 union all
12 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
13 union all
14 select mod(object_id,50), owner, object_name, created, dbms_random.string('X',60) from dba_objects
15 /

354501 rows created.

SQL>
SQL> commit;

Commit complete.

SQL>
SQL> select count(*) from test_stats_size ;

COUNT(*)
----------
354501

SQL> select count(*) from test_stats_size where col_2 = 'HEMANT';

COUNT(*)
----------
245

SQL>
SQL>
SQL> -- Gather Stats with 100% sampling but Auto Size for columns
SQL> exec dbms_stats.gather_table_stats('','TEST_STATS_SIZE',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

SQL>
SQL> select table_name, num_rows, sample_size, sample_size*100/num_rows Sample_Perc
2 from user_tables
3 where table_name = 'TEST_STATS_SIZE'
4 /

TABLE_NAME NUM_ROWS SAMPLE_SIZE SAMPLE_PERC
------------------------------ ---------- ----------- -----------
TEST_STATS_SIZE 354501 354501 100

SQL> REM We see above that Table Level Sampling is 100%
SQL>
SQL> select column_name, num_distinct, sample_size, histogram , num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_STATS_SIZE'
4 order by column_name
5 /

COLUMN_NAME NUM_DISTINCT SAMPLE_SIZE HISTOGRAM NUM_BUCKETS
------------------------------ ------------ ----------- --------------- -----------
COL_1 50 354494 NONE 1
COL_2 28 354501 FREQUENCY 28
COL_3 30311 354501 NONE 1
COL_4 1411 354501 NONE 1
PADDING_COL 354501 354501 NONE 1

SQL>
SQL> REM Was Column Level Sampling 100% ?
SQL> REM What type of Histograms and how many buckets did Oracle create ?
SQL> REM We see above that not necessarily all columns have the same number of rows sampled
SQL> REM Also, not necessarily all columns have a Histogram !
SQL>
SQL> REM Let's see what happens after a couple of queries on col_1
SQL>
SQL> select count(*) from test_stats_size where col_1 is null;

COUNT(*)
----------
7

SQL> select count(distinct(col_1)) from test_stats_size;

COUNT(DISTINCT(COL_1))
----------------------
50

SQL> select count(*) from test_stats_size where col_1 between 10 and 25;

COUNT(*)
----------
113092

SQL>
SQL> -- Gather Stats with 100% sampling but Auto Size for columns
SQL> exec dbms_stats.gather_table_stats('','TEST_STATS_SIZE',estimate_percent=>100,method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

SQL>
SQL> select table_name, num_rows, sample_size, sample_size*100/num_rows Sample_Perc
2 from user_tables
3 where table_name = 'TEST_STATS_SIZE'
4 /

TABLE_NAME NUM_ROWS SAMPLE_SIZE SAMPLE_PERC
------------------------------ ---------- ----------- -----------
TEST_STATS_SIZE 354501 354501 100

SQL> REM We see above that Table Level Sampling is 100%
SQL>
SQL> select column_name, num_distinct, sample_size, histogram , num_buckets
2 from user_tab_columns
3 where table_name = 'TEST_STATS_SIZE'
4 order by column_name
5 /

COLUMN_NAME NUM_DISTINCT SAMPLE_SIZE HISTOGRAM NUM_BUCKETS
------------------------------ ------------ ----------- --------------- -----------
COL_1 50 354494 FREQUENCY 50
COL_2 28 354501 FREQUENCY 28
COL_3 30311 354501 NONE 1
COL_4 1411 354501 NONE 1
PADDING_COL 354501 354501 NONE 1

SQL>
SQL> REM However, Column Level Sampling and Histogram collection for col_1 is now different !
SQL> REM Also notice that the 7 NULLs in col_1 are excluded !
SQL>
SQL>

And why did Oracle now gather statistics on col_1 ? See my earlier postings on COL_USAGE influencing SIZE AUTO.
.
.
.

3 comments:

Santosh Kumar said...

Nice explanation. Thanks Hemant.

Anonymous said...

Hemant, thanks a lot for the knowledge, i did know that column level stats are done based on what queries are already run for that column.

Thanks
Deen

Hemant K Chitale said...

Anonymous,

"column level stats are done based on what queries are already run for that column" yes, when using "SIZE AUTO" (which is the default if method_opt is not specified).

Hemant K CHitale