10 November, 2008

Histogram (skew) on Unique Values

Till a few days ago, I thought that, since NOT NULL column have only Unique Values would not need a Histogram, Oracle would not generate a Histogram, there being no data skew when no value is repeated.

However, Randolf Geist, in a recent forums.oracle.com thread, suggested that it is possible for Oracle's "auto" behaviour to generate a Histogram on such data where there is a noticeable gap in the data. That seems to be a result of the way the skewness detection algorith works.

So I set out to test it. This quick test is on 10.2.0.4 (64bit Linux). I have 105,329 unique values ranging from "2" to "259290". I have some gaps in the values, but I ensure that Uniqueness is maintained.


SQL>
SQL> drop table t_ah_o_g_u ;

Table dropped.

SQL> create table t_ah_o_g_u (gapped_uniq_number number not null, a_string varchar2(128));

Table created.

SQL>
SQL> create unique index t_ah_o_g_u_uk on t_ah_o_g_u (gapped_uniq_number);

Index created.

SQL>
SQL> REM Insert Unique Values but with significant gaps between Ranges
SQL> insert into t_ah_o_g_u select object_id, object_name from dba_objects where object_id between 2 and 20;

19 rows created.

SQL> insert into t_ah_o_g_u select object_id, object_name from dba_objects where object_id between 30000 and 32000;

2001 rows created.

SQL> insert into t_ah_o_g_u select object_id, object_name from dba_objects where object_id > 50000;

2087 rows created.

SQL> insert into t_ah_o_g_u select object_id+100000, object_name from dba_objects ;

50611 rows created.

SQL> insert into t_ah_o_g_u select object_id+200000, object_name from dba_objects ;

50611 rows created.

SQL>
SQL>
SQL> select count(*) from t_ah_o_g_u;

COUNT(*)
----------
105329

SQL> REM let's check the skew
SQL> select min(gapped_uniq_number), max(gapped_uniq_number) from t_ah_o_g_u;

MIN(GAPPED_UNIQ_NUMBER) MAX(GAPPED_UNIQ_NUMBER)
----------------------- -----------------------
2 259290

SQL> select count(*) from t_ah_o_g_u where gapped_uniq_number < 50000 ;

COUNT(*)
----------
2020

SQL> select count(*) from t_ah_o_g_u where gapped_uniq_number between 50000 and 100000;

COUNT(*)
----------
2087

SQL> select count(*) from t_ah_o_g_u where gapped_uniq_number between 100000 and 200000;

COUNT(*)
----------
50611

SQL> select count(*) from t_ah_o_g_u where gapped_uniq_number > 200000;

COUNT(*)
----------
50611

SQL>
SQL> REM =============================================================================================
SQL> REM =============================================================================================
SQL> REM We know that where the column is a Unique Key, there cn be no data skew -- and we need no histogram
SQL> REM Does Oracle's autogather create a histogram ?
SQL>
SQL> REM =============================================================================================
SQL> REM with SIZE AUTO
SQL> exec dbms_stats.gather_table_stats(user,'T_AH_O_G_U',method_opt=>'FOR ALL COLUMNS SIZE AUTO', estimate_percent=>100, cascade=>TRUE);

PL/SQL procedure successfully completed.

SQL>
SQL> REM Let's look for histograms
SQL> select num_rows, sample_size
2 from user_tables where table_name = 'T_AH_O_G_U';

NUM_ROWS SAMPLE_SIZE
---------- -----------
105329 105329

SQL> select low_value, high_value, num_distinct,num_buckets,sample_size, histogram
2 from user_tab_col_statistics where table_name = 'T_AH_O_G_U' and column_name = 'GAPPED_UNIQ_NUMBER';

LOW_VALUE HIGH_VALUE
---------------------------------------------------------------- ----------------------------------------------------------------
NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ----------- ----------- ---------------
C103 C31A5D5B
105329 254 105329 HEIGHT BALANCED


SQL> select count(*)
2 from user_tab_histograms where table_name = 'T_AH_O_G_U' and column_name = 'GAPPED_UNIQ_NUMBER';

COUNT(*)
----------
255

SQL> REM =============================================================================================
SQL> REM =============================================================================================
SQL> REM !!!! Yes, the SIZE AUTO did create a Height Balanced Histogram with 254 buckets !
SQL> REM =============================================================================================
SQL> REM =============================================================================================
SQL>
SQL> REM =============================================================================================
SQL> REM with SIZE SKEWONLY
SQL> exec dbms_stats.delete_table_stats(user,'T_AH_O_G_U');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats(user,'T_AH_O_G_U',method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY', estimate_percent=>100, cascade=>TRUE);

PL/SQL procedure successfully completed.

SQL>
SQL> REM Let's look for histograms
SQL> select num_rows, sample_size
2 from user_tables where table_name = 'T_AH_O_G_U';

NUM_ROWS SAMPLE_SIZE
---------- -----------
105329 105329

SQL> select low_value, high_value, num_distinct,num_buckets,sample_size, histogram
2 from user_tab_col_statistics where table_name = 'T_AH_O_G_U' and column_name = 'GAPPED_UNIQ_NUMBER';

LOW_VALUE HIGH_VALUE
---------------------------------------------------------------- ----------------------------------------------------------------
NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ----------- ----------- ---------------
C103 C31A5D5B
105329 254 105329 HEIGHT BALANCED


SQL> select count(*)
2 from user_tab_histograms where table_name = 'T_AH_O_G_U' and column_name = 'GAPPED_UNIQ_NUMBER';

COUNT(*)
----------
255

SQL> REM =============================================================================================
SQL> REM =============================================================================================
SQL> REM !!!! Yes, the SKEWONLY did create a Height Balanced Histogram with 254 buckets !
SQL> REM =============================================================================================
SQL> REM =============================================================================================
SQL>
SQL>
SQL>
SQL> REM =============================================================================================
SQL> REM with SIZE 1 -- reverting to the 9i default behaviour
SQL> exec dbms_stats.delete_table_stats(user,'T_AH_O_G_U');

PL/SQL procedure successfully completed.

SQL> exec dbms_stats.gather_table_stats(user,'T_AH_O_G_U',method_opt=>'FOR ALL COLUMNS SIZE 1', estimate_percent=>100, cascade=>TRUE);

PL/SQL procedure successfully completed.

SQL>
SQL> REM Let's look for histograms
SQL> select num_rows, sample_size
2 from user_tables where table_name = 'T_AH_O_G_U';

NUM_ROWS SAMPLE_SIZE
---------- -----------
105329 105329

SQL> select low_value, high_value, num_distinct,num_buckets,sample_size, histogram
2 from user_tab_col_statistics where table_name = 'T_AH_O_G_U' and column_name = 'GAPPED_UNIQ_NUMBER';

LOW_VALUE HIGH_VALUE
---------------------------------------------------------------- ----------------------------------------------------------------
NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ----------- ----------- ---------------
C103 C31A5D5B
105329 1 105329 NONE


SQL> select count(*)
2 from user_tab_histograms where table_name = 'T_AH_O_G_U' and column_name = 'GAPPED_UNIQ_NUMBER';

COUNT(*)
----------
2

SQL> REM =============================================================================================
SQL> REM This is the correct (ie as I expect it to be) behaviour ! -- No Histogram on a Unique Key
SQL> REM =============================================================================================
SQL>
SQL> spool off


So we do have a case where SIZE AUTO and SIZE SKEWONLY create a Histogram on Unique Values.

Would we have such data in real life ? Certainly.
These numbers could be, say, Identifier numbers for, say, a Rewards Program membership, which are reset every 6 months or every year such that a large gap is created between each set of numbers (the Sequence generating the numbers is recreated with a large, gapped, START WITH value over the previous incarnation's LAST_NUMBER).
These could be Product Codes for different sets of products , which a range (0 to 50,000 for one Business Unit or Subsidiary, 50,000 to 100,000 for another Business Unit or subsidiary etc) allocated to each Business Unit/Subsidiary. These could be Bar Code numbers generated in batches with a deliberately created gap to preven overlapping.

2 comments:

robert said...

You say "This is the correct (ie as I expect it to be) behaviour ! -- No Histogram on a Unique Key" in the comment to your last test. The word "correct" makes it sound as if Oracle is in error here. But, while usually unique indexes are used for lookups (i.e. equality test on the column(s)), as soon as you do a range scan a histogram might be very beneficial with skewed data. This becomes even more likely for a multi column unique index where there is repetition of individual values. My 0.02EUR...

Hemant K Chitale said...

In the case where there really is "clustering" of values (eg very many more "gapped_uniq_number" values in the 50,000 to 100,000 range than in the 0 to 50,000 and 100,000 to 200,000 ranges) [one Business Unit having very many more products than the other Business Units ?, to take on of the examples I provided], Oracle might need to take advantage of a histogram which shows "skew".
So, I will concede that possibility.