30 August, 2009

Histograms on "larger" columns


UPDATE :  These figures (the 254 distinct values, the 32 characters/bytes limit etc) were from tests on Release 10.2.
Release 12.1 has relaxed some of these limits and has new Histogram types.

We rely on Oracle's collection of Histograms to identify data skew. However, if we give it some thought, it should be fairly obvious that generating such histograms is a resource-intensive task. The kernel has to identify all the distinct values for a column and count the number of occurrences of each such value.
If there are very many distinct values (more than 254), Oracle actually creates what it calls a Height Balanced Histogram instead of a Frequency Histogram. (If the DBA explicitly specifies a number of buckets for column statistics and Oracle determines that the distinct count exceeds that number, then, too, Oracle creates a Height Balanced Histogram.)
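To illustrate the difference between the two types (this is only a conceptual sketch in Python, not Oracle's actual implementation) : a Frequency Histogram carries a cumulative row count for every distinct value, while a Height Balanced Histogram simply records the value found at each of N equally-spaced positions in the sorted data.

```python
from collections import Counter

def frequency_histogram(values):
    """One bucket per distinct value; each endpoint carries the
    cumulative row count (as in ENDPOINT_NUMBER)."""
    endpoints, running = [], 0
    for value, count in sorted(Counter(values).items()):
        running += count
        endpoints.append((value, running))
    return endpoints

def height_balanced_histogram(values, buckets):
    """Record the value found at each of `buckets` equally-spaced
    positions in the sorted data."""
    data = sorted(values)
    n = len(data)
    return [data[(i * n) // buckets - 1] for i in range(1, buckets + 1)]

rows = [1] * 5 + [2] * 3 + [3] * 2
print(frequency_histogram(rows))           # [(1, 5), (2, 8), (3, 10)]
print(height_balanced_histogram(rows, 5))  # [1, 1, 2, 2, 3]
```

Note how the frequency version must visit and count every distinct value, while the height-balanced version only needs the sorted data : either way, the sort dominates the cost, which is where the column's width matters.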

Thus, it is fairly well understood that two factors determine the "effort" required to create a Histogram :
a. The number of distinct values
b. The number of rows in the table or sample

However, we sometimes forget that how "large" the column is -- in terms of how many characters (or bytes) are present in each occurrence [row] -- is also important. It requires more time and effort to compute the number of distinct values for a column with a length of 36 characters than it does for, say, a column with a length of only 6 characters. Sorting 36-character elements requires more memory/temp space and CPU (and I/O) than sorting 6-character elements, as does counting the number of occurrences of each distinct value. (There may well be "word-size" limitations in the algorithms used as well !)

Therefore, when it comes to deriving a Histogram on such "large" columns, Oracle has to draw the line somewhere.

If the first 6 characters (actually, bytes !) of a column are the same in every row, Oracle cannot usefully represent the column values in ENDPOINT_VALUE in the %_TAB_HISTOGRAMS views. The values are placed in ENDPOINT_ACTUAL_VALUE instead !

However, if the first 32 characters (actually, bytes !) are the same in every row, Oracle doesn't compare the rest of the characters at all. Effectively, no Histogram is computed for a column where the first 32 characters (bytes) are the same.


Here I conduct a small test with columns whose values have 'N' leading characters in common.

I first create a 10,000 row table with 4 columns and 10 distinct values in each column (with exactly 1,000 rows for each value) such that :

a. In Column A_SMALLER_COLUMN, only the first 2 characters are always the same and the third character varies
b. In Column B_LARGER_COLUMN, the first 6 characters are always the same
c. In Column C_GREATEST_COLUMN, the first 32 characters are always the same
d. In Column FIRST_FIVE_SAME, the first 5 characters are always the same


SQL> create table TEST_HSTGRM
2  (
3  a_smaller_column  varchar2(5),
4  b_larger_column varchar2(20),
5  c_greatest_column varchar2(60),
6  first_five_same varchar2(8)
7  );

Table created.

SQL>
SQL> declare
2  num_rows number;
3
4  begin
5  for num_rows in 1..10000
6  loop
7   insert into TEST_HSTGRM
8   values (
9           'A_'||mod(num_rows,10)||'_X',
10           'Large_'||mod(num_rows,10)||'_XY',
11           'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_'||mod(num_rows,10)||'_XYZ',
12           'Five_'||mod(num_rows,10));
13  end loop;
14  end;
15  /

PL/SQL procedure successfully completed.

SQL>
SQL>
SQL> exec dbms_stats.gather_table_stats('','TEST_HSTGRM',estimate_percent=>100,-
> method_opt=>'FOR ALL COLUMNS SIZE 254');

PL/SQL procedure successfully completed. 



I next check what column statistics Oracle has collected :
SQL> col column_name format a18
SQL> col endpoint_value format 9999999999999999999999999999999999999
SQL> col endpoint_actual_value format a30
SQL>
SQL> select column_name, sample_size, num_distinct, num_buckets, histogram,
2  low_value, high_value
3  from user_tab_columns
4  where table_name = 'TEST_HSTGRM';

COLUMN_NAME        SAMPLE_SIZE NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------------ ----------- ------------ ----------- ---------------
LOW_VALUE
----------------------------------------------------------------
HIGH_VALUE
----------------------------------------------------------------
A_SMALLER_COLUMN         10000           10          10 FREQUENCY
415F305F58
415F395F58

B_LARGER_COLUMN          10000           10          10 FREQUENCY
4C617267655F305F5859
4C617267655F395F5859

C_GREATEST_COLUMN        10000            1           1 FREQUENCY
433132335F4142434445464748494A4B4C4D4E4F505152535455565758595A5F
433132335F4142434445464748494A4B4C4D4E4F505152535455565758595A5F

FIRST_FIVE_SAME          10000           10          10 FREQUENCY
466976655F30
466976655F39


SQL>
SQL> select column_name, count(*)
2  from user_tab_histograms
3  where table_name = 'TEST_HSTGRM'
4  group by column_name order by 1;

COLUMN_NAME          COUNT(*)
------------------ ----------
A_SMALLER_COLUMN           10
B_LARGER_COLUMN            10
C_GREATEST_COLUMN           1
FIRST_FIVE_SAME            10

SQL> 


Notice how Oracle could not (did not) compute a Histogram for C_GREATEST_COLUMN. With the first 32 characters being the same, Oracle just says that there is "1 distinct value". The LOW_VALUE and HIGH_VALUE are the same.
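Incidentally, LOW_VALUE and HIGH_VALUE hold the raw column bytes in hex, truncated to the first 32 bytes, so they can be decoded back to text. A quick Python check of the values above :

```python
def decode_raw(hex_value):
    # LOW_VALUE / HIGH_VALUE hold the raw column bytes in hex,
    # truncated to the first 32 bytes
    return bytes.fromhex(hex_value).decode('ascii')

print(decode_raw('415F305F58'))  # A_0_X  (LOW_VALUE of A_SMALLER_COLUMN)
print(decode_raw('433132335F4142434445464748494A4B'
                 '4C4D4E4F505152535455565758595A5F'))
# C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_  -- only the first 32 bytes survive,
# which is why LOW_VALUE and HIGH_VALUE look identical for C_GREATEST_COLUMN
```

The C_GREATEST_COLUMN values are actually 37 characters long, but the stored LOW_VALUE/HIGH_VALUE stop at byte 32, just before the differing character.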

Next, I look at the actual Histograms on these columns :
SQL> set linesize 132
SQL> select column_name, endpoint_value, endpoint_actual_value, endpoint_number
2  from user_tab_histograms
3  where table_name = 'TEST_HSTGRM' order by 1,3,2;

COLUMN_NAME                                ENDPOINT_VALUE ENDPOINT_ACTUAL_VALUE          ENDPOINT_NUMBER
------------------ -------------------------------------- ------------------------------ ---------------
A_SMALLER_COLUMN     339429957176373000000000000000000000                                           1000
A_SMALLER_COLUMN     339430036404535000000000000000000000                                           2000
A_SMALLER_COLUMN     339430115632698000000000000000000000                                           3000
A_SMALLER_COLUMN     339430194860860000000000000000000000                                           4000
A_SMALLER_COLUMN     339430274089023000000000000000000000                                           5000
A_SMALLER_COLUMN     339430353317185000000000000000000000                                           6000
A_SMALLER_COLUMN     339430432545348000000000000000000000                                           7000
A_SMALLER_COLUMN     339430511773510000000000000000000000                                           8000
A_SMALLER_COLUMN     339430591001673000000000000000000000                                           9000
A_SMALLER_COLUMN     339430670229835000000000000000000000                                          10000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_0_XY                                1000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_1_XY                                2000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_2_XY                                3000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_3_XY                                4000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_4_XY                                5000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_5_XY                                6000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_6_XY                                7000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_7_XY                                8000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_8_XY                                9000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_9_XY                               10000
C_GREATEST_COLUMN    348881704899430000000000000000000000                                          10000
FIRST_FIVE_SAME      365599813402059000000000000000000000                                           1000
FIRST_FIVE_SAME      365599813402063000000000000000000000                                           2000
FIRST_FIVE_SAME      365599813402068000000000000000000000                                           3000
FIRST_FIVE_SAME      365599813402073000000000000000000000                                           4000
FIRST_FIVE_SAME      365599813402078000000000000000000000                                           5000
FIRST_FIVE_SAME      365599813402082000000000000000000000                                           6000
FIRST_FIVE_SAME      365599813402087000000000000000000000                                           7000
FIRST_FIVE_SAME      365599813402092000000000000000000000                                           8000
FIRST_FIVE_SAME      365599813402096000000000000000000000                                           9000
FIRST_FIVE_SAME      365599813402101000000000000000000000                                          10000

31 rows selected.

SQL> 


As we've already seen, there is no real Histogram for C_GREATEST_COLUMN. There is a single entry representing 10,000 rows.
However, for column B_LARGER_COLUMN (where the first 6 characters are the same in every row), the actual value for each Histogram endpoint is visible in the ENDPOINT_ACTUAL_VALUE column, while ENDPOINT_VALUE is the same for every endpoint.
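The ENDPOINT_VALUE numbers can, in fact, be reproduced. They appear (at least in this release) to be the first 15 bytes of the string treated as a big-endian base-256 integer, rounded to 15 significant decimal digits -- which is why only about the first 6 bytes are significant. A Python sketch of this (inferred) encoding :

```python
def endpoint_value(s, width=15, sig_digits=15):
    # Treat the first `width` bytes of the string, zero-padded on the
    # right, as a big-endian base-256 integer ...
    b = s.encode('ascii')[:width].ljust(width, b'\x00')
    n = int.from_bytes(b, 'big')
    # ... then round (half-up) to `sig_digits` significant decimal digits
    excess = len(str(n)) - sig_digits
    if excess <= 0:
        return n
    scale = 10 ** excess
    return (n + scale // 2) // scale * scale

# Matches ENDPOINT_VALUE for A_SMALLER_COLUMN's first bucket:
print(endpoint_value('A_0_X'))   # 339429957176373000000000000000000000
# Differences from the 7th byte onwards fall below the rounding
# resolution, so all B_LARGER_COLUMN values collapse to one number:
print(endpoint_value('Large_0_XY') == endpoint_value('Large_9_XY'))  # True
```

This also explains FIRST_FIVE_SAME : its values differ in the 6th byte, which is just barely within the 15 significant digits, so its ten ENDPOINT_VALUEs differ only in the last few digits of the mantissa.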

Finally, I look at how Oracle uses the Histograms to come up with the Cardinality estimates :
SQL> explain plan for
 2  select a_smaller_column from TEST_HSTGRM where a_smaller_column = 'A_3_X';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT                           
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  1000 |  6000 |    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM |  1000 |  6000 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------
  
Predicate Information (identified by operation id):
---------------------------------------------------
                                                                     
  1 - filter("A_SMALLER_COLUMN"='A_3_X')

13 rows selected.

SQL> select count(*) from TEST_HSTGRM where a_smaller_column = 'A_3_X';

 COUNT(*)                                                                                         
----------                
     1000

SQL> 
SQL> explain plan for
 2  select b_larger_column from TEST_HSTGRM where b_larger_column = 'Large_3_XY';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  1000 | 11000 |    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM |  1000 | 11000 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  1 - filter("B_LARGER_COLUMN"='Large_3_XY')

13 rows selected.

SQL> select count(*) from TEST_HSTGRM where b_larger_column = 'Large_3_XY';

 COUNT(*)
----------
     1000

SQL>
SQL> explain plan for
 2  select c_greatest_column from TEST_HSTGRM
 3  where c_greatest_column = 'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_3_XYZ';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             | 10000 |   371K|    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM | 10000 |   371K|    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  1 - filter("C_GREATEST_COLUMN"='C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_3_XYZ'
             )

14 rows selected.

SQL> select count(*) from TEST_HSTGRM
 2  where c_greatest_column = 'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_3_XYZ';

 COUNT(*)
----------
     1000

SQL>
SQL> explain plan for
 2  select first_five_same from TEST_HSTGRM where first_five_same = 'Five_3';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  1000 |  7000 |    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM |  1000 |  7000 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

  1 - filter("FIRST_FIVE_SAME"='Five_3')

13 rows selected.

SQL> select count(*) from TEST_HSTGRM where first_five_same = 'Five_3';

 COUNT(*)
----------
     1000

SQL> 


For Column A_SMALLER_COLUMN, the Cardinality Estimate of 1,000 rows is accurate.
For Column B_LARGER_COLUMN, too, the Cardinality Estimate is accurate (thus, we should look at ENDPOINT_ACTUAL_VALUE rather than ENDPOINT_VALUE in USER_TAB_HISTOGRAMS).
For Column FIRST_FIVE_SAME, we see results similar to A_SMALLER_COLUMN.

However, for column C_GREATEST_COLUMN, in the absence of a Histogram (which did not get computed), Oracle presents a Cardinality Estimate of all 10,000 rows -- a selectivity of 1 ! For this column, LOW_VALUE and HIGH_VALUE were the same and NUM_DISTINCT was 1, so every equality predicate is estimated to return every row.
For this column, Oracle does not use DENSITY, even for an Outlier value :
SQL> l
  1  explain plan for
  2  select c_greatest_column from TEST_HSTGRM
  3* where c_greatest_column = 'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_z_XYZ'
SQL> /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             | 10000 |   371K|    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM | 10000 |   371K|    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C_GREATEST_COLUMN"='C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_z_XYZ'
              )

14 rows selected.

SQL>  
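The optimizer's arithmetic here can be summarised with a very simplified model (the real costing is more involved ; the function below is only illustrative) : with a Frequency Histogram the estimate comes from the matching bucket's row count, and without a usable Histogram Oracle falls back on NUM_ROWS / NUM_DISTINCT -- which, for C_GREATEST_COLUMN, is 10000 / 1 = 10000 rows for any value, outlier or not.

```python
def estimated_cardinality(num_rows, num_distinct, freq_hist=None, value=None):
    """Simplified model of an equality-predicate estimate: a frequency
    histogram gives per-value row counts; without one, assume a
    uniform distribution (num_rows / num_distinct)."""
    if freq_hist is not None and value in freq_hist:
        return freq_hist[value]
    return num_rows // num_distinct

# A_SMALLER_COLUMN: histogram bucket -> accurate 1,000-row estimate
print(estimated_cardinality(10000, 10, {'A_3_X': 1000}, 'A_3_X'))  # 1000
# C_GREATEST_COLUMN: no usable histogram, NUM_DISTINCT = 1 -> all 10,000 rows
print(estimated_cardinality(10000, 1))  # 10000
```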


Therefore, when you have columns with the first 32 characters/bytes being the same, do not expect Oracle to present any degree of accuracy in the Cardinality estimate. You will have to rely on other columns in the query.


3 comments:

Randolf said...

Hemant,

nice demonstration. For some similar histogram limitations, see my corresponding blog post about issues with high precision numbers.

Regards,
Randolf

Hemant K Chitale said...

Yes, I've just seen your post.

Maybe we should re-run our tests in 11.2 in a few months.

Rahul said...

Hi,

Nice and interesting article. Thanks a lot.

Thanks and regards,
Rahul Jain