UPDATE: These figures (the 254-distinct-values limit, the 32-character/byte limit etc.) were from tests on Release 10.2.
Release 12.1 has relaxed some of the limits and has new Histogram types.
We rely on Oracle's collection of Histograms to denote data skew. However, if we give it some thought, it should be fairly obvious that generating such histograms is a resource-intensive task. The kernel has to identify all the distinct values for a column and, for each distinct value, count the number of occurrences.
If there are very many distinct values (more than 254) Oracle actually creates what it calls a Height Balanced Histogram, instead of a Frequency Histogram. (If the DBA explicitly specifies a number of buckets for column statistics and Oracle determines that the distinct count is more than such number, then, too, Oracle creates a Height Balanced Histogram).
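The choice between the two histogram types can be sketched as follows. This is a simplified illustrative model (the function name and the default of 254 buckets are taken from the description above, not from Oracle's actual code):

```python
def histogram_type(num_distinct, requested_buckets=254):
    """Simplified model of the 10.2 behaviour described above:
    a Frequency Histogram needs one bucket per distinct value, so
    more distinct values than available buckets forces Oracle to
    fall back to a Height Balanced Histogram."""
    if num_distinct <= requested_buckets:
        return "FREQUENCY"
    return "HEIGHT BALANCED"
```

For example, a column with 10 distinct values gets a Frequency Histogram, while one with 1,000 distinct values (more than 254 buckets) gets a Height Balanced Histogram.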
Thus, it is fairly well understood that
a. The number of distinct values
and
b. The number of rows in the table or sample
would determine the "effort" required to create a Histogram.
However, we sometimes forget that how "large" the column is -- in terms of how many characters (or bytes) are present in each occurrence [row] -- is also important. It requires more time and effort to compute the number of distinct values for a column with a length of 36 characters than it does for, say, a column with a length of only 6 characters. Sorting 36-character elements requires more memory/temp space and CPU (and I/O) than sorting 6-character elements, as does counting the number of occurrences of each distinct value. (There may well be "word-size" limitations in the algorithms used as well !)
Therefore, when it comes to deriving a Histogram on such "large" columns, Oracle has to draw the line somewhere.
If the first 6 characters (actually, bytes !) of a column's values are the same, ENDPOINT_VALUE in the %_TAB_HISTOGRAMS views can no longer distinguish the values, so Oracle places the actual values in ENDPOINT_ACTUAL_VALUE !
However, if the first 32 characters (actually, bytes !) are the same, Oracle doesn't compare the remaining characters at all. Effectively, no Histogram is computed for a column where the first 32 characters (bytes) of every value are the same.
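The reason only the leading bytes matter is visible in how ENDPOINT_VALUE is derived. A commonly documented approximation (popularised by Jonathan Lewis, and an approximation only, not Oracle internals) is to treat the first 15 bytes of the string as a big-endian base-256 number and then keep only 15 significant decimal digits, which means only about the first 6 bytes of the string survive:

```python
def endpoint_value(s):
    """Approximate the numeric ENDPOINT_VALUE Oracle stores for a
    VARCHAR2 value: take the first 15 bytes as a big-endian base-256
    integer, then round to 15 significant decimal digits (so only
    roughly the first 6 bytes of the string are distinguishable)."""
    raw = int.from_bytes(s.encode("ascii")[:15].ljust(15, b"\x00"), "big")
    if raw == 0:
        return 0
    digits = len(str(raw))
    return round(raw, 15 - digits)

# Values that differ only after the 6th byte collapse to one
# ENDPOINT_VALUE, e.g.:
# endpoint_value('Large_3_XY') == endpoint_value('Large_9_XY')
```

With this model, 'A_0_X' yields 339429957176373 followed by 21 zeros, matching the ENDPOINT_VALUE seen in the test below, while all ten 'Large_N_XY' values collapse to a single ENDPOINT_VALUE.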
Here I conduct a small test with columns whose values have 'N' leading characters in common.
I first create a 10,000 row table with 4 columns and 10 distinct values in each column (with exactly 1,000 rows for each value) such that :
a. In Column A_SMALLER_COLUMN, only the first 2 characters are always the same and the third character varies
b. In Column B_LARGER_COLUMN, the first 6 characters are always the same
c. In Column C_GREATEST_COLUMN, the first 32 characters are always the same
d. In Column FIRST_FIVE_SAME, the first 5 characters are always the same
SQL> create table TEST_HSTGRM
  2  (
  3  a_smaller_column varchar2(5),
  4  b_larger_column varchar2(20),
  5  c_greatest_column varchar2(60),
  6  first_five_same varchar2(8)
  7  );

Table created.

SQL>
SQL> declare
  2  num_rows number;
  3
  4  begin
  5  for num_rows in 1..10000
  6  loop
  7  insert into TEST_HSTGRM
  8  values (
  9  'A_'||mod(num_rows,10)||'_X',
 10  'Large_'||mod(num_rows,10)||'_XY',
 11  'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_'||mod(num_rows,10)||'_XYZ',
 12  'Five_'||mod(num_rows,10));
 13  end loop;
 14  end;
 15  /

PL/SQL procedure successfully completed.

SQL>
SQL> exec dbms_stats.gather_table_stats('','TEST_HSTGRM',estimate_percent=>100,-
> method_opt=>'FOR ALL COLUMNS SIZE 254');

PL/SQL procedure successfully completed.
I next see what column statistics Oracle has collected :
SQL> col column_name format a18
SQL> col endpoint_value format 9999999999999999999999999999999999999
SQL> col endpoint_actual_value format a30
SQL>
SQL> select column_name, sample_size, num_distinct, num_buckets, histogram,
  2  low_value, high_value
  3  from user_tab_columns
  4  where table_name = 'TEST_HSTGRM';

COLUMN_NAME        SAMPLE_SIZE NUM_DISTINCT NUM_BUCKETS HISTOGRAM
------------------ ----------- ------------ ----------- ---------------
LOW_VALUE
----------------------------------------------------------------
HIGH_VALUE
----------------------------------------------------------------
A_SMALLER_COLUMN         10000           10          10 FREQUENCY
415F305F58
415F395F58

B_LARGER_COLUMN          10000           10          10 FREQUENCY
4C617267655F305F5859
4C617267655F395F5859

C_GREATEST_COLUMN        10000            1           1 FREQUENCY
433132335F4142434445464748494A4B4C4D4E4F505152535455565758595A5F
433132335F4142434445464748494A4B4C4D4E4F505152535455565758595A5F

FIRST_FIVE_SAME          10000           10          10 FREQUENCY
466976655F30
466976655F39

SQL>
SQL> select column_name, count(*)
  2  from user_tab_histograms
  3  where table_name = 'TEST_HSTGRM'
  4  group by column_name order by 1;

COLUMN_NAME          COUNT(*)
------------------ ----------
A_SMALLER_COLUMN           10
B_LARGER_COLUMN            10
C_GREATEST_COLUMN           1
FIRST_FIVE_SAME            10

SQL>
Notice how Oracle could not (did not) compute a Histogram for C_GREATEST_COLUMN. With the first 32 characters being the same, Oracle just says that there is "1 distinct value". The LOW_VALUE and HIGH_VALUE are the same.
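The LOW_VALUE and HIGH_VALUE columns above are raw byte dumps of the stored values; for plain ASCII strings they can be decoded back directly. A small helper (the function name is mine, for illustration) shows that the values reported above really are the table's data:

```python
def decode_raw(hex_value):
    """Decode the RAW hex dump that USER_TAB_COLUMNS reports in
    LOW_VALUE / HIGH_VALUE back into the original ASCII string."""
    return bytes.fromhex(hex_value).decode("ascii")

# decode_raw('415F305F58') -> 'A_0_X'
# decode_raw('466976655F39') -> 'Five_9'
```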
Next, I look at the actual Histograms on these columns :
SQL> set linesize132
SQL> select column_name, endpoint_value, endpoint_actual_value, endpoint_number
  2  from user_tab_histograms
  3  where table_name = 'TEST_HSTGRM' order by 1,3,2;

COLUMN_NAME        ENDPOINT_VALUE                         ENDPOINT_ACTUAL_VALUE          ENDPOINT_NUMBER
------------------ -------------------------------------- ------------------------------ ---------------
A_SMALLER_COLUMN     339429957176373000000000000000000000                                           1000
A_SMALLER_COLUMN     339430036404535000000000000000000000                                           2000
A_SMALLER_COLUMN     339430115632698000000000000000000000                                           3000
A_SMALLER_COLUMN     339430194860860000000000000000000000                                           4000
A_SMALLER_COLUMN     339430274089023000000000000000000000                                           5000
A_SMALLER_COLUMN     339430353317185000000000000000000000                                           6000
A_SMALLER_COLUMN     339430432545348000000000000000000000                                           7000
A_SMALLER_COLUMN     339430511773510000000000000000000000                                           8000
A_SMALLER_COLUMN     339430591001673000000000000000000000                                           9000
A_SMALLER_COLUMN     339430670229835000000000000000000000                                          10000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_0_XY                                1000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_1_XY                                2000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_2_XY                                3000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_3_XY                                4000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_4_XY                                5000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_5_XY                                6000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_6_XY                                7000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_7_XY                                8000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_8_XY                                9000
B_LARGER_COLUMN      396591018990235000000000000000000000 Large_9_XY                               10000
C_GREATEST_COLUMN    348881704899430000000000000000000000                                          10000
FIRST_FIVE_SAME      365599813402059000000000000000000000                                           1000
FIRST_FIVE_SAME      365599813402063000000000000000000000                                           2000
FIRST_FIVE_SAME      365599813402068000000000000000000000                                           3000
FIRST_FIVE_SAME      365599813402073000000000000000000000                                           4000
FIRST_FIVE_SAME      365599813402078000000000000000000000                                           5000
FIRST_FIVE_SAME      365599813402082000000000000000000000                                           6000
FIRST_FIVE_SAME      365599813402087000000000000000000000                                           7000
FIRST_FIVE_SAME      365599813402092000000000000000000000                                           8000
FIRST_FIVE_SAME      365599813402096000000000000000000000                                           9000
FIRST_FIVE_SAME      365599813402101000000000000000000000                                          10000

31 rows selected.

SQL>
As we've already seen, there is no real Histogram for C_GREATEST_COLUMN. There is a single entry representing 10,000 rows.
However, for column B_LARGER_COLUMN (where the first 6 characters are the same in every row), the actual value for each Histogram endpoint is visible in the ENDPOINT_ACTUAL_VALUE column, while ENDPOINT_VALUE is the same for every bucket.
Finally, I look at how Oracle uses the Histograms to come up with the Cardinality estimates :
SQL> explain plan for
  2  select a_smaller_column from TEST_HSTGRM where a_smaller_column = 'A_3_X';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  1000 |  6000 |    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM |  1000 |  6000 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("A_SMALLER_COLUMN"='A_3_X')

13 rows selected.

SQL> select count(*) from TEST_HSTGRM where a_smaller_column = 'A_3_X';

  COUNT(*)
----------
      1000

SQL>
SQL> explain plan for
  2  select b_larger_column from TEST_HSTGRM where b_larger_column = 'Large_3_XY';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  1000 | 11000 |    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM |  1000 | 11000 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("B_LARGER_COLUMN"='Large_3_XY')

13 rows selected.
SQL> select count(*) from TEST_HSTGRM where b_larger_column = 'Large_3_XY';

  COUNT(*)
----------
      1000

SQL>
SQL> explain plan for
  2  select c_greatest_column from TEST_HSTGRM
  3  where c_greatest_column = 'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_3_XYZ';

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             | 10000 |   371K|    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM | 10000 |   371K|    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C_GREATEST_COLUMN"='C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_3_XYZ'
              )

14 rows selected.

SQL> select count(*) from TEST_HSTGRM
  2  where c_greatest_column = 'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_3_XYZ';

  COUNT(*)
----------
      1000

SQL>
SQL> explain plan for
  2  select first_five_same from TEST_HSTGRM where first_five_same = 'Five_3';

Explained.
SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             |  1000 |  7000 |    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM |  1000 |  7000 |    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("FIRST_FIVE_SAME"='Five_3')

13 rows selected.

SQL> select count(*) from TEST_HSTGRM where first_five_same = 'Five_3';

  COUNT(*)
----------
      1000

SQL>
For Column A_SMALLER_COLUMN, the Cardinality Estimate of 1,000 rows is accurate.
For Column B_LARGER_COLUMN, too, the Cardinality Estimate is accurate (thus, we should look at ENDPOINT_ACTUAL_VALUE rather than ENDPOINT_VALUE in USER_TAB_HISTOGRAMS).
For Column FIRST_FIVE_SAME, we see similar results as A_SMALLER_COLUMN.
However, for column C_GREATEST_COLUMN, in the absence of a real Histogram (which did not get computed), Oracle presents a Cardinality Estimate of 10,000 (i.e. every row in the table) ! For this column, LOW_VALUE and HIGH_VALUE were the same and NUM_DISTINCT was 1.
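The estimates above follow simple arithmetic. With a Frequency Histogram, ENDPOINT_NUMBER is cumulative, so each value's row count is the difference between consecutive endpoints; with no usable histogram, a standard uniform-distribution model divides the row count by NUM_DISTINCT. A rough sketch (function names are mine; this models the textbook behaviour, not Oracle's exact code):

```python
def frequency_cardinality(cumulative_endpoints, bucket_index):
    """Per-value cardinality from a Frequency Histogram: the
    ENDPOINT_NUMBER column is cumulative, so each bucket's row
    count is its endpoint minus the previous bucket's endpoint."""
    prev = cumulative_endpoints[bucket_index - 1] if bucket_index > 0 else 0
    return cumulative_endpoints[bucket_index] - prev

def no_histogram_cardinality(num_rows, num_distinct):
    """Fallback estimate with no histogram: assume the rows are
    spread uniformly across the distinct values."""
    return num_rows // num_distinct
```

For A_SMALLER_COLUMN the endpoints run 1000, 2000, ... 10000, so every value estimates to 1,000 rows; for C_GREATEST_COLUMN, 10,000 rows divided by 1 "distinct" value gives the 10,000-row estimate seen in the plan.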
For this column, Oracle does not use DENSITY, even for an Outlier value :
SQL> l
  1  explain plan for
  2  select c_greatest_column from TEST_HSTGRM
  3* where c_greatest_column = 'C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_z_XYZ'
SQL> /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 2702196353

---------------------------------------------------------------------------------
| Id  | Operation         | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |             | 10000 |   371K|    27   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| TEST_HSTGRM | 10000 |   371K|    27   (0)| 00:00:01 |
---------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("C_GREATEST_COLUMN"='C123_ABCDEFGHIJKLMNOPQRSTUVWXYZ_z_XYZ'
              )

14 rows selected.

SQL>
Therefore, when you have columns with the first 32 characters/bytes being the same, do not expect Oracle to present any degree of accuracy in the Cardinality estimate. You will have to rely on other columns in the query.
3 comments:
Hemant,
nice demonstration. For some similar histogram limitations, see my corresponding blog post about issues with high precision numbers.
Regards,
Randolf
Yes, I've just seen your post.
Maybe we should re-run our tests in 11.2 in a few months.
Hi,
Nice and interesting article. Thanks a lot.
Thanks and regards,
Rahul Jain