28 February, 2009

CLUSTERING_FACTOR

I had written a post about CLUSTERING_FACTOR more than a year ago.

I am re-visiting the issue here with a Test Case. I've written the SQL script with REM comments inline. Therefore, the output below is self-explanatory :


SQL>
SQL> set arraysize 5000
SQL> set pages600
SQL> set linesize132
SQL>
SQL> set feedback off
SQL>
SQL> set SQLPrompt ''
set SQLContinue ''

REM I wil be creating my test table from this table of 1.368million rows
REM This is a copy of DBA_OBJECTS multiplied many times
select num_rows, sample_size, blocks from user_tables where table_name = 'SOURCE_TABLE';

NUM_ROWS SAMPLE_SIZE BLOCKS
---------- ----------- ----------
1368063 1368063 18967


REM This is my Test Table to review CLUSTERING_FACTOR
drop table MY_TEST_TABLE purge ;
create table MY_TEST_TABLE as select * from SOURCE_TABLE where 1=2;
alter table MY_TEST_TABLE modify (OWNER not null, OBJECT_NAME not null, OBJECT_ID not null);
alter table MY_TEST_TABLE nologging;

create index MY_TEST_TABLE_NDX_1 on MY_TEST_TABLE (OWNER, OBJECT_NAME) ;


REM REM REM ################### Setup Table with first dataset

insert into MY_TEST_TABLE select * from SOURCE_TABLE ;
commit;
-- rem need to do a FullTableScan to effect delayed block cleanout
select count(*) from MY_TEST_TABLE dc;

COUNT(*)
----------
1368063

exec dbms_stats.gather_table_stats(user,'MY_TEST_TABLE',estimate_percent=>100,method_opt=>'FOR COLUMNS OWNER, OBJECT_NAME SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);

select table_name, num_rows, blocks from user_tables order by 1;

TABLE_NAME NUM_ROWS BLOCKS
------------------------------ ---------- ----------
MY_TEST_TABLE 1368063 19277
SOURCE_TABLE 1368063 18967
select index_name, num_rows, distinct_keys, leaf_blocks, clustering_factor from user_indexes order by 1;

INDEX_NAME NUM_ROWS DISTINCT_KEYS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ------------- ----------- -----------------
MY_TEST_TABLE_NDX_1 1368063 49232 9778 1343029
REM Notice the very high CLUSTERING_FACTOR -- almost as high as the number of rows !

select count(*) from source_table where owner = 'SYSTEM';

COUNT(*)
----------
12258
select count(*) from my_test_table where owner = 'SYSTEM';

COUNT(*)
----------
12258
REM The number of rows with "owner='SYSTEM'" is actually less than 1% of the total row count !
REM ie, 12,258 out of 1,368,063


REM Let's see the Execution Plan that Oracle generates for the query
set autotrace traceonly
select owner, object_name, object_type, created from MY_TEST_TABLE where owner = 'SYSTEM';

Execution Plan
----------------------------------------------------------
Plan hash value: 965484217

-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12258 | 610K| 3527 (2)| 00:00:43 |
|* 1 | TABLE ACCESS FULL| MY_TEST_TABLE | 12258 | 610K| 3527 (2)| 00:00:43 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("OWNER"='SYSTEM')


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
18817 consistent gets
5373 physical reads
0 redo size
362774 bytes sent via SQL*Net to client
514 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
12258 rows processed

set autotrace off

REM That was 18,817 consistent gets for 12,258 rows, using a FullTableScan
REM some of the consistent gets were for the parse
REM (and some of the 'consistent gets' were re-reads of the same blocks because of the
REM arraysize of 5,000 being less than the number of rows, resulting in 3 additional SQL*Net round trips)
REM Surprised that Oracle prefers a FullTableScan ?
REM That is preferred because of the high CLUSTERING_FACTOR
REM which indicates to Oracle that it will have to keep skipping to a different Table Block
REM for each next ROWID fetched from the index.
REM This means that Oracle estimates it will have to make very many SingleBlockReadCalls
REM to read the rows from the Table

REM Let's see the results if I force the Index
set autotrace traceonly
select /*+ INDEX (M MY_TEST_TABLE_NDX_1) */ owner, object_name, object_type, created from MY_TEST_TABLE M where owner = 'SYSTEM';

Execution Plan
----------------------------------------------------------
Plan hash value: 519844067

---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12258 | 610K| 12138 (1)| 00:02:26 |
| 1 | TABLE ACCESS BY INDEX ROWID| MY_TEST_TABLE | 12258 | 610K| 12138 (1)| 00:02:26 |
|* 2 | INDEX RANGE SCAN | MY_TEST_TABLE_NDX_1 | 12258 | | 90 (0)| 00:00:02 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("OWNER"='SYSTEM')


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
10638 consistent gets
0 physical reads
0 redo size
109228 bytes sent via SQL*Net to client
514 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
12258 rows processed

set autotrace off

REM That was 10,638 consistent gets for 12,258 rows, using an IndexRangeScan
REM Oracle had preferred the FullTableScan (18,817 consistent gets) because
REM that would be achieved using MultiBlockReadCalls
REM While the retrieval using the Index would require SingleBlockReadCalls


REM REM REM ############### DELETE approximately 20% (282,528) of the rows in the table

connect hemant/hemant
Connected.
delete MY_TEST_TABLE where object_id < 11000;
commit;
-- rem need to do a FullTableScan to effect delayed block cleanout
select /*+ FULL (dc) */ count(*) from MY_TEST_TABLE dc;

COUNT(*)
----------
1085535

exec dbms_stats.gather_table_stats(user,'MY_TEST_TABLE',estimate_percent=>100,method_opt=>'FOR COLUMNS OWNER, OBJECT_NAME SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);

select table_name, num_rows, blocks from user_tables order by 1;

TABLE_NAME NUM_ROWS BLOCKS
------------------------------ ---------- ----------
MY_TEST_TABLE 1085535 19277
SOURCE_TABLE 1368063 18967
select index_name, num_rows, distinct_keys, leaf_blocks, clustering_factor from user_indexes order by 1;

INDEX_NAME NUM_ROWS DISTINCT_KEYS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ------------- ----------- -----------------
MY_TEST_TABLE_NDX_1 1085535 39460 8372 1072489
REM I still have a very high CLUSTERING_FACTOR

select count(*) from source_table where owner = 'SYSTEM';

COUNT(*)
----------
12258
select count(*) from my_test_table where owner = 'SYSTEM';

COUNT(*)
----------
135
REM The number of rows with "owner='SYSTEM'" is much lesser now !
REM ie, 135 out of 1,085,535

REM Let's see the Execution Plan that Oracle generates for the query
set autotrace traceonly
select owner, object_name, object_type, created from MY_TEST_TABLE where owner = 'SYSTEM';

Execution Plan
----------------------------------------------------------
Plan hash value: 519844067

---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 135 | 7020 | 138 (0)| 00:00:02 |
| 1 | TABLE ACCESS BY INDEX ROWID| MY_TEST_TABLE | 135 | 7020 | 138 (0)| 00:00:02 |
|* 2 | INDEX RANGE SCAN | MY_TEST_TABLE_NDX_1 | 135 | | 4 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("OWNER"='SYSTEM')


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
214 consistent gets
0 physical reads
0 redo size
1559 bytes sent via SQL*Net to client
492 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
135 rows processed

set autotrace off

REM That was 214 consistent gets for 135 rows, using an IndexRangeScan
REM Aah ! Now, Oracle prefers the IndexRangeScan because it estimates
REM much lesser SingleBlockReadCalls to read the only 135 rows from the table.
REM This inspite of the high CLUSTERING_FACTOR
REM Oracle accounts for both expected Cardinality and Clustering when selecting an execution plan.


REM REM REM ############### TRUNCATE and Re-INSERT rows

connect hemant/hemant
Connected.
truncate table MY_TEST_TABLE reuse storage;
insert into MY_TEST_TABLE select * from SOURCE_TABLE ;
commit;
-- rem need to do a FullTableScan to effect delayed block cleanout
select /*+ FULL (dc) */ count(*) from MY_TEST_TABLE dc;

COUNT(*)
----------
1368063

exec dbms_stats.gather_table_stats(user,'MY_TEST_TABLE',estimate_percent=>100,method_opt=>'FOR COLUMNS OWNER, OBJECT_NAME SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);

select table_name, num_rows, blocks from user_tables order by 1;

TABLE_NAME NUM_ROWS BLOCKS
------------------------------ ---------- ----------
MY_TEST_TABLE 1368063 18765
SOURCE_TABLE 1368063 18967
select index_name, num_rows, distinct_keys, leaf_blocks, clustering_factor from user_indexes order by 1;

INDEX_NAME NUM_ROWS DISTINCT_KEYS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ------------- ----------- -----------------
MY_TEST_TABLE_NDX_1 1368063 49232 9778 1343029

select count(*) from source_table where owner = 'SYSTEM';

COUNT(*)
----------
12258

set autotrace traceonly
select owner, object_name, object_type, created from MY_TEST_TABLE where owner = 'SYSTEM';

Execution Plan
----------------------------------------------------------
Plan hash value: 965484217

-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12258 | 610K| 3434 (2)| 00:00:42 |
|* 1 | TABLE ACCESS FULL| MY_TEST_TABLE | 12258 | 610K| 3434 (2)| 00:00:42 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("OWNER"='SYSTEM')


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
18779 consistent gets
3797 physical reads
0 redo size
362796 bytes sent via SQL*Net to client
514 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
12258 rows processed

set autotrace off


REM REM REM ##########################
REM What can I do to improve the CLUSTERING_FACTOR ??
REM REM REM ############### Reorder the rows !!!

connect hemant/hemant
Connected.
drop table TEMP_TBL;
drop table TEMP_TBL
*
ERROR at line 1:
ORA-00942: table or view does not exist


create table TEMP_TBL as select * from MY_TEST_TABLE where 1=2;
alter table TEMP_TBL nologging;
insert /*+ APPEND */ into TEMP_TBL select * from MY_TEST_TABLE order by OWNER, OBJECT_NAME;
commit;
truncate table MY_TEST_TABLE;
insert into MY_TEST_TABLE select * from TEMP_TBL order by OWNER, OBJECT_NAME;
commit;
-- rem need to do a FullTableScan to effect delayed block cleanout
select /*+ FULL (dc) */ count(*) from MY_TEST_TABLE dc;

COUNT(*)
----------
1368063

exec dbms_stats.gather_table_stats(user,'MY_TEST_TABLE',estimate_percent=>100,method_opt=>'FOR COLUMNS OWNER, OBJECT_NAME SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);

select table_name, num_rows, blocks from user_tables order by 1;

TABLE_NAME NUM_ROWS BLOCKS
------------------------------ ---------- ----------
MY_TEST_TABLE 1368063 19277
SOURCE_TABLE 1368063 18967
TEMP_TBL
select index_name, num_rows, distinct_keys, leaf_blocks, clustering_factor from user_indexes order by 1;

INDEX_NAME NUM_ROWS DISTINCT_KEYS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------------ ---------- ------------- ----------- -----------------
MY_TEST_TABLE_NDX_1 1368063 49232 7186 22498
REM Note how the CLUSTERING_FACTOR is now down to 22,498 from the 1,343,029 on the first run !

select count(*) from source_table where owner = 'SYSTEM';

COUNT(*)
----------
12258
select count(*) from MY_TEST_TABLE where owner = 'SYSTEM';

COUNT(*)
----------
12258
REM The number of rows with "owner='SYSTEM'" is the same in as in the first test.
REM ie, 12,258 out of 1,368,063

set autotrace traceonly
select owner, object_name, object_type, created from MY_TEST_TABLE where owner = 'SYSTEM';

Execution Plan
----------------------------------------------------------
Plan hash value: 519844067

---------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12258 | 610K| 270 (1)| 00:00:04 |
| 1 | TABLE ACCESS BY INDEX ROWID| MY_TEST_TABLE | 12258 | 610K| 270 (1)| 00:00:04 |
|* 2 | INDEX RANGE SCAN | MY_TEST_TABLE_NDX_1 | 12258 | | 67 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("OWNER"='SYSTEM')


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
273 consistent gets
0 physical reads
0 redo size
99453 bytes sent via SQL*Net to client
514 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
12258 rows processed

set autotrace off

REM That was only 273 consistent gets for 12,235 rows, using an IndexRangeScan
REM With the table's rows *physically re-ordered*
REM Oracle now had to do only 273 block gets
REM when the forced IndexRangeScan in the badly clustered table resulted in 10,638 block gets !

REM The *same* data, after being re-ordered
REM is down from 106,38 block gets to 273 block gets
REM with the *same* Execution Plan !!

drop table TEMP_TBL;

spool off

3 comments:

Unknown said...

Hi Hemant, I read your post and the interesting proposed tests. Some time ago
I examined the subject, I believe that clustering factor is a fundamental importance
for access type range scan. If I remember correctly, this factor influences
heavily Cost formula when the use of the index case
mode range scan. So I think that if you investigate as a table
is accessed and it can verify that most of the accesses
are using a certain index range scan, then
the sorting table data is a factor to be taken seriously.
I wrote not only for this reason, but mainly to ask you a question that has nothing
to do with clustering factor, but with the testing so far.
If I remember correctly I read somewhere that when you use the package
dbms_stats on a table that also performs cleanout of the blocks,
is it correct?
so in your example the following SQL statement
- Rem FullTableScan need to do to effect a delayed block cleanout
select count (*) from MY_TEST_TABLE dc;

followed

dbms_stats.gather_table_stats exec (user, 'MY_TEST_TABLE', estimate_percent => 100, method_opt => 'FOR COLUMNS OWNER, OBJECT_NAME SIZE 250', cascade => TRUE, no_invalidate => FALSE);

it can be replaced with the only run of the package dbms_stats

dbms_stats.gather_table_stats exec (user, 'MY_TEST_TABLE', estimate_percent => 100, method_opt => 'FOR COLUMNS OWNER, OBJECT_NAME SIZE 250', cascade => TRUE, no_invalidate => FALSE);

is it correct?

Thanks a lot

Donatello Settembrino

Hemant K Chitale said...

Reordering the data to improve the clustering factor makes sense only if
a. There is only 1 index on the table
OR
b. All the important queries rely on the one particular index and you can afford degradation in the performance of other queries relying on other indexes.


I explicitly put in a count(*) so as to avoid delayed block clenaout on large tables in my tests.
The gather_stats does the same thing as well, as you've pointed out.
See http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:44798632736844

Unknown said...

Hemant,

"Reordering the data to improve the clustering factor makes sense only if
a. There is only 1 index on the table"

I think his statement is correct but not complete
I think it is more correct to say

a.There is only 1 index on the table, or if the first fields
that make up the key, of two or more indices coincide. Can improve
clustering factor even when it has been defined indexes
a table whose data were included in a full order
different from the keys of the index, where then the clustering factor
is high for all indexes.

b. I think it is identical to
I said in my previous posting

"So I think that if you investigate as a table
is accessed and it can verify that most of the accesses
are using a certain index range scan, then
the sorting table data is a factor to be taken seriously"

which would imply that if such access more (eg 95% of cases)
are all attributable to the use of a single index I
can accept the performance degradation of the second index (5% of cases)
... but that obviously are chosen :)

best regards

Donatello Settembrino