15 April, 2008

Indexed column (unique or not) -- What if it is NULLable

One of the numerous facets of the Optimizer that Jonathan Lewis touches on is the fact that the Optimizer can and does make use of information about the possible presence of NULLs in an indexed column. Where we have a single column index (or even a multi-column index if none of the columns is NOT NULL), the Optimizer has to assume that there may be one or more NULLs present for that column. It doesn't matter if statistics have been gathered very recently (just before a query is to be optimized) ; it doesn't matter if we (the DBA and Analyst) *know* that there are no NULLs in the column.
The Optimizer must assume that a column not defined as NOT NULL may have NULLs.
From that assumption, it must also assume that not every row in the table is included in an index on that column as NULLs are not indexed. {See my other postings [1] and [2] on indexing NULLs}.
Whether a column is NULLable or is explicitly NOT NULL can help Oracle determine if an Index is to be used.

I've built a small case study here. Assume that we have a PARTS table with a Unique Key (but not defined as a Primary Key) on PART_ID. What would be the COST and CARD estimates for queries against PART_ID ? These become more important when a query on PARTS is a subquery in or part of a much larger, complex query.

(note : I've substituted "geq" and "leq" for ">=" and "<=" in the Predicate information section of the explain plan for the Range Scan queries so that the ">" and "<" don't get misinterpreted)

SQL>
SQL> set pages600
SQL> set linesize132
SQL> set SQLPrompt ''

set feedback off

-- create the PARTS table
drop table parts;
create table parts (part_id number, part_name varchar2(128), part_descr varchar2(128)) nologging;
insert /*+ APPEND */ into parts
2 select object_id, object_name'_'rownum, owner'_'object_name'_'object_type
3 from dba_objects where object_id is not null
4 -- deliberately order by part_id to get good clustering ?
5 order by object_id ;
commit;

-- currently the column is NULLable, let's create a Unique Index
create unique index parts_uk on parts(part_id) nologging;

alter system flush buffer_cache;
-- force any delayed block cleanout and load blocks into sga. This isn't important to the Optimizer in 10g.
select /*+ FULL (p) */ count(*) from parts p;

COUNT(*)
----------
51750
select /*+ INDEX (p parts_uk) */ count(*) from parts p where part_id is not null;

COUNT(*)
----------
51750


exec dbms_stats.gather_table_stats('',tabname=>'PARTS',method_opt=>'FOR ALL COLUMNS SIZE 1',estimate_percent=>100,cascade=>TRUE);

select blocks, num_rows, avg_row_len from user_tables where table_name = 'PARTS';

BLOCKS NUM_ROWS AVG_ROW_LEN
---------- ---------- -----------
581 51750 73
select blevel, leaf_blocks, num_rows, clustering_factor from user_indexes where index_name = 'PARTS_UK';

BLEVEL LEAF_BLOCKS NUM_ROWS CLUSTERING_FACTOR
---------- ----------- ---------- -----------------
1 107 51750 563

-- begin analysis . What are Oracle's COST and CARD for this table ?
-- the impact is to be considered if these queries were subqueries in more detailed/complex queries

-- first case : All PART_IDs, ordered
explain plan for select part_id from parts order by part_id;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3769467330

------------------------------------------------------------------------------------
Id Operation Name Rows Bytes TempSpc Cost (%CPU) Time
------------------------------------------------------------------------------------
0 SELECT STATEMENT 51750 252K 297 (4) 00:00:04
1 SORT ORDER BY 51750 252K 1240K 297 (4) 00:00:04
2 TABLE ACCESS FULL PARTS 51750 252K 131 (2) 00:00:02
------------------------------------------------------------------------------------
rollback;

-- second case : All PART_IDs, order doesn't matter
explain plan for select part_id from parts;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3931018009

---------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------
0 SELECT STATEMENT 51750 252K 131 (2) 00:00:02
1 TABLE ACCESS FULL PARTS 51750 252K 131 (2) 00:00:02
---------------------------------------------------------------------------
rollback;

-- third case : Single PART_ID
explain plan for select part_id from parts where part_id=20001;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2169424942

------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
------------------------------------------------------------------------------
0 SELECT STATEMENT 1 5 1 (0) 00:00:01
* 1 INDEX UNIQUE SCAN PARTS_UK 1 5 1 (0) 00:00:01
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("PART_ID"=20001)
rollback;

-- third case : Range of PART_IDs
explain plan for select part_id from parts where part_id between 25000 and 25050;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2044929039

-----------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
-----------------------------------------------------------------------------
0 SELECT STATEMENT 47 235 2 (0) 00:00:01
* 1 INDEX RANGE SCAN PARTS_UK 47 235 2 (0) 00:00:01
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("PART_ID"geq25000 AND "PART_ID"leq25050)
rollback;


REM REM REM !!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-- We now set PART_ID to be NOT NULL
REM
ALTER TABLE PARTS MODIFY (PART_ID NOT NULL);

-- We re-run the test queries

-- first case : All PART_IDs, ordered
explain plan for select part_id from parts order by part_id;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 1189395249

-----------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
-----------------------------------------------------------------------------
0 SELECT STATEMENT 51750 252K 110 (2) 00:00:02
1 INDEX FULL SCAN PARTS_UK 51750 252K 110 (2) 00:00:02
-----------------------------------------------------------------------------
rollback;

-- second case : All PART_IDs, order doesn't matter
explain plan for select part_id from parts;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2423297136

---------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
---------------------------------------------------------------------------------
0 SELECT STATEMENT 51750 252K 26 (4) 00:00:01
1 INDEX FAST FULL SCAN PARTS_UK 51750 252K 26 (4) 00:00:01
---------------------------------------------------------------------------------
rollback;

-- third case : Single PART_ID
explain plan for select part_id from parts where part_id=20001;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2169424942

------------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
------------------------------------------------------------------------------
0 SELECT STATEMENT 1 5 1 (0) 00:00:01
* 1 INDEX UNIQUE SCAN PARTS_UK 1 5 1 (0) 00:00:01
------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("PART_ID"=20001)
rollback;

-- third case : Range of PART_IDs
explain plan for select part_id from parts where part_id between 25000 and 25050;
select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 2044929039

-----------------------------------------------------------------------------
Id Operation Name Rows Bytes Cost (%CPU) Time
-----------------------------------------------------------------------------
0 SELECT STATEMENT 47 235 2 (0) 00:00:01
* 1 INDEX RANGE SCAN PARTS_UK 47 235 2 (0) 00:00:01
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - access("PART_ID"geq25000 AND "PART_ID"leq25050)
rollback;





While the column (PART_ID) was not a NOT NULL, Oracle would choose a
FullTableScan for queries attempting to retrieve all the rows from the table.
The ORDER BY in the first query also adds to the COST.
Queries for a single value (20001) or a range of values (25000 to 25050) would be executed by Unique Scan and Range Scan respectively, as we expect.


What happens when the column is defined as a NOT NULL ? This, being a DDL,
invalidates parsed statements so the next execution of the same SQLs requires re-parsing.
This time, the Optimizer DOES decide to use the Index on PART_ID for the first two queries. We also see that where the query specifies an ORDER BY, Oracle does an Index Full Scan -- it "walks" the Leaf Blocks from the lowest value to the highest value, retrieving all the values in order. Oracle can now avoid having to do a SORT. Where an ORDER BY is not required, Oracle does a faster Index Fast Full Scan.

We have learnt three things :
1. If a column is guaranteed to contain no NULLs, it is best to explicitly specify it as a NOT NULL so that the Optimizer can take a "better informed" decision on the usability of an index on it.
2. When Oracle percieves a lower cost to an Index access, this can have a greater impact on the execution plan of a larger query which the table and column under consideration may only be a part of.
3. An ordered retrieval of rows can be executed by an Index Full Scan.

What I have not done in this test case is to force a fetch from table rows. The query can be satisfied by the Index alone, in this case. In the real world, too, we do come across and/or can create such indexes where a query or subquery is satisfied by an Index alone.

1 comment:

Alex Gorbachev said...

Interesting that indexes can be used to search for rows with NULL columns. This is actually the case you mentioned in the beginning - multiple columns index with at least one not null. Here are some details if interested.