20 October, 2010

How the Optimizer can use Constraint Definitions

A simple demonstration of how properly defined Referential Integrity constraints can be used by the Optimizer.....

If two tables (say SALES and SALES_LINES) have a Parent-Child relationship but the proper R.I. constraint is not defined, it is possible (either through faulty application code that uses two separate transactions for an INSERT {or a DELETE} against the two tables OR through erroneous adhoc updates to the data) to have "dangling" child records.
For example, an INSERT into SALES errors out (for lack of space in that tablespace) but the corresponding INSERT into SALES_LINES succeeds and commits, when executed as a separate transaction.
Or an adhoc "data-fix" operation deletes rows from the SALES table without having deleted them from the SALES_LINES table first.

Having UNIQUE Indexes on the two tables will not prevent the occurrence of "parent-less children" !

When querying the SALES_LINES table, it might be necessary to validate that the corresponding SALES row exists. This could be done as a join between the two tables, explicitly included in every query against the SALES_LINES table.

Assuming that the two tables have 100,000 rows (each sale having only 1 line), with proper Unique Indexes, a query for one PROD_ID of 10 different products may execute as :


SQL> -- Query for Total Sales for PROD_ID=5
SQL> select sum(l.quantity), sum(l.quantity*l.price)
2 from sales_lines l, sales s
3 where
4 -- join code is used to validate that the sale_id is legitimate
5 l.sale_id = s.sale_id
6 and l.prod_id=5
7 and l.sale_id between 25000 and 35000
8 /

SUM(L.QUANTITY) SUM(L.QUANTITY*L.PRICE)
--------------- -----------------------
2441786 48892118.9


Execution Plan
----------------------------------------------------------
Plan hash value: 146457679

------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 39 | 139 (0)| 00:00:02 |
| 1 | SORT AGGREGATE | | 1 | 39 | | |
| 2 | NESTED LOOPS | | 984 | 38376 | 139 (0)| 00:00:02 |
|* 3 | TABLE ACCESS BY INDEX ROWID| SALES_LINES | 984 | 33456 | 139 (0)| 00:00:02 |
|* 4 | INDEX RANGE SCAN | SALES_LINES_UK | 10000 | | 24 (0)| 00:00:01 |
|* 5 | INDEX UNIQUE SCAN | SALES_UK | 1 | 5 | 0 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------


3 - filter("L"."PROD_ID"=5)
4 - access("L"."SALE_ID">=25000 AND "L"."SALE_ID"<=35000)

5 - access("L"."SALE_ID"="S"."SALE_ID")
filter("S"."SALE_ID"<=35000 AND "S"."SALE_ID">=25000)

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
281 consistent gets
0 physical reads
0 redo size
523 bytes sent via SQL*Net to client
419 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

However, if I were to define the Constraints as :

SQL> REM REM ######################################
SQL> REM Now add the Constraint Definitions !
SQL> alter table SALES add constraint SALES_PK primary key (sale_id);

Table altered.

SQL> alter table SALES_LINES add constraint SALES_LINES_FK foreign key (sale_id) references SALES (sale_id);

Table altered.

SQL> REM REM ######################################

and I re-run the query :

SQL> select sum(l.quantity), sum(l.quantity*l.price)
2 from sales_lines l, sales s
3 where
4 -- join code is used to validate that the sale_id is legitimate
5 l.sale_id = s.sale_id
6 and l.prod_id=5
7 and l.sale_id between 25000 and 35000
8 /

SUM(L.QUANTITY) SUM(L.QUANTITY*L.PRICE)
--------------- -----------------------
2441786 48892118.9


Execution Plan
----------------------------------------------------------
Plan hash value: 2517766180

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 34 | 139 (0)| 00:00:02 |
| 1 | SORT AGGREGATE | | 1 | 34 | | |
|* 2 | TABLE ACCESS BY INDEX ROWID| SALES_LINES | 984 | 33456 | 139 (0)| 00:00:02 |
|* 3 | INDEX RANGE SCAN | SALES_LINES_UK | 10000 | | 24 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("L"."PROD_ID"=5)
3 - access("L"."SALE_ID">=25000 AND "L"."SALE_ID"<=35000)


Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
139 consistent gets
0 physical reads
0 redo size
523 bytes sent via SQL*Net to client
419 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed

Oracle can simply eliminate the lookup on the SALES table and do much fewer consistent gets.
The presence of the constraint ensures that every SALE_ID in the (child) SALES_LINES table *does* have corresponding SALE_ID in the (parent) SALES table. Since I am not fetching any columns from the SALES table, the join is now unnecessary and the optimizer (smartly) eliminates the join.

.

UPDATE : See the Optimizer team's blog post on Table (Join) Elimination.

.

.
.

08 October, 2010

Featured in Oracle Magazine

The November-December 2010 issue of Oracle Magazine includes a short interview in the Peer-to-Peer column.

.
.
.

02 October, 2010

Data Skew and Cardinality Changing --- 2

In my earlier post, I demonstrated how Data Skew can change over time, and the difference between Oracle's Cardinality estimates and the real row count can increase. That was with a Partitioned Table (partitioned by Time).

I now demonstrate the behaviour with a non-Partitioned Table, using the same data. The SALES_NP table captures Sales orders for various types of products. PROD_ID=31 represents '3.5" diskettes". (More details are available in the previous post).

I first create SALES_NP with data upto the Year 2000 :

SQL> drop table sales_np purge;

Table dropped.

SQL> create table sales_np as select * from sales_partitioned where 1=2;

Table created.

SQL> create index sales_np_prod_ndx on sales_np(prod_id);

Index created.

SQL> create index sales_np_time_ndx on sales_np(time_id);

Index created.

SQL>
SQL> -- Populate the table with data upto Year 2000
SQL> insert into sales_np select * from sales_partitioned where time_id < to_date('01-JAN-2001','DD-MON-YYYY');

659425 rows created.

SQL>
SQL> exec dbms_stats.gather_table_stats('','SALES_NP',estimate_percent=>100,granularity=>'ALL',degree=>4,-
> method_opt=>'FOR ALL COLUMNS SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);

PL/SQL procedure successfully completed.

SQL>


I then verify the row counts :

SQL> -- What are the total number of Sales records by year ?
SQL> select /*+ FULL(s) PARALLEL (s 4) */ to_char(time_id,'YYYY'), count(*)
2 from sales_np s
3 group by to_char(time_id,'YYYY')
4 order by 1
5 /

TO_C COUNT(*)
---- ----------
1998 178834
1999 247945
2000 232646

SQL>
SQL> -- How many sales exist for the range 01-Jan-1998 to 31-Dec-2001 ?
SQL> select count(*)
2 from sales_np
3 where time_id > to_date('31-DEC-1997','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
4 /

COUNT(*)
----------
659425

SQL>
SQL> -- How many sales of 3.5" diskettes were made in the range 01-Jan-1998 to 31-Dec-2001 ?
SQL> select prod_id, to_char(time_id,'YYYY'), count(*)
2 from sales_np
3 where
4 prod_id=31
5 and time_id > to_date('31-DEC-1997','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 group by prod_id,to_char(time_id,'YYYY')
7 order by 1,2
8 /

PROD_ID TO_C COUNT(*)
---------- ---- ----------
31 1998 6602
31 1999 6586
31 2000 7568

SQL>
SQL> -- How many sales 3.5" diskettes were made in 2000 ?
SQL> select prod_id, to_char(time_id,'YYYY'), count(*)
2 from sales_np
3 where
4 prod_id=31
5 and time_id > to_date('31-DEC-1999','DD-MON-YYYY') and time_id < to_date('01-JAN-2001','DD-MON-YYYY')
6 group by prod_id,to_char(time_id,'YYYY')
7 order by 1,2
8 /

PROD_ID TO_C COUNT(*)
---------- ---- ----------
31 2000 7568

SQL>

So, the Year 2000 saw 7,568 sales orders for 3.5" diskettes.

Let's see Oracle's cardinality estimates for each of the years :

SQL> REM REM ###################################################### #################
SQL> -- Verify the Cardinality Estimates for selected years
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1998','DD-MON-YYYY') and time_id < to_date('01-JAN-2000','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7829 | 93948 | 347 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 7829 | 93948 | 347 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 20756 | | 67 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("TIME_ID" "TIME_ID">TO_DATE(' 1998-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
2 - access("PROD_ID"=31)

16 rows selected.

SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1999','DD-MON-YYYY') and time_id < to_date('01-JAN-2001','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7339 | 88068 | 347 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 7339 | 88068 | 347 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 20756 | | 67 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("TIME_ID">TO_DATE(' 1999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID" 2 - access("PROD_ID"=31)

16 rows selected.

SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3261648362

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19 | 228 | 40 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 19 | 228 | 40 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | SALES_NP_TIME_NDX | 602 | | 5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("PROD_ID"=31)
2 - access("TIME_ID">TO_DATE(' 2000-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID"
16 rows selected.

SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2001','DD-MON-YYYY') and time_id < to_date('01-JAN-2003','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3261648362

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19 | 228 | 28 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 19 | 228 | 28 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | SALES_NP_TIME_NDX | 401 | | 4 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("PROD_ID"=31)
2 - access("TIME_ID">TO_DATE(' 2001-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID"
16 rows selected.

SQL>

So we know the estimates for the years 1999 and 2000 (at 7829 and 7339 rows) aren't too far off the real row counts (6586 and 7568).

I now insert the data for the year 2001 :

SQL> -- let's insert 2001 data
SQL>
SQL> insert /*+ APPEND */ into sales_np
2 select *
3 from sales_partitioned
4 where 1=1
5 and time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 and decode(prod_id,31,mod(cust_id,3),0)=0
7 /

257877 rows created.

SQL> commit;

Commit complete.

SQL>
SQL> -- the number of total sales orders in 2001
SQL> select count(*)
2 from sales_np
3 where time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
4 /

COUNT(*)
----------
257877

SQL>
SQL> -- the number of sales orders for 3.5" diskettes in 2001
SQL> select count(*)
2 from sales_np
3 where time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
4 and prod_id=31
5 /

COUNT(*)
----------
811

SQL>
SQL> -- Gather Statistics
SQL> exec dbms_stats.gather_table_stats('','SALES_NP',estimate_percent=>100,granularity=>'ALL',degree=>4,-
> method_opt=>'FOR ALL COLUMNS SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);

PL/SQL procedure successfully completed.

SQL>

So I have inserted 811 sales orders for 3.5" diskettes in the year 2001. (In the previous example, with a Partitioned Table, it was 2,352 orders in the year 2001 and 811 in the year 2002).

Let's re-check the Cardinality estimates. Remember : I did NOT change the count of rows in the years 1998 to 2000 ! They are still the same. I have only added new rows in 2001 such that we have much fewer orders for 3.5" diskettes, although the total number of rows for 2001 actually is higher than that for 2000. I have increased the skew for 3.5" diskettes -- representing significantly depressed sales for this product alone.


SQL> -- ReVerify the Cardinality Estimates by year
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1998','DD-MON-YYYY') and time_id < to_date('01-JAN-2000','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5852 | 70224 | 364 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 5852 | 70224 | 364 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 21567 | | 72 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("TIME_ID" "TIME_ID">TO_DATE(' 1998-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
2 - access("PROD_ID"=31)

16 rows selected.

SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1999','DD-MON-YYYY') and time_id < to_date('01-JAN-2001','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5477 | 65724 | 364 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 5477 | 65724 | 364 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 21567 | | 72 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("TIME_ID">TO_DATE(' 1999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID" 2 - access("PROD_ID"=31)

16 rows selected.

SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6076 | 72912 | 364 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 6076 | 72912 | 364 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 21567 | | 72 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("TIME_ID">TO_DATE(' 2000-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID" 2 - access("PROD_ID"=31)

16 rows selected.

SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2001','DD-MON-YYYY') and time_id < to_date('01-JAN-2003','DD-MON-YYYY')
6 /

Explained.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3261648362

-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 15 | 180 | 46 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 15 | 180 | 46 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | SALES_NP_TIME_NDX | 628 | | 5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("PROD_ID"=31)
2 - access("TIME_ID">TO_DATE(' 2001-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID"
16 rows selected.

SQL>


Notice how the discrepancy between Actual Row Counts and Estimated Row Counts has significantly increased -- even for earlier years were there was absolutely no change in the data. Before inserting year 2001 data, the discrepancy between Actual and Estimated for the year 2000 was only 229 (7568-7339). With 2001 data, the discrepancy increased to 2191 (7568-5477). The discrepancy for the year 2001 is now very significant --- against an actual count of 811 rows, the estimated count is 6076 rows !


Actual versus Estimated Cardinality for PROD_ID=31


Year Actual Estimated Estimated
before inserting after inserting
Year 2001 rows Year 2001 rows
1999 6586 7829 5852
2000 7568 7339 5477
2001 0 19 na
2001 inserted 811 na 6076
2002 0 19 15



Why do we see such discrepancies ?
A. Estimated Row Counts (Cardinality Estimates) are based on Column Statistics.
B. Column Statistics are at the Table level (across all the years, not for specific years).
C. As data skew changes, the estimated counts get "distributed" across the years because the estimates are averaged out across the whole table.

.
.
.

01 October, 2010

Most Popular Posts

Blogger's "Stats" feature shows that the 3 most popular posts in the past 30 days have been :
3. NLS_DATE_FORMAT : 144 pageviews

.
.
.