In my earlier post, I demonstrated
how Data Skew can change over time, and the difference between Oracle's Cardinality estimates and the real row count can increase. That was with a Partitioned Table (partitioned by Time).
I now demonstrate the behaviour with a non-Partitioned Table, using the same data. The SALES_NP table captures Sales orders for various types of products. PROD_ID=31 represents '3.5" diskettes". (More details are available in the
previous post).
I first create SALES_NP with data upto the Year 2000 :
SQL> drop table sales_np purge;
Table dropped.
SQL> create table sales_np as select * from sales_partitioned where 1=2;
Table created.
SQL> create index sales_np_prod_ndx on sales_np(prod_id);
Index created.
SQL> create index sales_np_time_ndx on sales_np(time_id);
Index created.
SQL>
SQL> -- Populate the table with data upto Year 2000
SQL> insert into sales_np select * from sales_partitioned where time_id < to_date('01-JAN-2001','DD-MON-YYYY');
659425 rows created.
SQL>
SQL> exec dbms_stats.gather_table_stats('','SALES_NP',estimate_percent=>100,granularity=>'ALL',degree=>4,-
> method_opt=>'FOR ALL COLUMNS SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);
PL/SQL procedure successfully completed.
SQL>
I then verify the row counts :
SQL> -- What are the total number of Sales records by year ?
SQL> select /*+ FULL(s) PARALLEL (s 4) */ to_char(time_id,'YYYY'), count(*)
2 from sales_np s
3 group by to_char(time_id,'YYYY')
4 order by 1
5 /
TO_C COUNT(*)
---- ----------
1998 178834
1999 247945
2000 232646
SQL>
SQL> -- How many sales exist for the range 01-Jan-1998 to 31-Dec-2001 ?
SQL> select count(*)
2 from sales_np
3 where time_id > to_date('31-DEC-1997','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
4 /
COUNT(*)
----------
659425
SQL>
SQL> -- How many sales of 3.5" diskettes were made in the range 01-Jan-1998 to 31-Dec-2001 ?
SQL> select prod_id, to_char(time_id,'YYYY'), count(*)
2 from sales_np
3 where
4 prod_id=31
5 and time_id > to_date('31-DEC-1997','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 group by prod_id,to_char(time_id,'YYYY')
7 order by 1,2
8 /
PROD_ID TO_C COUNT(*)
---------- ---- ----------
31 1998 6602
31 1999 6586
31 2000 7568
SQL>
SQL> -- How many sales 3.5" diskettes were made in 2000 ?
SQL> select prod_id, to_char(time_id,'YYYY'), count(*)
2 from sales_np
3 where
4 prod_id=31
5 and time_id > to_date('31-DEC-1999','DD-MON-YYYY') and time_id < to_date('01-JAN-2001','DD-MON-YYYY')
6 group by prod_id,to_char(time_id,'YYYY')
7 order by 1,2
8 /
PROD_ID TO_C COUNT(*)
---------- ---- ----------
31 2000 7568
SQL>
So, the Year 2000 saw 7,568 sales orders for 3.5" diskettes.
Let's see Oracle's cardinality estimates for each of the years :
SQL> REM REM ###################################################### #################
SQL> -- Verify the Cardinality Estimates for selected years
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1998','DD-MON-YYYY') and time_id < to_date('01-JAN-2000','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7829 | 93948 | 347 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 7829 | 93948 | 347 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 20756 | | 67 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TIME_ID" "TIME_ID">TO_DATE(' 1998-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
2 - access("PROD_ID"=31)
16 rows selected.
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1999','DD-MON-YYYY') and time_id < to_date('01-JAN-2001','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 7339 | 88068 | 347 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 7339 | 88068 | 347 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 20756 | | 67 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TIME_ID">TO_DATE(' 1999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID" 2 - access("PROD_ID"=31)
16 rows selected.
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3261648362
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19 | 228 | 40 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 19 | 228 | 40 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | SALES_NP_TIME_NDX | 602 | | 5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROD_ID"=31)
2 - access("TIME_ID">TO_DATE(' 2000-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID"
16 rows selected.
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2001','DD-MON-YYYY') and time_id < to_date('01-JAN-2003','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3261648362
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 19 | 228 | 28 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 19 | 228 | 28 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | SALES_NP_TIME_NDX | 401 | | 4 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROD_ID"=31)
2 - access("TIME_ID">TO_DATE(' 2001-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID"
16 rows selected.
SQL>
So we know the estimates for the years 1999 and 2000 (at 7829 and 7339 rows) aren't too far off the real row counts (6586 and 7568).
I now insert the data for the year 2001 :
SQL> -- let's insert 2001 data
SQL>
SQL> insert /*+ APPEND */ into sales_np
2 select *
3 from sales_partitioned
4 where 1=1
5 and time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 and decode(prod_id,31,mod(cust_id,3),0)=0
7 /
257877 rows created.
SQL> commit;
Commit complete.
SQL>
SQL> -- the number of total sales orders in 2001
SQL> select count(*)
2 from sales_np
3 where time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
4 /
COUNT(*)
----------
257877
SQL>
SQL> -- the number of sales orders for 3.5" diskettes in 2001
SQL> select count(*)
2 from sales_np
3 where time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
4 and prod_id=31
5 /
COUNT(*)
----------
811
SQL>
SQL> -- Gather Statistics
SQL> exec dbms_stats.gather_table_stats('','SALES_NP',estimate_percent=>100,granularity=>'ALL',degree=>4,-
> method_opt=>'FOR ALL COLUMNS SIZE 250',cascade=>TRUE,no_invalidate=>FALSE);
PL/SQL procedure successfully completed.
SQL>
So I have inserted 811 sales orders for 3.5" diskettes in the year 2001. (In the previous example, with a Partitioned Table, it was 2,352 orders in the year 2001 and 811 in the year 2002).
Let's re-check the Cardinality estimates. Remember : I did NOT change the count of rows in the years 1998 to 2000 ! They are still the same. I have only added new rows in 2001 such that we have much fewer orders for 3.5" diskettes, although the total number of rows for 2001 actually is higher than that for 2000. I have increased the skew for 3.5" diskettes -- representing significantly depressed sales for this product alone.
SQL> -- ReVerify the Cardinality Estimates by year
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1998','DD-MON-YYYY') and time_id < to_date('01-JAN-2000','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5852 | 70224 | 364 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 5852 | 70224 | 364 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 21567 | | 72 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TIME_ID" "TIME_ID">TO_DATE(' 1998-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))
2 - access("PROD_ID"=31)
16 rows selected.
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-1999','DD-MON-YYYY') and time_id < to_date('01-JAN-2001','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5477 | 65724 | 364 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 5477 | 65724 | 364 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 21567 | | 72 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TIME_ID">TO_DATE(' 1999-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID" 2 - access("PROD_ID"=31)
16 rows selected.
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2000','DD-MON-YYYY') and time_id < to_date('01-JAN-2002','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 945744872
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 6076 | 72912 | 364 (1)| 00:00:05 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 6076 | 72912 | 364 (1)| 00:00:05 |
|* 2 | INDEX RANGE SCAN | SALES_NP_PROD_NDX | 21567 | | 72 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("TIME_ID">TO_DATE(' 2000-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID" 2 - access("PROD_ID"=31)
16 rows selected.
SQL>
SQL> explain plan for
2 select prod_id
3 from sales_np
4 where prod_id=31
5 and time_id > to_date('31-DEC-2001','DD-MON-YYYY') and time_id < to_date('01-JAN-2003','DD-MON-YYYY')
6 /
Explained.
SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3261648362
-------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 15 | 180 | 46 (0)| 00:00:01 |
|* 1 | TABLE ACCESS BY INDEX ROWID| SALES_NP | 15 | 180 | 46 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | SALES_NP_TIME_NDX | 628 | | 5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("PROD_ID"=31)
2 - access("TIME_ID">TO_DATE(' 2001-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
"TIME_ID"
16 rows selected.
SQL>
Notice how the discrepancy between Actual Row Counts and Estimated Row Counts has significantly increased -- even for earlier years were there was absolutely no change in the data. Before inserting year 2001 data, the discrepancy between Actual and Estimated for the year 2000 was only 229 (7568-7339). With 2001 data, the discrepancy increased to 2191 (7568-5477). The discrepancy for the year 2001 is now very significant --- against an actual count of 811 rows, the estimated count is 6076 rows !
Actual versus Estimated Cardinality for PROD_ID=31
Year Actual Estimated Estimated
before inserting after inserting
Year 2001 rows Year 2001 rows
1999 6586 7829 5852
2000 7568 7339 5477
2001 0 19 na
2001 inserted 811 na 6076
2002 0 19 15
Why do we see such discrepancies ?
A. Estimated Row Counts (Cardinality Estimates) are based on Column Statistics.
B. Column Statistics are at the Table level (across all the years, not for specific years).
C. As data skew changes, the estimated counts get "distributed" across the years because the estimates are averaged out across the whole table.
.
.
.