28 January, 2011

Transaction Failure --- when is the error returned ?

There was a question in the orafaq forums
"What happens when UNDO space is full? Transaction takes more time to ERROR?"
where my response was to the effect that the user sees the error message only after the rollback is completed.
Thus,
a) if a transaction began at 1pm
b) but, at 3pm, found that it had run out of space in the Undo tablespace (e.g. the datafile is fully occupied and does not have autoextend enable) and other undo extents / segments cannot be expired and reused
c) the user/client does not see the ORA-30036 error at 3pm but only after the (automatic) rollback of the transaction is completed -- which could be at 3:45pm.

Therefore, the time you see the error is not really the exact time when the error occurred. You are presented the error message only after the rollback is completed and the database is returned to a consistent state.

Here's a simple demonstration. I run two "large" transactions (and have a relatively very small Undo tablespace) and check the timestamps of the transactions and messages :

00:17:43 SQL>
00:17:43 SQL> select count(*) from sales_lines;

COUNT(*)
----------
100000

00:17:43 SQL>
00:17:43 SQL> drop table sales_lines_multi purge;

Table dropped.

00:17:44 SQL> create table sales_lines_multi as select * from sales_lines where 1=2;

Table created.

00:17:44 SQL>
00:17:44 SQL> select systimestamp from dual;

SYSTIMESTAMP
---------------------------------------------------------------------------
28-JAN-11 12.17.44.385819 AM +08:00

00:17:44 SQL> insert into sales_lines_multi
00:17:44 2 select * from sales_lines
00:17:44 3 union all
00:17:44 4 select * from sales_lines
00:17:44 5 union all
00:17:44 6 select * from sales_lines
00:17:44 7 union all
00:17:44 8 select * from sales_lines
00:17:44 9 /

400000 rows created.

00:17:47 SQL> select systimestamp from dual;

SYSTIMESTAMP
---------------------------------------------------------------------------
28-JAN-11 12.17.47.414854 AM +08:00

00:17:47 SQL>
00:17:47 SQL> commit;

Commit complete.

00:17:47 SQL> exec dbms_lock.sleep(10);

PL/SQL procedure successfully completed.

00:17:57 SQL>
00:17:57 SQL> select systimestamp from dual;

SYSTIMESTAMP
---------------------------------------------------------------------------
28-JAN-11 12.17.57.500389 AM +08:00

00:17:57 SQL> delete sales_lines_multi;
delete sales_lines_multi
*
ERROR at line 1:
ORA-30036: unable to extend segment by 8 in undo tablespace 'UNDO'


00:18:04 SQL> select systimestamp from dual;

SYSTIMESTAMP
---------------------------------------------------------------------------
28-JAN-11 12.18.04.386362 AM +08:00

00:18:04 SQL>
00:18:04 SQL> exec dbms_lock.sleep(10);

PL/SQL procedure successfully completed.

00:18:14 SQL>
00:18:14 SQL> select systimestamp from dual;

SYSTIMESTAMP
---------------------------------------------------------------------------
28-JAN-11 12.18.14.412227 AM +08:00

00:18:14 SQL> rollback;

Rollback complete.

00:18:14 SQL> select systimestamp from dual;

SYSTIMESTAMP
---------------------------------------------------------------------------
28-JAN-11 12.18.14.437548 AM +08:00

00:18:14 SQL>
00:18:14 SQL> select count(*) from sales_lines_multi;

COUNT(*)
----------
400000

00:18:14 SQL>

Thus, I had an INSERT transaction that ran from 12.17.44.385819 to 12.17.47.414854 -- for approximately 3seconds. This transaction completed successfully.

Next, I had a DELETE transaction that ran from 12.17.57.500389 and reported an ORA-30036 slightly before 12.18.04.386362. Therefore, it seems to have run for about 7seconds before it failed. However, did the session actually fail on undo space at 12.18.04 ?

To monitor this, I wrote and ran a procedure :

create or replace procedure log_transaction_proc
as
runcnt number;
cursor c1 is select rownum from dual where rownum < addr="s.taddr" username =" 'HEMANT';"

This procedure wrote the the "log_transaction" table approximately every 5centi-seconds if it found an active transaction by the user 'HEMANT'.
UPDATE : I notice that there's a bug in the procedure which causes it to run in an infinite loop. However, that doesn't invalidate the findings that this blog post identifies (the procedure does write to the log_transaction table !).


These were some of the rows in the log_transaction table :

T_STAMP USED_UREC USED_UBLK
--------------------------------------------------------------------------- ---------- ----------
28-JAN-11 12.17.43.900921 AM 41 3
28-JAN-11 12.17.44.365687 AM 48 1
28-JAN-11 12.17.44.477116 AM 66 2
28-JAN-11 12.17.44.532674 AM 175 4
28-JAN-11 12.17.44.586143 AM 344 8
................
28-JAN-11 12.17.47.280307 AM 8102 164
28-JAN-11 12.17.47.335687 AM 8156 165
28-JAN-11 12.17.47.388669 AM 8390 169
28-JAN-11 12.17.57.593152 AM 2133 52
28-JAN-11 12.17.57.647491 AM 3915 94

Note how it found NO transaction between 12.17.47.388669 and 12.17.57.593152. Essentially, the first transaction ended some time after 12.17.47.388669 but within 5centiseconds after that !
This first transaction was the INSERT of 400,000 rows that actually required only slightly over 8,390 undo records and 169 undo blocks. (The Undo for an INSERT is actually very small).

Next, I saw these entries for the DELETE :

T_STAMP USED_UREC USED_UBLK
--------------------------------------------------------------------------- ---------- ----------
28-JAN-11 12.17.57.593152 AM 2133 52
28-JAN-11 12.17.57.647491 AM 3915 94
28-JAN-11 12.17.57.702068 AM 5897 142
28-JAN-11 12.17.57.756098 AM 8273 198
28-JAN-11 12.17.57.810300 AM 10216 245
................
28-JAN-11 12.18.01.841294 AM 119177 2839
28-JAN-11 12.18.01.894567 AM 121401 2892
28-JAN-11 12.18.01.947189 AM 121523 2895
28-JAN-11 12.18.02.000468 AM 118341 2819
28-JAN-11 12.18.02.053156 AM 115310 2747
................
28-JAN-11 12.18.04.086817 AM 14787 353
28-JAN-11 12.18.04.143813 AM 11674 279
28-JAN-11 12.18.04.200892 AM 8378 201
28-JAN-11 12.18.04.259882 AM 5457 131
28-JAN-11 12.18.04.319280 AM 2189 53


The transaction kept growing only upto sometime within 5centiseconds after 12.18.01.947189 when it was 121,523 undo rows. (We cannot know how many table rows did get deleted at this point). However, the transaction started shrinking by 12.18.02.000468. Why did it start shrinking ? Because it had failed ! (here we discover later that the failure was ORA-30036). The transaction kept shrinking --- i.e. it kept undergoing the "automatic" rollback until sometime after 12.18.04.319280. We know from the SQL execution log that the ORA-30036 error was returned to the client just before 12.18.04.386362.

Therefore, the transaction actually went out of Undo space between 12.18.01.947189 and 12.18.02.000468. The user/client/application saw the error message only slightly before 12.18.04.386362 --- i.e. more than 2 seconds later. This difference was the time it took Oracle to actually rollback the failed transaction.

I have seen cases where the time difference between when a transaction actually ran out of undo space and when the error was reported to the user was 30minutes in some cases and more than an hour in more extreme cases.

Note : The ORA-30036 error does not get reported in the alert.log file.

.
.
.

3 comments:

SydOracle said...

Nice article.
Its not just transaction failure.

Any statement that fails has to rollback any changes made by that statement prior to the failure. It has to complete the rollback before allowing the session to issue any further statements.

Hemant K Chitale said...

Gary,

However, we have to be careful in identifying such statements.

For example, if a bug causes the server process to crash/terminate/self-destruct, the user/client gets an error immediately (or at the next call). However, it is left to PMON to notice that the server process has failed and have SMON do a rollback.
That is a case where a user might get an error message (disconnection or ORA-600 or ORA-7445) before the rollback is completed -- because the user's server process itself cannot execute the rollback as it has 'died'.
The rollback is executed in the background.

Hemant

Foued said...

Nice to read, thanks