31 May, 2011

Interpreting an AWR report when the ArchiveLog Dest or FRA was full

It can be very interesting / challenging / confusing / misleading / wrong to attempt to interpret an AWR report for the duration when ArchiveLogs could not be generated. Such a case is when the ArchiveLog Destination directory or FRA (whichever is in use) went to 100% full usage (or hit the db_recovery_file_dest_size and Oracle cannot auto-delete any files in the FRA, all files present being necessary to meet the RETENTION Policy).

Take such a case which I simulated today. I started a "batch" job that generates a high volume of redo and also does some disconnect/reconnect calls. This job started at Tue May 31 22:56:30 SGT 2011 and completed at Tue May 31 23:13:28 SGT 2011 (I've taken a small time frame as I cannot afford to sit up to 3am to simulate a 4hour job !).
The User / Application Adminsitrator / DBA may think that the job to 17 minutes to run. This job was expected to complete in 5 minutes. (extrapolate to 4hours and 1hour respectively for a "real world" job ?).
The DBA was asked to investigate why the job took 3 times as long. He generated an AWR report.
             Snap Id      Snap Time      Sessions Curs/Sess
--------- ------------------- -------- ---------
Begin Snap: 1215 31-May-11 22:52:52 30 1.2
End Snap: 1216 31-May-11 23:12:47 34 1.0
Elapsed: 19.92 (mins)
DB Time: 119.47 (mins)

Load Profile Per Second Per Transaction Per Exec Per Call
~~~~~~~~~~~~ --------------- --------------- ---------- ----------
DB Time(s): 6.0 57.4 0.62 4.17
DB CPU(s): 0.0 0.4 0.00 0.03
Redo size: 386,454.6 3,694,963.3
Logical reads: 1,557.5 14,891.5
Block changes: 1,822.6 17,426.5
Physical reads: 2.1 19.8
Physical writes: 29.0 276.8
User calls: 1.4 13.7
Parses: 6.0 57.0
Hard parses: 0.4 3.6
W/A MB processed: 0.1 0.7
Logons: 0.1 0.7
Executes: 9.6 92.0
Rollbacks: 0.0 0.0
Transactions: 0.1

Instance Efficiency Percentages (Target 100%)
Buffer Nowait %: 99.84 Redo NoWait %: 98.01
Buffer Hit %: 99.87 In-memory Sort %: 100.00
Library Hit %: 92.83 Soft Parse %: 93.71
Execute to Parse %: 38.05 Latch Hit %: 100.00
Parse CPU to Parse Elapsd %: 39.34 % Non-Parse CPU: 96.00

Shared Pool Statistics Begin End
------ ------
Memory Usage %: 70.72 78.91
% SQL with executions>1: 55.21 86.47
% Memory for SQL w/exec>1: 56.19 80.02

Only 440MB of redo was generated in 19.92 minutes. Only about 12MB of logical I/O per second ?
And DB Time was a phenomenal 119.47 minutes against an elapsed 19.92minutes ?
Why was the job so slow ? Was the CPU running slowly ? Was there a memory fault ? Were I/Os performed very slowly ?
Top 5 Timed Foreground Events
wait % DB
Event Waits Time(s) (ms) time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
buffer busy waits 5 1,964 4.E+05 27.4 Concurrenc
log file switch (archiving nee 5 1,900 4.E+05 26.5 Configurat
DB CPU 48 .7
JS kgl get object wait 30 24 803 .3 Administra
log buffer space 32 19 585 .3 Configurat

OMG ! Buffer Busy Waits accounted for 27.4% of total database response time !
Yes, log file switch (archiving needed) accounted for 26.5% of total database response time !
Action Item for the DBA : He MUST find the reason for those buffer busy waits ! These seem to be more important than the log file switch waits. He MUST find why the job ran so slowly.

This is the sort of error that occurs when you look at figures in isolation and also when you look at figures "averaged" out over the AWR elapsed duration.

But guess what ? The job didn't run "slow". The server CPU and Memory and I/O didn't run "slow" or had faults. The database didn't run "slow".
What had happened is that all Redo generation halted when LGWR couldn't switch into an Online Redo Log file because it hadn't been archived.
The database instance was "stuck" at this position for almost 13minutes. From 23:00:35 :
Tue May 31 23:00:35 2011
Thread 1 advanced to log sequence 37 (LGWR switch)
Current log# 1 seq# 37 mem# 0: /home/oracle/app/oracle/oradata/orcl/redo01.log
Tue May 31 23:00:35 2011
Errors in file /home/oracle/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_arc0_2435.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 4294967296 bytes is 100.00% used, and has 0 remaining bytes available.
You have following choices to free up space from recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
2. Back up files to tertiary device such as tape using RMAN
3. Add disk space and increase db_recovery_file_dest_size parameter to
reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
system command was used to delete files, then use RMAN CROSSCHECK and
Errors in file /home/oracle/app/oracle/diag/rdbms/orcl/orcl/trace/orcl_arc0_2435.trc:
ORA-19809: limit exceeded for recovery files
ORA-19804: cannot reclaim 41253888 bytes disk space from 4294967296 limit
ARC0: Error 19809 Creating archive log file to '/addtl/oracle/flash_recovery_area/ORCL/archivelog/2011_05_31/o1_mf_1_36_%u_.arc'
ARCH: Archival stopped, error occurred. Will continue retrying
ORACLE Instance orcl - Archival Error
ORA-16038: log 3 sequence# 36 cannot be archived
ORA-19809: limit exceeded for recovery files

To 23:12:58 :
Tue May 31 23:12:58 2011
Archived Log entry 605 added for thread 1 sequence 36 ID 0x4c1c54ba dest 1:
Archiver process freed from errors. No longer stopped
Tue May 31 23:12:59 2011
Thread 1 advanced to log sequence 39 (LGWR switch)
Current log# 3 seq# 39 mem# 0: /home/oracle/app/oracle/oradata/orcl/redo03.log

Of the 19.92 minutes, 12minutes22seconds were completely lost when the database instance was "frozen".
That 440MB of Redo was not generated over 19.92minutes but over about 7minutes33seconds.

Guess what ? Those 1,964 seconds of buffer busy waits were the process "stuck" in the wait -- the buffer wasn't being freed because no activity was possible. The actual number of logical I/Os wasn't 1,557.5 per second but double that -- because the database instance was active for only half the elapsed duration.

Remember :
1. One wait event may well be attributed to another wait event. AWR or the DBA or the Performance Analyst may wrongly "double count" waits.

2. Statistics presented for an elapsed duration do not necessarily mean that every second of that duration was equally loaded. Some of the elapsed duration (9minutes22seconds here) may have been completely idle time. Unfortunately, when we read an AWR we assume that the whole duration of the AWR has had a uniform load.

3. Statistics are accumulated for multiple sessions. In less than 20minutes elapsed time we had 32minutes of "buffer busy waits" time !

4. Such wrong interpretations are more pernicious (if that is the right word) when an AWR is for a longer duration. If a job runs for 30minutes and no other activity occurs over the next 30minutes, an AWR report for a 60minute duration will under-report the load of the job by 50% !


1 comment:

Joel Garry said...

And yet, you can bet you would have gotten a 3:00AM phone call if any online users were locked up! Or if you had an alert for such a situation, if appropriate.

Great example of scoping, and reminder to check the alert log first.

word: unctin