13 February, 2007

Buffer Cache Hit Ratio GOOD or BAD ?

Recently, after a major upgrade of Database and Application version, and also to a new server, we had a Cache Hit Ratio of 99.62%. GOOD, you say ?
Very bad it was. That 99.62% also meant extremely high latch waits ("cache buffer chains") and very high Logical Reads.
With tuning, we brought Latch Waits down from 79.6% to 27.5% of the total Response Time, Logical Reads per DataBlock Changed down from 223 to 87 and an overall 74% reduction in Response Time.
BUT, our Hit Ratio [even after increasing the Buffer Cache by 50%] was down to 97.45%.
I was happy.
Very fortunately, we had Oracle Consultants who knew better than to look at the Cache Hit Ratio -- to focus on Logical Reads. Bring in people with the right attitude and setup the right KPIs and you will focus on the correct issues.

Updated 23-Feb : So, what was the problem ? The "cache buffer chains" latch waits was the indicator -- hot blocks, very high logical reads (reading the same blocks repeatedly) giving us a false Hit Ratio.

Updated 05-Mar : Latch waits are now down to less than 4% of total Response Time. Further tuning will be undertaken before we start looking at Physical Reads.

6 comments:

Anonymous said...

Hemant,

I understand that BCHR is gone.

We are also experiencing the same situation. I am curious to know what was actually done by Oracle
Consultants to resolve this issue.

Your response is highly appreciated.

Hemant K Chitale said...
This comment has been removed by the author.
Hemant K Chitale said...

What was done immediately by the Oracle Consultants was to make the observation that we had very high Logical Reads -- we needed to identify the "bad" SQLs (bad execution plans).

We did identify the major SQLs and tuned them immediately (the biggest one was an incorrect join).

If your BCHR is very high, it is more likely that you have bad SQLs. Look for SQLs with very high Buffer Gets (very high "consistent gets" in V$SESSTAT, very high "CPU used by this session")

Anonymous said...

Thanks for your reply Hemant. From your reply I feel that you might have rewritten the bad sql so that the logical reads are minimized. Let me know if I am wrong.

Related to this, I have one more question: What do you do when you see the same "bad" sql (in production) performing "good" in test environment (with almost same load)? In such a situation it is hard to justify that the sql is really bad.

Thanks once again.

Anonymous said...

To measure the performance benchmark of sql statements, hardware , storage, network infrastructure,oracle version etc. should be exactly same. The execution plan of sql are changed if you compare with different capacity harware like in production you have 8 cpu and test you have 2 cpu.

Hemant K Chitale said...

Not just the hardware, the data, the distribution and skew, the physical clustering of data and the statistics and, since 10g, system statistics all make a difference.