Thursday, July 28, 2005

Response time analysis for tuning Oracle SQL query

Jul. 21, 2005, Last Thursday I received a request from my manager to tune a SQL query statement. At the time, this SQL always take around 4 hours to finish. It's really an unacceptable response time!

After receiving this request, the first thing came in my mind is the methodology, response time analysis. Throughout these years, the methods about how to tune the Oracle performance comes and goes. It has been being developed and revised.

As the 1st generation, DBAs liked to use so-called ratio analysis. As the name indicated, people usually collect the hit ration of the various database SGA area from those statistics tables. After collecting people try to maximize the hit-ratio by enlarging the corresponding memory area and so forth. It's somewhat effective but most of time it's not. It does have it's own limitation as well.

Afterwards, the wait analysis emerged. It measures what the database is really waiting for at the time. This improved much in tuning Oracle performance. Many DBAs including me are still using this method to tune SQL, most of time. It can tell you immediately what a running SQL is waiting. By concentrating on what caused the waits and reduce/eliminate them, most of time you can efficiently and effectively improve the SQL performance. However, it can only tell you what the DB is waiting for, it can NOT tell you where the majority of time is being consumed. That's why the response time analysis comes in place and take the role!

The response time analysis measures how and where applications eat the time before the completion. It needs to turn on the Oracle trace feature. There is a lot of information in Oracle raw trace file as well as the TKPROF file produced from the raw file. By turn on TIME_STATISTICS parameter and enable the event 10046 trace, everything during the SQL execution could be recorded in raw trace file including CPU time, elapsed time, physical reads, consistent gets, db block gets and so on. You will be able to exactly know which part of a SQL consumes the time aggressively regardless the cache ratio and the waiting time. Hence you can focus on it and work against it!

Time to get back to the case! In this case, the wait analysis doesn't help much. There is always 'db file sequential read' in v$session_wait and a high CPU usage on OS level. Based on these facts, I really have no idea what's going on during the SQL execution. Even there is always 'db file sequential read' in v$session_wait, the total waited time is only a small portion of the total execution time told by v$session_event after finishing. So the root cause is neither an inefficient index nor the I/O problem. By turning on Oracle trace using 'oradebug' utility, I find out the root cause is there are too many consistent gets. The more the consistent gets, the longer elapsed time, the longer the CPU time. This also is the cause of the high CPU usage! This also bring another popular tuning topic out, minimizing the logical I/O. Once the logical I/O has been minimized, the physical ones will be minimized as well automatically. If you take care pennies, dollar will take care of itself! Following this clue, I also figure out the abnormal consistent gets are produced by the improper order of nested loops in the SQL execution plan generated by Oracle. By reversing the order, consistent gets dropped dramatically. From user's point of view, the response time dropped down by 87.5%, from original 80 minutes to 9 minutes! A big improvement!

Life can only be understood backwards,
but it must be lived forwards!