Previous installment: Shakespeare’s advice for database upgrades First installment: DON’T PANIC
“Heigh-ho, sing heigh-ho, unto the green holly. Most friendship is feigning, most loving mere folly. Then heigh-ho, the holly. This life is most jolly.”
We can’t help attaching meaning to optimizer statistics. After all, they represent our data, don’t they? So we refresh them as frequently as possible and use the largest sample sizes that we can use. Recently, for the first time in my life, I encountered a group of DBAs who understood that statistics do not have any intrinsic meaning; that they are nothing more than a collection of numbers that influence the generation of query plans. In the case of these DBAs, they found that the strategy that worked best for their applications was to treat the optimizer statistics just as they would treat program code. Statistics were checked into the source code repository just as program code was checked into the source code repository. In other words, they did not regularly refresh optimizer statistics like the rest of the world. Even so, the query plans were good enough. But, most importantly, they were blessed with plan stability. In fact, plan stability was their primary goal.
Perhaps I give them too much credit. Perhaps they had not actually concluded that statistics do not have any intrinsic meaning. Perhaps they were just doing what they had found worked best for their applications in practice. I was reminded of my favorite quotes about statistics. But never in my wildest dreams did I imagine that one day I would meet a group that had adopted these suggestions to the extent that this group had done.
There is some official precedent for such a strategy. In the case of extremely volatile tables, one of the suggestions of the SQL tuning guide is to collect statistics when the tables contain some data and lock them. If this strategy works well for extremely volatile tables, might not it work for less volatile tables?
And, in the case of system statistics, most organizations use the default values even though they are hardly likely to be representative of any particular database server. That’s a legitimate strategy because, as I said before, statistics are nothing more than a collection of numbers that influence the generation of query plans.
Were these simple applications? Was this a group of junior DBAs? Were these small databases? Was this a small company? No, no, no, and no.
But surely there were some queries whose plans were not good enough? I’m sure there were. In those cases, they probably did what the rest of us do (other than refreshing statistics) when plans are not good enough; that is, they hinted the queries or rewrote them in ways that the optimizer can handle better.
You might also question how they handled columns for which the upper and lower bounds could not be predicted in advance. Well, they used column statistics only when necessary, sometimes going so far as to manually set the LOWVAL and HIVAL in SYS.HIST_HEAD$. And, they used column histograms only when necessary.
Here is a tiny demonstration which demonstrates that the optimizer may be able to construct reasonably efficient query plans even without a full set of statistics. The optimizer can use any information that is available to it; for example, size information and index information. After all, “small tables tend to stay small, large tables tend to stay large, unique indexes have a tendency to stay unique, and non-unique indexes often stay non-unique.”
The following simple query was tested in the HR sample schema with different sets of statistics in Oracle Database 188.8.131.52: select sum(salary) from employees where department_id = 80. The results are shown in the following table. The Consistent Gets column indicates the efficiency of the query plan; the lower the better. The final test involved a full set of statistics but did not result in the most efficient query plan.
No table statistics
INDEX RANGE SCAN (EMP_DEPARTMENT_IX)TABLE ACCESS BY INDEX ROWID BATCHED (EMPLOYEES)
TABLE ACCESS FULL (EMPLOYEES)
Statistics gathered without histograms
LOWVAL and HIVAL removed from SYS.HIST_HEAD$
Column statistics deleted
Histograms of maximum size
The full script and log file are in statistics.zip (attached). Fascinating stuff.
Optimizer statistics do not have any intrinsic meaning; that they are nothing more than a collection of numbers that influence the generation of query plans. Treating statistics like source code can be a valid strategy and confer a great measure of plan stability.
For varied points of view, please read the panel discussion “Statistics—How and When” in the August 2007 issue of the NoCOUG Journal and the debate “Should we stop analyzing?” initiated on 30 Dec 2003 by Mogens Nørgaard in the Oracle-L mailing list.
Copyright © 2015 Iggy Fernandez