I've been speaking and writing about optimizer statistics for many years, but it seems like everywhere I go - be it a speaking engagement or a client site - I'm always asked about gathering statistics and how they are used. So, I thought I'd begin a several part series of "keep it simple" blog entries to cover the many aspects of statistics gathering and usage. I don't know how many parts this will end up being, but I'll just keep posting until I can't think of anything else.
In this first post, I'll provide a brief overview of what statistics are and what they do to set the foundation for future parts of the series.
The Oracle optimizer - I'll call it the CBO (cost-based optimizer) for short - is the heart and soul of the Oracle beast as far as SQL processing is concerned. The CBO's main job is to determine which operations get executed in order to retrieve the SQL result set in an optimal amount of time (i.e. as fast as possible). There are many tasks involved in how the CBO goes about deciding which operations are best, but the key element that helps the CBO determine what to do are the object statistics (such as tables and indexes) and system statistics (such as IO and CPU utilization metrics). Actually, the name Cost-Based Optimizer says a lot about what Oracle does. It uses the statistics (and a few other things like parameter settings that we won't discuss in this series) to establish a "cost" for each possible operation. Each of the individual costs for a given operation - operations like how to access your data or how to join that data together - are then compared with each other and the lowest cost operation is the winner and is selected for use in the final execution plan.
The really cool thing about this process is that the basic statistical calculations the CBO uses are just that...basic. If you've ever kept track of the batting average of your favorite baseball player, played a hand of blackjack, or any number of common, everyday things, you've utilized basic statistics. Admittedly, the CBO utilizes much more complex and sophisticated statistical algorithms, but most of the basics are, well, basic. If you execute a SQL statement that includes WHERE day_of_week = 'Tuesday', the statistic for the day_of_week column will tell Oracle that there are 7 distinct values possible. Since your query only wants rows where day_of_week = 'Tuesday', the CBO will compute that statistically speaking, one out of every seven rows would likely be 'Tuesday'. [By the way, this assumes you have no odd distributions of your data and that you've entered rows for all 7 days of the week, but we'll talk about that later in this series.] Therefore, if you have 7000 rows in the table (which is another statistic!), the CBO will compute that your query should return 1000 rows. The actual formula for this equality condition predicate would be:
number of rows in the table x (1/number of distinct values in the column used in the predicate) which in this case is 7000 x 1/7 = 1000
The CBO now is able to use this bit of information to help decide if using an index on the day_of_week column might help return those 1000 rows faster than using a full table scan would. There are actually two terms to remember associated with this formula: cardinality and selectivity. Cardinality is the number of rows estimated to be returned (1000). Selectivity is the percentage of the whole that is estimated to be returned (1/7). Both numbers are important and I'll be expanding on each throughout the series.
As I've already mentioned, there's more to it than this as there are many more ways to formulate your WHERE clauses than by just using a single equality predicate (and I'll discuss more of these calculations later in the series). But, if you start with this basic understanding of how statistics are used, you've made a good start. The key bit here is to note that the CBO must figure out how selective your query predicates are in order to determine how many rows will be returned in order to determine how to best return those rows. It's similar to computing the likelihood that your favorite baseball player is to get a hit. If every fourth time he bats, he gets a hit his batting average will be 1/4 or .250. And just like in baseball, there will be days that the player will do better than on other days, but generally speaking, his average hovers around getting a hit 1 out of every 4 at bats. The CBO is like that - it computes the statistically correct value most of the time.
Now that I've laid out a few basics concepts, it'd be good to know where these statistics are kept. To find the basic table statistics, look in the %_TABLES (where % represents either the prefix USER, ALL, or DBA) views. You can get the full description of the view and the columns it contains in the Oracle Database Reference (the 11.2 link). The columns from this view that are populated when statistics are collected are: NUM_ROWS, BLOCKS, AVG_SPACE, CHAIN_CNT, AVG_ROW_LEN. There are other columns that provide information to the CBO, but these are the columns that will come into play most for the statistical computations that are used to determine the execution plan.
The column statistics are found in the ?_TAB_COL_STATISTICS view (the 11.2 link). The key columns in this view are: NUM_DISTINCT, LOW_VALUE, HIGH_VALUE, NUM_NULLS, NUM_BUCKETS, AVG_COL_LEN, HISTOGRAM, and DENSITY (DENSITY is basically - but not always - the stored computation of 1/NUM_DISTINCT).
There are many other views that contain other information such as index (%_INDEXES) and partition (%_DBA_TAB_PARTITIONS) statistics, but the basic computations start with the base table and column statistics.
To end Part 1, I want to leave you with the answer to the question "why should I care?". If you've read this far and are thinking that this doesn't information doesn't really matter to you, then I'd like to suggest you rethink that assertion. DBAs are typically responsible for making sure statistics are collected and developers often think statistics are a subject that doesn't concern them. Remember, statistics are key to the CBO and play a critical role in how well execution plans are formulated. The better the execution plans, the better the performance of your SQL. So, regardless of your job role, understanding statistics should be a part of your working knowledge base. Without this knowledge, how will you begin to understand what's going on with your SQL? This knowledge is more important than you may think and not nearly as difficult to understand as often made out to be. If you haven't quite made the connection, my hope is that you'll get it and buy into the importance of understanding statistics before this series is complete.
Great stuff Karen - looking forward to the next entries!