This
example looks at skewed indexes and was tested on Oracle 10.2.0.3.0.
First create a very basic salary table. Start off with 1 column to store
the gender:
SQL> create table salaries
2 (gender varchar2(1))
3 pctfree 50
4 /
Table created.
SQL>
Insert 10000 rows in the table. Approximately 1% will have a gender of F (for female) and the rest will be M (for male):
SQL> begin
2 for a in 1..10000 loop
3 insert into salaries values
4 (decode(ceil(dbms_random.value*100),1,'F','M'));
5 end loop;
6 end;
7 /
PL/SQL procedure successfully completed.
SQL>
Check the distribution of values is as expected:
SQL> select gender, count(*) from salaries
2 group by gender
3 /
GENDER COUNT(*)
------ ----------
M 9902
F 98
SQL>
Add a salary column to the table:
SQL> alter table salaries add
2 (salary number)
3 /
Table altered.
SQL>
Now input random salaries. Males have a salary between 1 and 20000. Females have a salary between 1 and 40000:
SQL> update salaries
2 set salary = ceil(dbms_random.value*20000)
3 where gender = 'M'
4 /
9902 rows updated.
SQL> update salaries
2 set salary = ceil(dbms_random.value*40000)
3 where gender = 'F'
4 /
98 rows updated.
SQL>
Create an index on the gender column. There are far more rows for males than there are for females so the data and the associated index are said to be skewed:
SQL> create index gender_index
2 on salaries(gender)
3 /
Index created.
SQL>
Now,
without analyzing the data, work out the average salary for each
gender. Oracle does some dynamic sampling. This tells it to do a full
table scan for the males and an index lookup for the females, which is
what you would expect with the data we have created:
SQL> set autotrace on
SQL> select avg(salary) from salaries
2 where gender = 'M'
3 /
AVG(SALARY)
-----------
10030.627
Execution Plan
----------------------------------------------------------
Plan hash value: 793938830
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 15 | 9 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
|* 2 | TABLE ACCESS FULL| SALARIES | 9902 | 145K| 9 (0)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("GENDER"='M')
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
9 recursive calls
0 db block gets
70 consistent gets
5 physical reads
0 redo size
537 bytes sent via SQL*Net to client
491 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> select avg(salary) from salaries
2 where gender = 'F'
3 /
AVG(SALARY)
-----------
18762.898
Execution Plan
----------------------------------------------------------
Plan hash value: 3244963775
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 15 | 2 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 15 | | |
| 2 | TABLE ACCESS BY INDEX ROWID| SALARIES | 98 | 1470 | 2 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | GENDER_INDEX | 98 | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("GENDER"='F')
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
7 recursive calls
0 db block gets
64 consistent gets
0 physical reads
0 redo size
537 bytes sent via SQL*Net to client
491 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> set autotrace off
Now
analyze the table. Note that this creates 4 entries in DBA_HISTOGRAMS. I
assume Oracle uses these to record the spread of the data:
SQL> select count(*) from dba_histograms
2 where owner = 'ORACLE'
3 and table_name = 'SALARIES'
4 /
COUNT(*)
----------
0
SQL> exec dbms_stats.gather_table_stats -
> (ownname=>'oracle', -
> tabname=>'salaries', -
> cascade=>true);
PL/SQL procedure successfully completed.
SQL> select count(*) from dba_histograms
2 where owner = 'ORACLE'
3 and table_name = 'SALARIES'
4 /
COUNT(*)
----------
4
SQL>
Work
out the average salary by gender again. Oracle does not have to do
dynamic sampling this time, presumably because it now has the entries in
DBA_HISTOGRAMS to help. (If you work out the average salary by gender
twice, without running DBMS_STATS in between, Oracle does dynamic
sampling both times – I checked.) I will return to this in future when I
have a better understanding:
SQL> set autotrace on
SQL> select avg(salary) from salaries
2 where gender = 'M'
3 /
AVG(SALARY)
-----------
10030.627
Execution Plan
----------------------------------------------------------
Plan hash value: 793938830
-------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 9 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
|* 2 | TABLE ACCESS FULL| SALARIES | 9902 | 59412 | 9 (0)| 00:00:01 |
-------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("GENDER"='M')
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
31 consistent gets
0 physical reads
0 redo size
537 bytes sent via SQL*Net to client
491 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL> select avg(salary) from salaries
2 where gender = 'F'
3 /
AVG(SALARY)
-----------
18762.898
Execution Plan
----------------------------------------------------------
Plan hash value: 3244963775
---------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 6 | 2 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 6 | | |
| 2 | TABLE ACCESS BY INDEX ROWID| SALARIES | 98 | 588 | 2 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN | GENDER_INDEX | 98 | | 1 (0)| 00:00:01 |
---------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("GENDER"='F')
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
29 consistent gets
0 physical reads
0 redo size
537 bytes sent via SQL*Net to client
491 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
SQL>
Summary:
I also ran this example on Oracle 9.2.0.7.0. After creating the index,
Oracle used it to work out the average salary for both genders. No
dynamic sampling was attempted. After running DBMS_STATS, 4 new entries
appeared in DBA_HISTOGRAMS but Oracle did a full table scan to work out
the average salary for both genders. The behaviour under Oracle
10.2.0.3.0 is a great improvement.