Here’s a little test that should have been the second test I ran when **column groups** were first made available in Oracle. It’s a check on the second of the two basic algorithms that Oracle uses for combining predicates:

- sel(predA **and** predB) = sel(predA) * sel(predB)
- sel(predA **or** predB) = sel(predA) + sel(predB) – sel(predA **and** predB)
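
In Python terms the two formulae look like this (a sketch of the arithmetic only, not Oracle internals; `Fraction` is used so the results stay exact):

```python
from fractions import Fraction

# The optimizer's default arithmetic for combining predicate selectivities,
# which is only valid when the predicates are independent.

def sel_and(sel_a, sel_b):
    """sel(predA and predB) = sel(predA) * sel(predB)"""
    return sel_a * sel_b

def sel_or(sel_a, sel_b):
    """sel(predA or predB) = sel(predA) + sel(predB) - sel(predA and predB)"""
    return sel_a + sel_b - sel_and(sel_a, sel_b)

# Two predicates, each with selectivity 1/100:
s = Fraction(1, 100)
print(sel_and(s, s))    # 1/10000
print(sel_or(s, s))     # 199/10000
```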

As I pointed out many years ago in **Cost Based Oracle – Fundamentals** these are the formulae for combining probabilities of *independent* events. But if the predicates are not independent the formulae make the optimizer produce bad cardinality estimates, and that’s why the Oracle developers introduced column groups – virtual columns that gather statistics about combinations of columns.

So let’s set up an extreme example and test the results (I’ve run it on 21c and 19c, but the issue is the same for all versions with column groups):

```
rem
rem     Script:         column_group_and_or.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Dec 2021
rem
rem     Last tested
rem             21.3.0.0
rem             19.11.0.0
rem

create table t1
as
with generator as (
        select
                rownum id
        from dual
        connect by
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        mod(rownum,100)         n1,
        mod(rownum,100)         n2,
        lpad(rownum,10,'0')     v1,
        lpad('x',100,'x')       padding
from
        generator
;

prompt  =========================================
prompt  Run the script twice,
prompt  first without gathering the column group
prompt  then after gathering the column group
prompt  =========================================

begin
        dbms_stats.gather_table_stats(
                ownname         => null,
                tabname         => 'T1',
                method_opt      => 'for columns (n1, n2) size 1'
        );
end;
/

column column_name format a32

select
        column_name, num_distinct, data_default
from
        user_tab_cols
where
        table_name = 'T1'
/
```

This script gives you a table with 10,000 rows, where the **n1** and **n2** columns always have exactly the same values as each other, and each column has 100 rows for each of 100 distinct values. If you allow the pl/sql block to run, the column stats query will report the following:

```
COLUMN_NAME NUM_DISTINCT DATA_DEFAULT
-------------------------------- ------------ --------------------------------------------------------------------------------
N1 100
N2 100
V1 10000
PADDING 1
SYS_STUBZH0IHA7K$KEBJVXO5LOHAS 100 SYS_OP_COMBINED_HASH("N1","N2")
5 rows selected.
```

After creating the data (and statistics) we can execute the following simple statements:

```
set autotrace traceonly explain
select *
from t1
where
n1 = 50
/
select *
from t1
where
n1 = 50
and n2 = 50
/
select *
from t1
where
n1 = 50
or n2 = 50
/
set autotrace off
```

We know that the first query will return 100 rows, and that the optimizer has enough information to get the correct estimate: the selectivity of *“n1 = 50”* is 1/100, the number of rows in the table is 10,000, so the estimate should be 1/100 * 10,000 = 100.

For the second query: if we don’t create the column group we expect the optimizer to multiply the selectivities of the two predicates together before multiplying by the number of rows in the table to get an estimate of (1/100 * 1/100) * 10,000 = 1.

If we’ve created the column group we expect the optimizer to use the column group selectivity to calculate its estimate (1/100 * 10,000) = 100.
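
A quick check of those two AND estimates (a Python sketch of the arithmetic only; the 1/100 selectivities come from the num_distinct values reported above):

```python
from fractions import Fraction

num_rows = 10_000
sel_n1 = Fraction(1, 100)   # n1 has 100 distinct values
sel_n2 = Fraction(1, 100)   # n2 has 100 distinct values
sel_cg = Fraction(1, 100)   # column group: 100 distinct (n1, n2) combinations

# Without the column group: multiply the individual selectivities.
est_no_cg = sel_n1 * sel_n2 * num_rows
print(est_no_cg)            # 1

# With the column group: use the combined selectivity directly.
est_cg = sel_cg * num_rows
print(est_cg)               # 100
```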

Then we get to the third query: if we don’t create the column group we expect the optimizer to slot 1/100 into the 4 places it has to appear in the formula to get an estimate (1/100 + 1/100 – (1/100 * 1/100)) * 10,000 = 199.

If we have created the column group, the optimizer can use 1/**num_distinct** of the column group for the third term in the formula to get: (1/100 + 1/100 – (1/100)) * 10,000 = 100.
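
The same back-of-envelope check for the two OR estimates (again just a Python sketch of the arithmetic, not anything Oracle-specific):

```python
from fractions import Fraction

num_rows = 10_000
sel_n1 = sel_n2 = Fraction(1, 100)

# Without the column group the AND term inside the OR formula is the
# product of the two individual selectivities:
est_independent = (sel_n1 + sel_n2 - sel_n1 * sel_n2) * num_rows
print(est_independent)      # 199

# If the optimizer used the column group selectivity (1/100) for that
# AND term instead:
sel_cg = Fraction(1, 100)
est_with_cg = (sel_n1 + sel_n2 - sel_cg) * num_rows
print(est_with_cg)          # 100
```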

Here are the three execution plans (with a little cosmetic work) when the column group exists:

```
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 11800 | 26 (4)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 100 | 11800 | 26 (4)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=50)
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 11800 | 26 (4)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 100 | 11800 | 26 (4)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=50 AND "N2"=50)
--------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 199 | 23482 | 27 (8)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| T1 | 199 | 23482 | 27 (8)| 00:00:01 |
--------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("N1"=50 OR "N2"=50)
```

As you can see from the results in the **Rows** column, the optimizer has used the column group statistics for the *AND* estimate, but not for the *OR* estimate, resulting in an estimate that is out by a factor of nearly 2.

It should be noted that failure to use the column group for AND’ed predicates can introduce a massive error in cardinality estimates, but for OR’ed predicates the worst it can be is a factor of 2.
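
That asymmetry is easy to demonstrate with a couple of lines of arithmetic (a sketch only): for two perfectly correlated predicates of selectivity s, the true selectivity of both (A and B) and (A or B) is simply s, so we can compare the independence-assumption estimates against s directly:

```python
from fractions import Fraction

def error_ratios(s):
    """For two perfectly correlated predicates of selectivity s, the true
    selectivity of both (A and B) and (A or B) is s.  Return the error
    factors of the independence-assumption estimates."""
    and_est = s * s             # underestimates by a factor of 1/s (unbounded)
    or_est = 2 * s - s * s      # overestimates by a factor of 2 - s (always < 2)
    return s / and_est, or_est / s

for s in (Fraction(1, 100), Fraction(1, 10_000)):
    print(error_ratios(s))
```

The AND error factor grows without limit as s shrinks (it is 100 for s = 1/100, 10,000 for s = 1/10,000), while the OR error factor creeps towards, but never reaches, 2.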

### Summary

There is a possible enhancement to the optimizer’s use of column group statistics that would make the arithmetic consistent for AND and OR predicates and allow the optimizer to produce better estimates for OR conditions.

Although any error in cardinality estimates can lead to undesirable execution plans, the scale of the error due to the current inconsistency will be at worst a factor of 2, which is unlikely to introduce many surprise plans. The nature of the error means the estimate will be too big, which might push the optimizer into using tablescans and hash joins when indexed accesses and nested loops might be the better choice.