Summary 总结
一、Purpose of Histograms
二、When Oracle Database Creates Histograms
三、How Oracle Database Chooses the Histogram Type
四、Cardinality Algorithms When Using Histograms
- 4.1 Endpoint Numbers and Values
- 4.2 Popular and Nonpopular Values
- 4.3 Bucket Compression
五、Frequency Histograms
- 5.1 Criteria For Frequency Histograms
- 5.2 Generating a Frequency Histogram
六、Top Frequency Histograms
- 6.1 Criteria For Top Frequency Histograms
- 6.2 Generating a Top Frequency Histogram
七、Height-Balanced Histograms (Legacy)
- 7.1 Criteria for Height-Balanced Histograms
- 7.2 Generating a Height-Balanced Histogram
八、Hybrid Histograms
- 8.1 How Endpoint Repeat Counts Work
- 8.2 Criteria for Hybrid Histograms
- 8.3 Generating a Hybrid Histogram
九、References to Other Materials

Summary 总结

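There are two types of histograms. They can be queried through the views dba_histograms / user_histograms and dba_tab_histograms.

Height-based histograms
Each range contains the same number of values; the data distribution is judged from the column value at the endpoint of each range.
Value-based histograms
When the number of distinct values in the column is less than or equal to the number of histogram buckets, a value-based histogram is created.
In this kind of histogram every value in the column has its own bucket, and the data distribution is judged from the number of buckets that correspond to each value.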

【11 Histograms】直方图

A histogram is a special type of column statistic that provides more detailed information about the data distribution in a table column. A histogram sorts values into “buckets,” as you might sort coins into buckets.
直方图是特殊类型的列统计信息，提供了表列中数据分布详细信息。直方图将数据值排序为“桶”，就像将硬币叠成桶的形状。

Based on the NDV¹ and the Distribution Of The Data², the database chooses the type of histogram to create.
Oracle根据 不同值数目(NDV) 和 数据分布情况(Distribution Of The Data) 选择创建直方图的种类。
In some cases, when creating a histogram, the database samples an internally predetermined number of rows（Dynamic Sampling³）.
有时数据库通过 动态采样(Dynamic Sampling) 的来创建直方图。

The types of histograms are as follows:
直方图的类型如下:

Frequency histograms and top frequency histograms
频率直方图和最高频率直方图
Height-Balanced histograms (legacy)
高度平衡直方图(遗留)
Hybrid histograms
混合柱状图

一、Purpose of Histograms

一、直方图的目的

By default the optimizer assumes a uniform distribution of rows across the distinct values in a column.
默认情况下，优化器(Optimizer-CBO)假定行在列中的不同值之间是均匀分布的。
For columns that contain Data Skew⁴ (a nonuniform distribution of data within the column), a histogram enables the optimizer to generate accurate Cardinality⁵ estimates for filter and Join Predicates⁶ that involve these columns.
对于包含 数据倾斜(Data Skew) 的列，直方图使优化器能够为涉及这些列的筛选器和 连接谓词(Join Predicates) 生成准确的 基数(Cardinality) 估计。
For example, a California-based book store ships 95% of the books to California, 4% to Oregon, and 1% to Nevada. The book orders table has 300,000 rows. A table column stores the state to which orders are shipped. A user queries the number of books shipped to Oregon. Without a histogram, the optimizer assumes an even distribution of 300000/3 (the NDV is 3), estimating cardinality at 100,000 rows. With this estimate, the optimizer chooses a Full Table Scan⁷ . With a histogram, the optimizer calculates that 4% of the books are shipped to Oregon, and chooses an Index Scan⁸ .
例如，一家位于加州的书店将95%的图书运往加州，4%运往俄勒冈，1%运往内华达州。book orders表有30万行。表列存储订单被发送到的状态。用户查询运送到俄勒冈州的图书数量。如果没有直方图，优化器假定平均分布为300000/3 (NDV为3)，估计基数为100,000行。根据这个估计，优化器选择 全表扫描(Full Table Scan) 。通过直方图(Histograms)，优化器(Optimizer - CBO)计算出4%的图书被运送到俄勒冈州，并选择 索引扫描(Index Scan) 。
(As a general rule of thumb, an Index Scan suits aggregate functions such as count, max, and min, or queries that return less than about 10% of the table's rows.)
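
The arithmetic in the book store example can be sketched in a few lines of Python. This only illustrates the two estimates and is not how the optimizer is implemented; the variable names are invented for this sketch, and the percentages are the ones given in the example.

```python
# Illustrative only: compares the uniform-distribution estimate with a
# histogram-based estimate for the book store example above.
total_rows = 300_000
shipping_share = {"California": 0.95, "Oregon": 0.04, "Nevada": 0.01}

ndv = len(shipping_share)  # number of distinct values in the state column

# Without a histogram, every distinct value is assumed to be equally common.
uniform_estimate = total_rows / ndv

# With a histogram, the estimate follows the observed share of each value.
histogram_estimate = {
    state: round(total_rows * share) for state, share in shipping_share.items()
}

print(f"uniform estimate per state: {uniform_estimate:.0f}")
print(f"histogram estimate for Oregon: {histogram_estimate['Oregon']}")
```

The gap between the two Oregon estimates (100,000 rows versus 12,000 rows) is what moves the plan choice from a full table scan to an index scan in the example.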

二、When Oracle Database Creates Histograms

二、Oracle何时创建直方图

If DBMS_STATS gathers statistics for a table, and if queries have referenced the columns in this table, then Oracle Database creates histograms automatically as needed according to the previous query workload.
如果使用 DBMS_STATS 收集过一个表的统计信息。在之后查询中引用了该表中的列，那么Oracle数据库会根据前面的查询工作负载自动创建直方图。
The basic process is as follows:
基本流程如下:
1.You run DBMS_STATS for a table with the METHOD_OPT parameter set to the default SIZE AUTO.
1. 可以对 METHOD_OPT 参数设置为默认 SIZE AUTO 的表运行 DBMS_STATS 。

See: Oracle statistics gathering, method 1: SYS.DBMS_STATS

2.A user queries the table.
2. 用户查询表。

3.The database notes the predicates in the preceding query and updates the data dictionary table SYS.COL_USAGE$.
3. 数据库记录前面查询中的 谓词(Predicates)，并更新数据字典基表 SYS.COL_USAGE$。

4.You run DBMS_STATS again, causing DBMS_STATS to query SYS.COL_USAGE$ to determine which columns require histograms based on the previous query workload.
4. 再次运行 DBMS_STATS，导致 DBMS_STATS 查询 SYS.COL_USAGE$ ，进而根据前面的查询工作负载确定哪些列需要直方图。
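
The four steps can be pictured with a toy model. The sketch below is not Oracle code: `col_usage`, `run_query`, and `gather_stats` are hypothetical names, and the explicit skew check is an assumption added so the toy does not create a histogram for every referenced column. It only shows why a histogram appears after the second statistics run and not after the first.

```python
# Toy model of steps 1-4. The real DBMS_STATS logic is internal to Oracle;
# every name below is invented for illustration.
col_usage = set()   # stands in for SYS.COL_USAGE$
histograms = set()  # columns that currently have a histogram


def run_query(predicate_columns):
    """Steps 2-3: a query runs and its predicate columns are recorded."""
    col_usage.update(predicate_columns)


def gather_stats(skewed_columns):
    """Steps 1 and 4: only columns recorded in col_usage are histogram
    candidates. The skew check is an assumption of this toy model."""
    histograms.update(col_usage & set(skewed_columns))
    return sorted(histograms)


first = gather_stats(skewed_columns={"PROD_ID"})   # no queries recorded yet
run_query({"PROD_ID"})                             # e.g. WHERE prod_id = 42
second = gather_stats(skewed_columns={"PROD_ID"})  # usage is now recorded
```

In this model `first` is empty and `second` contains `PROD_ID`, which mirrors the point that statistics gathered before any query cannot produce workload-driven histograms.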

Consequences of the AUTO feature include the following:
METHOD_OPT 参数设置为默认 SIZE AUTO 的后果（Consequences一般指不好的后果）:

As queries change over time, DBMS_STATS may change which statistics it gathers. For example, even if the data in a table does not change, queries and DBMS_STATS operations can cause the plans for queries that reference these tables to change.
随着查询的变化， DBMS_STATS 可能会改变它收集的统计信息。例如，即使表中的数据没有更改，查询和 DBMS_STATS 操作也会导致引用这些表的查询计划发生更改。
If you gather statistics for a table and do not query the table, then the database does not create histograms for columns in this table. For the database to create the histograms automatically, you must run one or more queries to populate the column usage information in SYS.COL_USAGE$.
如果收集表的统计信息而不查询表，则数据库不会为该表中的列创建直方图。为了让数据库自动创建直方图，您必须运行一个或多个查询来填充 SYS.COL_USAGE$ 表中的 => 列使用信息。

Example 11-1 Automatic Histogram Creation
案例 11-1 自动柱状图创建
Assume that sh.sh_ext is an External Table⁹ that contains the same rows as the sh.sales table. You create new table sales2 and perform a Bulk Load¹⁰ using sh_ext as a source, which automatically creates statistics for sales2 (see “Online Statistics Gathering for Bulk Loads”). You also create indexes as follows:
假设 sh.sh_ext 是一个 外部表(External Table)，与 sh.sales 表数据相同。可以使用 sh_ext 作为数据源，执行批量加载(Bulk Load)创建新的表 sales2，将自动对 sales2 收集统计信息。

SQL> CREATE TABLE sales2 AS SELECT * FROM sh_ext;
SQL> CREATE INDEX sh_12c_idx1 ON sales2(prod_id);
SQL> CREATE INDEX sh_12c_idx2 ON sales2(cust_id,time_id);

You query the data dictionary to determine whether histograms exist for the sales2 columns. Because sales2 has not yet been queried, the database has not yet created histograms:
通过查询数据字典，来确定是否存在 sales2 列的直方图。因为 sales2 还没有被查询，所以数据库还没有创建直方图。

SQL> SELECT COLUMN_NAME, NOTES, HISTOGRAM
     FROM   USER_TAB_COL_STATISTICS
     WHERE  TABLE_NAME = 'SALES2';

COLUMN_NAME   NOTES          HISTOGRAM
------------- -------------- ---------
AMOUNT_SOLD   STATS_ON_LOAD  NONE
QUANTITY_SOLD STATS_ON_LOAD  NONE
PROMO_ID      STATS_ON_LOAD  NONE
CHANNEL_ID    STATS_ON_LOAD  NONE
TIME_ID       STATS_ON_LOAD  NONE
CUST_ID       STATS_ON_LOAD  NONE
PROD_ID       STATS_ON_LOAD  NONE

You query sales2 for the number of rows for product 42, and then gather table statistics using the GATHER AUTO option:
在 sales2 中查询 prod_id = 42 的数据量，然后使用 OPTIONS => GATHER AUTO收集表统计信息:

SQL> SELECT COUNT(*) FROM sales2 WHERE prod_id = 42;

  COUNT(*)
----------
     12116

SQL> EXEC DBMS_STATS.GATHER_TABLE_STATS(USER,'SALES2',OPTIONS=>'GATHER AUTO');

A query of the data dictionary now shows that the database created a histogram on the prod_id column based on the information gathered during the preceding query:
再次查询数据字典，可以看到，通过刚才的查询sql，收集信息并创建了关于prod_id列的直方图

SQL> SELECT COLUMN_NAME, NOTES, HISTOGRAM
     FROM   USER_TAB_COL_STATISTICS
     WHERE  TABLE_NAME = 'SALES2';

COLUMN_NAME   NOTES          HISTOGRAM
------------- -------------- ---------
AMOUNT_SOLD   STATS_ON_LOAD  NONE
QUANTITY_SOLD STATS_ON_LOAD  NONE
PROMO_ID      STATS_ON_LOAD  NONE
CHANNEL_ID    STATS_ON_LOAD  NONE
TIME_ID       STATS_ON_LOAD  NONE
CUST_ID       STATS_ON_LOAD  NONE
PROD_ID       HISTOGRAM_ONLY FREQUENCY  # histogram on the prod_id column

三、How Oracle Database Chooses the Histogram Type

三、如何选择直方图类型

Oracle Database uses several criteria to determine which histogram to create: frequency, top frequency, height-balanced, or hybrid.
Oracle数据库创建直方图共有以下四种标准:1.频率(frequency)、2.最高频率(top frequency)、3.高度平衡(height-balanced)、4.混合(hybrid)。

The histogram formula uses the following variables:
通过以下变量计算:

NDV
This represents the number of distinct values in a column. For example, if a column only contains the values 100, 200, and 300, then the NDV for this column is 3.
这表示一列中 不同值的数目(NDV)。例如，如果某列只包含值100、200和300，则该列的NDV为3。（Oracle通过 NDV 值和总行数判断 Cardinality => 基数。）
n
This variable represents the number of histogram buckets. The default is 254.
该变量表示直方图桶的个数。默认值是254。
p
This variable represents an internal percentage threshold that is equal to (1–(1/n)) * 100. For example, if n = 254, then p is 99.6.
该变量表示内部百分比阈值，等于(1 - (1/n)) * 100。例如，如果n = 254，则p = 99.6。

An additional criterion is whether the estimate_percent parameter in the DBMS_STATS statistics gathering procedure is set to AUTO_SAMPLE_SIZE (default).
另一个条件是 DBMS_STATS 统计数据收集过程中，采样比例(estimate_percent) 参数是否设置为 抽样调查(AUTO_SAMPLE_SIZE)。

The following diagram shows the decision tree for histogram creation.
创建直方图的决策树(见下图)。

Figure 11-1 Decision Tree for Histogram Creation
图 11-1 创建直方图的决策树

Description of “Figure 11-1 Decision Tree for Histogram Creation”

This graphic depicts a flow chart for the creation of a histogram type. On the left is a circle containing a question mark. It points right toward a diamond that contains “NDV>n.” This points down to a circle labeled “Frequency Histogram.” The arrow is labeled “No.” The diamond points right, with arrow labeled “Yes,” to a diamond that contains “ESTIMATE_PERCENT=AUTO_SAMPLE_SIZE.” This diamond points down, with arrow labeled “No,” to a circle labeled “Height-Balanced Histogram.” The diamond points right, with arrow labeled “Yes,” to a diamond labeled “Percentage of rows for top n frequent values >= p.” The diamond points down, with arrow labeled “No,” to a circle labeled “Hybrid Histogram.” The diamond points right, with arrow labeled “Yes,” to a circle labeled “Top n Frequency Histogram.” The legend says: “NDV = Number of distinct values; n = Number of histogram buckets (default is 254)”; “p = (1–(1/n))*100.”
此图为创建直方图类型的流程图。左边是一个包含问号的圆。
1.指向一个写着 NDV > n 的菱形。向下箭头上写着 不是(No)，指向一个标有 频率直方图(Frequency Histogram) 的圆圈。向右箭头上写着 是(Yes)
2.指向一个写着 ESTIMATE_PERCENT=AUTO_SAMPLE_SIZE 的菱形。向下箭头上写着 不是(No)，指向一个写着 高度平衡直方图(Height-Balanced Histogram) 的圆圈。向右箭头上写着 是(Yes)
3.指向一个标记为 Percentage of rows for top n frequent values >= p的菱形，向下箭头上写着 不是(No)，指向一个标有 混合直方图(Hybrid Histogram) 的圆圈。向右箭头上写着 是(Yes)
4.指向一个写着 Top n Frequency Histogram 的圆圈。
注释1: NDV = 不同值的数目;
注释2: n = 直方图桶数(默认为254);
注释3: p = (1 - (1 / n)) * 100。
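
The decision tree in Figure 11-1 can be written down as a small function. This is a sketch of the flow chart as described above, not Oracle's internal code; the function and parameter names are made up for illustration, and the returned strings are only labels.

```python
def threshold_p(n):
    """Internal percentage threshold p = (1 - (1/n)) * 100."""
    return (1 - (1 / n)) * 100


def choose_histogram_type(ndv, n=254, auto_sample_size=True, top_n_pct=0.0):
    """Walk the decision tree of Figure 11-1.

    ndv              -- number of distinct values in the column
    n                -- number of requested histogram buckets
    auto_sample_size -- True if estimate_percent is AUTO_SAMPLE_SIZE
    top_n_pct        -- percentage of rows covered by the top n values
    """
    if ndv <= n:
        return "FREQUENCY"
    if not auto_sample_size:
        return "HEIGHT BALANCED"
    if top_n_pct >= threshold_p(n):
        return "TOP-FREQUENCY"
    return "HYBRID"


# 8 distinct values with the default 254 buckets.
default_case = choose_histogram_type(ndv=8)
# 8 distinct values, 7 buckets, top 7 values cover 22 of 23 rows.
seven_buckets = choose_histogram_type(ndv=8, n=7, top_n_pct=22 / 23 * 100)
```

With the default of 254 buckets the threshold is about 99.6, so a top frequency histogram requires the top 254 values to cover nearly all rows; with 7 buckets the threshold drops to about 85.7.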

四、Cardinality Algorithms When Using Histograms

四、使用直方图时的基数算法

For histograms, the algorithm for cardinality depends on factors such as the Endpoint Numbers¹¹ and values, and whether column values are popular or nonpopular.
对于直方图，基数算法取决于 端点数(Endpoint Numbers) 和值等因素，以及列值是流行(popular)还是 不流行(nonpopular)。

4.1 Endpoint Numbers and Values

4.1 结点数和值

An endpoint number is a number that uniquely identifies a bucket. In frequency and hybrid histograms, the endpoint number is the cumulative frequency of all values included in the current and previous buckets. For example, a bucket with endpoint number 100 means the total frequency of values in the current and all previous buckets is 100. In height-balanced histograms, the optimizer numbers buckets sequentially, starting at 0 or 1. In all cases, the endpoint number is the bucket number.
端点数(Endpoint Numbers)是桶的id(唯一标识数字)。在频率直方图(Frequency Histograms)和混合直方图(Hybrid Histograms)中，端点数(Endpoint Numbers)是当前和以前的桶中包含的所有值的累积频率。例如，端点数值为100的桶表示当前桶和所有以前桶中值的总频率为100。在高度平衡直方图(Height-balanced Histograms)中，优化器按顺序为桶编号，从0或1开始。在所有情况下，端点数都是桶号。

An endpoint value is the highest value in the range of values in a bucket. For example, if a bucket contains only the values 52794 and 52795, then the endpoint value is 52795.
端点值(Endpoint Value)是桶中值范围内的最大值。例如，如果一个桶只包含值52794和52795，则端点值为52795。
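
To make the cumulative-frequency idea concrete, the sketch below derives endpoint numbers from per-value row counts. The counts are the ones for sh.countries.country_subregion_id shown in section 5.2; the function name is hypothetical and the code only mimics the bookkeeping, not how the database gathers statistics.

```python
from itertools import accumulate

# Row counts per value of country_subregion_id, as listed in section 5.2.
value_counts = {
    52792: 1, 52793: 5, 52794: 2, 52795: 1,
    52796: 1, 52797: 2, 52798: 2, 52799: 9,
}


def frequency_endpoints(counts):
    """Return (endpoint_number, endpoint_value) pairs, where the endpoint
    number is the cumulative frequency up to and including that value."""
    values = sorted(counts)
    running_totals = accumulate(counts[v] for v in values)
    return list(zip(running_totals, values))


endpoints = frequency_endpoints(value_counts)
for number, value in endpoints:
    print(f"{number:>15} {value:>14}")
```

The last endpoint number equals the total row count (23), because every row falls into the current or an earlier bucket.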

4.2 Popular and Nonpopular Values

4.2 流行(值)和不流行(值)

The popularity of a value in a histogram affects the cardinality estimate algorithm as follows:
直方图中某个值的流行程度对基数估计算法:

Popular values
流行值
A popular value occurs as an endpoint value of multiple buckets. The optimizer determines whether a value is popular by first checking whether it is the endpoint value for a bucket. If so, then for frequency histograms, the optimizer subtracts the endpoint number of the previous bucket from the endpoint number of the current bucket. Hybrid histograms already store this information for each endpoint individually. If this value is greater than 1, then the value is popular.
流行值表示：某个值作为多个桶的端点值(Endpoint Value)出现。首先，优化器(Optimizer)检查某个值是否是桶的端点值(Endpoint Value)，用来确认该值是流行值(Popular Value)。如果是，则对于频率直方图(Frequency Histograms)，优化器(Optimizer)将从当前桶的端点数(Endpoint Numbers)减去前一个桶的端点数(Endpoint Numbers)。混合直方图(Hybrid Histograms)已经分别存储了每个端点的信息。如果该值大于1，则该值是受欢迎的。

The optimizer calculates its cardinality estimate for popular values using the following formula:
优化器使用以下公式对流行值(Popular Value)的进行基数估计值计算：
```
-- 流行值的基数 = 表的数据量 * (由该值生成的端点数目 / 端点总数)
cardinality of popular value =
  (num of rows in table) *
  (num of endpoints spanned by this value / total num of endpoints)
```
Nonpopular values
不流行值
Any value that is not popular is a nonpopular value. The optimizer calculates the cardinality estimates for nonpopular values using the following formula:
任何不受欢迎的值都是不流行值(Nonpopular Value)。优化器使用以下公式对不流行值(Nonpopular Value)的进行基数估计值计算：
```
-- 不流行值的基数 = 表的数据量 * 密度
cardinality of nonpopular value =
  (num of rows in table) * density
```
The optimizer calculates density using an internal algorithm based on factors such as the number of buckets and the NDV. Density is expressed as a decimal number between 0 and 1. Values close to 1 indicate that the optimizer expects many rows to be returned by a query referencing this column in its predicate list. Values close to 0 indicate that the optimizer expects few rows to be returned.
优化器根据桶的数量和不同值的数量(NDV)等因素使用内部算法计算密度。密度(Density)表示小数∈[0,1]。密度越接近1，则表示优化器希望查询在其谓词列表中引用此列时返回更多的行。密度越接近0，则表示优化器希望返回很少的行。
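
The two formulas translate directly into code. In the sketch below the function names are invented, and the density passed to the nonpopular formula is a made-up input: the text only says that density comes from an internal algorithm, so no attempt is made to compute it here.

```python
import math


def popular_cardinality(num_rows, endpoints_spanned, total_endpoints):
    """rows * (endpoints spanned by the value / total endpoints)."""
    return num_rows * (endpoints_spanned / total_endpoints)


def nonpopular_cardinality(num_rows, density):
    """rows * density, where density is a decimal between 0 and 1."""
    if not 0 <= density <= 1:
        raise ValueError("density must be between 0 and 1")
    return num_rows * density


# A value spanning 5 of 23 endpoints in a 23-row table.
popular_estimate = popular_cardinality(23, 5, 23)

# Hypothetical density of 0.01 on a 1000-row table (not from the source).
nonpopular_estimate = nonpopular_cardinality(1000, 0.01)

print(math.isclose(popular_estimate, 5.0), math.isclose(nonpopular_estimate, 10.0))
```

Results are compared with `math.isclose` because the division can introduce small floating-point rounding.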

✎ See Also：
参考：
Oracle Database Reference to learn about the DBA_TAB_COL_STATISTICS.DENSITY column
字段统计信息(DBA_TAB_COL_STATISTICS)表中的密度(Density) 字段。

4.3 Bucket Compression

4.3 桶压缩
In some cases, to reduce the total number of buckets, the optimizer compresses multiple buckets into a single bucket. For example, the following frequency histogram indicates that the first bucket number is 1 and the last bucket number is 23:
在某些情况下，为了减少桶的总数，优化器会将多个桶压缩(Compression)成一个桶。例如，根据频率直方图，第一个桶号为1，最后一个桶号为23。

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792
              6          52793
              8          52794
              9          52795
             10          52796
             12          52797
             14          52798
             23          52799

Several buckets are “missing.” Originally, buckets 2 through 6 each contained a single instance of value 52793. The optimizer compressed all of these buckets into the bucket with the highest endpoint number (bucket 6), which now contains 5 instances of value 52793. This value is popular because the difference between the endpoint number of the current bucket (6) and the previous bucket (1) is 5. Thus, before compression the value 52793 was the endpoint for 5 buckets.
有几个桶“不见了”。最初，桶2到6每个都包含一个值为52793的实例。优化器将所有这些桶压缩到端点号最高的桶(桶6)中，桶现在包含5个值为52793的实例。这个值（52793）很流行，因为当前桶(6)和前一个桶(1)的端点号之差是5。因此，在压缩之前，值52793是5个桶的端点(Endpoint)。
The following annotations show which buckets are compressed, and which values are popular:
下面的注释显示了哪些桶被压缩(Compression)了，哪些值是受欢迎的:

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
-- 不流行
              1          52792 -> nonpopular
-- 流行，桶2-6被压缩到6
              6          52793 -> buckets 2-6 compressed into 6; popular
-- 流行，桶7-8被压缩到8
              8          52794 -> buckets 7-8 compressed into 8; popular
-- 不流行
              9          52795 -> nonpopular
-- 不流行
             10          52796 -> nonpopular
-- 流行，桶11-12被压缩到12
             12          52797 -> buckets 11-12 compressed into 12; popular
-- 流行，桶13-14被压缩到14
             14          52798 -> buckets 13-14 compressed into 14; popular
-- 流行，桶15-23被压缩到23
             23          52799 -> buckets 15-23 compressed into 23; popular
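
The annotations can be reproduced by differencing consecutive endpoint numbers. The sketch below takes the compressed histogram rows as input and recovers, for each value, how many original buckets it spanned and whether it counts as popular under the "difference greater than 1" rule from section 4.2. The function name is hypothetical.

```python
# (endpoint_number, endpoint_value) pairs from the compressed histogram above.
endpoints = [
    (1, 52792), (6, 52793), (8, 52794), (9, 52795),
    (10, 52796), (12, 52797), (14, 52798), (23, 52799),
]


def classify(pairs):
    """Map each endpoint value to (frequency, 'popular' | 'nonpopular').

    The frequency is the current endpoint number minus the previous one;
    a difference greater than 1 marks the value as popular."""
    result = {}
    previous_number = 0
    for number, value in pairs:
        frequency = number - previous_number
        label = "popular" if frequency > 1 else "nonpopular"
        result[value] = (frequency, label)
        previous_number = number
    return result


classified = classify(endpoints)
for value, (frequency, label) in classified.items():
    print(value, frequency, label)
```

For 52793 the difference is 6 - 1 = 5, matching the five buckets (2 through 6) that were compressed into bucket 6.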

五、Frequency Histograms

五、频率直方图

In a frequency histogram, each distinct column value corresponds to a single bucket of the histogram. Because each value has its own dedicated bucket, some buckets may have many values, whereas others have few.
在频率直方图(Frequency Histograms)中，每个不同的列值对应于直方图的单个桶。因为每个值都有自己的专用桶，所以有些桶可能有很多值，而其他桶可能只有很少的值。
An analogy to a frequency histogram is sorting coins so that each individual coin initially gets its own bucket. For example, the first penny is in bucket 1, the second penny is in bucket 2, the first nickel is in bucket 3, and so on. You then consolidate all the pennies into a single penny bucket, all the nickels into a single nickel bucket, and so on with the remainder of the coins.
用以下案例来类比频率直方图【硬币排序】：使每个硬币都有自己的桶。例如，有A,B,C,D…各种硬币，将其全部放入桶中。第一枚A币在1号桶里，第二枚A币在2号桶里，第一枚B币在3号桶里，以此类推。然后将所有的A币合并到一个A桶中，将所有的B币合并到一个B桶中，以此类推，剩下的硬币也是如此。

5.1 Criteria For Frequency Histograms

5.1 频率直方图的标准

Frequency histograms depend on the number of requested histogram buckets.
频率直方图取决于对应直方图桶的数量。

As shown in the logic diagram in “How Oracle Database Chooses the Histogram Type”, the database creates a frequency histogram when the following criteria are met:
在【三、如何选择直方图类型】的逻辑图中，当满足以下条件时，数据库会创建一个频率直方图:

注释1: NDV = 不同值的数目;
注释2: n = 直方图桶数(默认为254);
注释3: p = (1 - (1 / n)) * 100。

NDV is less than or equal to n, where n is the number of histogram buckets (default 254).
不同值的数量(NDV)小于等于n (n为直方图桶个数，默认为254)。
For example, the sh.countries.country_subregion_id column has 8 distinct values, ranging sequentially from 52792 to 52799. If n is the default of 254, then the optimizer creates a frequency histogram because 8 <= 254.
例如，sh.countries.country_subregion_id列有8个不同的值，顺序从52792到52799。如果n是默认值254，那么优化器会创建一个频率直方图，因为8 <= 254。
The estimate_percent parameter in the DBMS_STATS statistics gathering procedure is set to either a user-specified value or to AUTO_SAMPLE_SIZE.
在DBMS_STATS统计信息收集过程中，estimate_percent参数被设置为用户指定的值或AUTO_SAMPLE_SIZE。

Starting in Oracle Database 12c, if the sampling size is the default of AUTO_SAMPLE_SIZE, then the database creates frequency histograms from a full table scan. For all other sampling percentage specifications, the database derives frequency histograms from a sample. In releases earlier than Oracle Database 12c, the database gathered histograms based on a small sample, which meant that low-frequency values often did not appear in the sample. Using density in this case sometimes led the optimizer to overestimate selectivity.
从Oracle数据库12c开始，如果采样大小是默认的AUTO_SAMPLE_SIZE，那么数据库从全表扫描创建频率直方图。对于所有其他抽样百分比规格，数据库从抽样中获得频率直方图。在Oracle Database 12c之前的版本中，数据库基于一个小样本收集直方图，这意味着低频值通常不会出现在样本中。在这种情况下使用密度有时会导致优化器高估选择性。

✎ See Also：
参考：
Oracle Database PL/SQL Packages and Types Reference to learn about AUTO_SAMPLE_SIZE
查看 => 统计信息方式1_SYS.DBMS_STATS 了解关于自动采样(AUTO_SAMPLE_SIZE) 的内容。

5.2 Generating a Frequency Histogram

5.2 生成一个频率直方图

This scenario shows how to generate a frequency histogram using the sample schemas.
以下场景展示了如何使用示例模式生成频率直方图。

Assumptions
假设

This scenario assumes that you want to generate a frequency histogram on the sh.countries.country_subregion_id column. This table has 23 rows.
假设你想生成sh.countries.country_subregion_id列的频率直方图。该表数据量为23。

The following query shows that the country_subregion_id column contains 8 distinct values (sample output included) that are unevenly distributed:
下面的查询显示country_subregion_id列包含8个分布不均匀的不同值:

SELECT country_subregion_id, count(*)
FROM   sh.countries
GROUP BY country_subregion_id
ORDER BY 1;

COUNTRY_SUBREGION_ID   COUNT(*)
-------------------- ----------
               52792          1
               52793          5
               52794          2
               52795          1
               52796          1
               52797          2
               52798          2
               52799          9

To generate a frequency histogram:
生成频率直方图

1.Gather statistics for sh.countries and the country_subregion_id column, letting the number of buckets default to 254.
1.收集sh.countries和country_subregion_id列的统计信息，让桶的数量默认为254。
For example, execute the following PL/SQL anonymous block:
```
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
    ownname    => 'SH'
,   tabname    => 'COUNTRIES'
,   method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID'
);
END;
```

2.Query the histogram information for the country_subregion_id column.
2.查询“country_subregion_id”列的直方图信息。
For example, use the following query (sample output included):
例如，使用以下查询:

SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM   USER_TAB_COL_STATISTICS
WHERE  TABLE_NAME='COUNTRIES'
AND    COLUMN_NAME='COUNTRY_SUBREGION_ID';

TABLE_NAME COLUMN_NAME          NUM_DISTINCT HISTOGRAM
---------- -------------------- ------------ ---------------
COUNTRIES  COUNTRY_SUBREGION_ID            8 FREQUENCY

The optimizer chooses a frequency histogram because n or fewer distinct values exist in the column, where n defaults to 254.
优化器选择频率直方图，因为该列中存在n或更少的不同值，其中n默认为254。

3.Query the endpoint number and endpoint value for the country_subregion_id column.
查询country_subregion_id列的端点编号和端点值。
For example, use the following query (sample output included):
例如，使用以下查询:

SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE
FROM   USER_HISTOGRAMS
WHERE  TABLE_NAME='COUNTRIES'
AND    COLUMN_NAME='COUNTRY_SUBREGION_ID';

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792
              6          52793
              8          52794
              9          52795
             10          52796
             12          52797
             14          52798
             23          52799

Figure 11-2 is a graphical illustration of the 8 buckets in the histogram. Each value is represented as a coin that is dropped into a bucket.
图11-2是直方图中8个桶的图示。每个值都表示为投到桶中的硬币。

Figure 11-2 Frequency Histogram
图 11-2 频率直方图

Description of “Figure 11-2 Frequency Histogram”
“图 11-2 频率直方图”说明。
As shown in Figure 11-2, each distinct value has its own bucket. Because this is a frequency histogram, the endpoint number is the cumulative frequency of endpoints. For 52793, the endpoint number 6 indicates that the value appears 5 times (6 - 1). For 52794, the endpoint number 8 indicates that the value appears 2 times (8 - 6).
如图11-2所示，每个不同的值都有自己的桶。因为这是一个频率直方图，所以端点数就是端点的累积频率。对于52793，端点编号6表示该值出现5次(6 - 1)。对于52794，端点编号8表示该值出现2次(8 - 6)。
Every bucket whose endpoint is at least 2 greater than the previous endpoint contains a popular value. Thus, buckets 6, 8, 12, 14, and 23 contain popular values. The optimizer calculates their cardinality based on endpoint numbers. For example, the optimizer calculates the cardinality (C) of value 52799 using the following formula, where the number of rows in the table is 23:
每个端点至少比前一个端点大2的桶包含一个流行值。因此，桶6、8、12、14和23包含流行的值。优化器根据端点编号计算它们的基数。例如，优化器使用以下公式计算值52799的基数 (C ) ，其中表中的行数为23:

-- 流行值的基数 = 表的数据量 * (由该值生成的端点数目 / 端点总数)
cardinality of popular value =
  (num of rows in table) *
  (num of endpoints spanned by this value / total num of endpoints)

C = 23 * ( 9 / 23 )

Buckets 1, 9, and 10 contain nonpopular values. The optimizer estimates their cardinality based on density.
桶1,9和10包含不受欢迎的值。优化器根据密度估计它们的基数。

✎ See Also：
参考：
Oracle Database PL/SQL Packages and Types Reference to learn about the DBMS_STATS.GATHER_TABLE_STATS procedure
查看 => Oracle数据字典 142-45 SYS.DBMS_STATS.gather_table_stats 了解关于收集表统计信息(GATHER_TABLE_STATS) 的内容。

Oracle Database Reference to learn about the USER_TAB_COL_STATISTICS view
查看 Oracle数据库引用了解关于 USER_TAB_COL_STATISTICS 的内容。

Oracle Database Reference to learn about the USER_HISTOGRAMS view
查看 Oracle数据库引用了解关于 USER_HISTOGRAMS 的内容。

六、Top Frequency Histograms

六、高频直方图（12c新特性）

A top frequency histogram is a variation on a frequency histogram that ignores nonpopular values that are statistically insignificant. For example, if a pile of 1000 coins contains only a single penny, then you can ignore the penny when sorting the coins into buckets. A top frequency histogram can produce a better histogram for highly popular values.
高频直方图(Top Frequency Histograms)是频率直方图的一种变体，它忽略了统计上不重要的不流行值(Nonpopular Values)。例如，如果一堆1000枚硬币中只有一枚硬币，那么在将硬币分类成桶时可以忽略这枚硬币。高频直方图(Top Frequency Histograms)可以为流行值(Popular Values)生成更好的直方图。

6.1 Criteria For Top Frequency Histograms

6.1 高频直方图的标准

If a small number of values occupies most of the rows, then creating a frequency histogram on this small set of values is useful even when the NDV is greater than the number of requested histogram buckets. To create a better quality histogram for popular values, the optimizer ignores the nonpopular values and creates a top frequency histogram.
如果少量的值占据了大部分的行，那么在这一小组值上创建一个频率直方图是有用的，即使当NDV大于请求的直方图桶的数量。为了为流行值创建一个质量更好的直方图，优化器忽略不流行的值，并创建一个高频直方图(Top Frequency Histograms)。
As shown in the logic diagram in “How Oracle Database Chooses the Histogram Type”, the database creates a top frequency histogram when the following criteria are met:
如“11-1”的逻辑图所示，当满足以下条件时，数据库会创建一个高频直方图(Top Frequency Histograms):

注释1: NDV = 不同值的数目;
注释2: n = 直方图桶数(默认为254);
注释3: p = (1 - (1 / n)) * 100。

NDV is greater than n, where n is the number of histogram buckets (default 254).
不同值的数量(NDV)比n大，n为桶数（默认254）
The percentage of rows occupied by the top n frequent values is equal to or greater than threshold p, where p is (1-(1/n))*100.
前n个高频值所占行数的百分比 ≥ p，其中p = (1 - (1 / n)) * 100
The estimate_percent parameter in the DBMS_STATS statistics gathering procedure is set to AUTO_SAMPLE_SIZE.
DBMS_STATS统计信息收集过程中的estimate_percent参数被设置为AUTO_SAMPLE_SIZE。

✎ See Also：
参考：
Oracle Database PL/SQL Packages and Types Reference to learn about AUTO_SAMPLE_SIZE
查看 => Oracle数据字典 142-45 SYS.DBMS_STATS.gather_table_stats 了解关于收集表统计信息(GATHER_TABLE_STATS) 的内容。

6.2 Generating a Top Frequency Histogram

6.2 生成高频直方图

This scenario shows how to generate a top frequency histogram using the sample schemas.
以下场景展示了如何使用示例模式生成高频直方图。

Assumptions
假设

This scenario assumes that you want to generate a top frequency histogram on the sh.countries.country_subregion_id column. This table has 23 rows.
假设你想生成sh.countries.country_subregion_id列的高频直方图。该表数据量为23。

The following query shows that the country_subregion_id column contains 8 distinct values (sample output included) that are unevenly distributed:
下面的查询显示 country_subregion_id 列包含8个分布不均匀的不同值。

SELECT country_subregion_id, count(*)
FROM   sh.countries
GROUP BY country_subregion_id
ORDER BY 1;

COUNTRY_SUBREGION_ID   COUNT(*)
-------------------- ----------
               52792          1
               52793          5
               52794          2
               52795          1
               52796          1
               52797          2
               52798          2
               52799          9

To generate a top frequency histogram:
生成一个高频直方图

1.Gather statistics for sh.countries and the country_subregion_id column, specifying fewer buckets than distinct values.
1.收集sh.countries和country_subregion_id列的统计信息，指定比不同值(NDV)更少的桶。
For example, enter the following command to specify 7 buckets:
例如，输入如下命令指定7个桶:

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS (
    ownname    => 'SH'
,   tabname    => 'COUNTRIES'
,   method_opt => 'FOR COLUMNS COUNTRY_SUBREGION_ID SIZE 7'
);
END;

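2.Query the histogram information for the country_subregion_id column.
2.查询“country_subregion_id”列的直方图信息。

SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM   USER_TAB_COL_STATISTICS
WHERE  TABLE_NAME='COUNTRIES'
AND    COLUMN_NAME='COUNTRY_SUBREGION_ID';

TABLE_NAME COLUMN_NAME          NUM_DISTINCT HISTOGRAM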
---------- -------------------- ------------ ---------------
COUNTRIES  COUNTRY_SUBREGION_ID            7 TOP-FREQUENCY

The sh.countries.country_subregion_id column contains 8 distinct values, but the histogram only contains 7 buckets, making n=7. In this case, the database can only create a top frequency or hybrid histogram. In the country_subregion_id column, the top 7 most frequent values occupy 95.6% of the rows, which exceeds the threshold of 85.7%, generating a top frequency histogram (see “Criteria For Top Frequency Histograms”).
sh.countries.country_subregion_id 列包含8个不同的值，但直方图只包含7个桶，因此n=7。在这种情况下，数据库只能创建高频直方图(Top Frequency Histograms)或混合直方图(Hybrid Histograms)。在country_subregion_id列中，出现频率最高的前7个值占行数的95.6% 【 22/23 ≈ 95.6% 】，超过了85.7%【 p = (1 - (1 / n)) * 100 = (1 - (1 / 7)) * 100 ≈ 85.7% 】的阈值，生成了最高频率直方图(参考“高频直方图标准”)。
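
The threshold check in this step can be verified with a short calculation. The sketch below uses the row counts from the query output in this section; `top_n_coverage` is a hypothetical helper, and the code only reproduces the arithmetic, not the way the database evaluates the criteria.

```python
# Row counts per value of country_subregion_id, from the query output above.
value_counts = {
    52792: 1, 52793: 5, 52794: 2, 52795: 1,
    52796: 1, 52797: 2, 52798: 2, 52799: 9,
}


def top_n_coverage(counts, n):
    """Percentage of rows covered by the n most frequent values."""
    top_counts = sorted(counts.values(), reverse=True)[:n]
    return sum(top_counts) / sum(counts.values()) * 100


n = 7
coverage = top_n_coverage(value_counts, n)  # 22 of 23 rows
p = (1 - (1 / n)) * 100                     # internal threshold for n buckets
qualifies = len(value_counts) > n and coverage >= p

print(f"coverage = {coverage:.2f}%, p = {p:.2f}%, top frequency: {qualifies}")
```

The coverage works out to 22/23, roughly 95.65%, against a threshold of roughly 85.71%, so both the NDV > n criterion and the coverage criterion are met.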

3.Query the endpoint number and endpoint value for the column.
查询列的端点号和端点值。

For example, use the following query (sample output included):
例如，使用以下查询:

SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE
FROM   USER_HISTOGRAMS
WHERE  TABLE_NAME='COUNTRIES'
AND    COLUMN_NAME='COUNTRY_SUBREGION_ID';

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              1          52792
              6          52793
              8          52794
              9          52796
             11          52797
             13          52798
             22          52799

Figure 11-3 is a graphical illustration of the 7 buckets in the top frequency histogram. The values are represented in the diagram as coins.
图11-3是直方图中7个桶的图示。每个值都表示为投到桶中的硬币。

Figure 11-3 Top Frequency Histogram
图 11-3 高频直方图

Description of “Figure 11-3 Top Frequency Histogram”
“图 11-3 高频直方图”说明
As shown in Figure 11-3, each distinct value has its own bucket except for 52795, which is excluded from the histogram because it is nonpopular and statistically insignificant. As in a standard frequency histogram, the endpoint number represents the cumulative frequency of values.
如图11-3所示，除了52795之外，每个不同的值都有自己的桶，因为52795不流行，统计意义不显著，所以不计入直方图。与标准频率直方图一样，端点号(Endpoint Number)表示值的累积频率。
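
The endpoint numbers of the top frequency histogram follow from the same cumulative bookkeeping once the excluded value is skipped. In the sketch below the excluded value is taken from the example as a given input; how the database decides which value to drop is not modelled, and the function name is hypothetical.

```python
# Row counts per value of country_subregion_id, from the query output above.
value_counts = {
    52792: 1, 52793: 5, 52794: 2, 52795: 1,
    52796: 1, 52797: 2, 52798: 2, 52799: 9,
}


def top_frequency_endpoints(counts, excluded):
    """Cumulative (endpoint_number, endpoint_value) pairs for the values
    that remain after the excluded ones are dropped."""
    pairs = []
    running_total = 0
    for value in sorted(counts):
        if value in excluded:
            continue
        running_total += counts[value]
        pairs.append((running_total, value))
    return pairs


# 52795 is the value that the example histogram leaves out (Figure 11-3).
endpoints = top_frequency_endpoints(value_counts, excluded={52795})
```

The final endpoint number is 22 and not 23, because the single row holding 52795 is no longer counted in any bucket.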

✎ See Also：
参考：
Oracle Database PL/SQL Packages and Types Reference to learn about the DBMS_STATS.GATHER_TABLE_STATS procedure
查看 => Oracle数据字典 142-45 SYS.DBMS_STATS.gather_table_stats 了解关于收集表统计信息(GATHER_TABLE_STATS) 的内容。

Oracle Database Reference to learn about the USER_TAB_COL_STATISTICS view

Oracle Database Reference to learn about the USER_HISTOGRAMS view

七、Height-Balanced Histograms (Legacy)

七、高度平衡直方图

In a legacy height-balanced histogram, column values are divided into buckets so that each bucket contains approximately the same number of rows.
高度平衡直方图Height-Balanced Histograms (Legacy)中，列值被划分为多个桶，以便每个桶包含大致相同数量的行。

For example, if you have 99 coins to distribute among 4 buckets, each bucket contains about 25 coins. The histogram shows where the endpoints fall in the range of values.
例如，如果你有99个硬币可以分配到4个桶中，每个桶大约包含25个硬币。直方图显示了端点(endpoints)在值范围内的位置。
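
The coin analogy can be sketched as an equal-height split of a sorted list. The pile of coins below is made up for illustration (the source gives only the totals 99 and 4), and the splitting rule is a simple assumption that is not claimed to match Oracle's internal bucketing.

```python
def height_balanced_endpoints(sorted_values, n):
    """Split sorted values into n buckets of near-equal size and return
    the endpoint (highest) value of each bucket."""
    if not 0 < n <= len(sorted_values):
        raise ValueError("n must be between 1 and the number of values")
    base, extra = divmod(len(sorted_values), n)
    endpoints = []
    position = 0
    for bucket in range(n):
        # The first `extra` buckets take one additional value each.
        position += base + (1 if bucket < extra else 0)
        endpoints.append(sorted_values[position - 1])
    return endpoints


# Hypothetical pile of 99 coins (values in cents); not data from the source.
coins = sorted([1] * 60 + [5] * 20 + [10] * 10 + [25] * 9)
endpoints = height_balanced_endpoints(coins, 4)
print(endpoints)
```

With this pile the buckets hold 25, 25, 25, and 24 coins, and the penny is the endpoint of the first two buckets. A value that appears as the endpoint of more than one bucket is what section 4.2 calls a popular value.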

7.1 Criteria for Height-Balanced Histograms

7.1 高度平衡直方图的标准

Before Oracle Database 12c, the database created a height-balanced histogram when the NDV was greater than n. This type of histogram was useful for range predicates, and equality predicates on values that appear as endpoints in at least two buckets.
在Oracle数据库12c之前，当不同值数量(NDV) > n 时，数据库会高度平衡直方图。这种直方图对于范围谓词(range predicates)和在至少两个桶中作为端点出现的值上的相等谓词非常有用。

As shown in the logic diagram in “How Oracle Database Chooses the Histogram Type”, the database creates a height-balanced histogram when the following criteria are met:
如“11-1”的逻辑图所示，当满足以下条件时，数据库会创建一个高度平衡直方图(Height-Balanced Histograms):

注释1: NDV = 不同值的数目;
注释2: n = 直方图桶数(默认为254);
注释3: p = (1 - (1 / n)) * 100。

NDV is greater than n, where n is the number of histogram buckets (default 254).
不同值的数量(NDV)比n大，n为桶数（默认254）
The estimate_percent parameter in the DBMS_STATS statistics gathering procedure is not set to AUTO_SAMPLE_SIZE.
DBMS_STATS统计信息收集过程中的estimate_percent参数未被设置为自动采样(AUTO_SAMPLE_SIZE)。

It follows that if Oracle Database 12c creates new histograms, and if the sampling percentage is AUTO_SAMPLE_SIZE, then the histograms are either top frequency or hybrid, but not height-balanced.
如果Oracle 12c创建了新的直方图，并且采样百分比是AUTO_SAMPLE_SIZE，那么直方图要么是最高频(Top Frequency)的，要么是混合(hybrid)的，但不是高度平衡(Height-Balanced)的。

If you upgrade Oracle Database 11g to Oracle Database 12c, then any height-balanced histograms created before the upgrade remain in use. However, if you refresh statistics on the table on which the histogram was created, then the database replaces existing height-balanced histograms on this table. The type of replacement histogram depends on both the NDV and the following criteria:
如果将Oracle Database 11g升级到Oracle Database 12c，那么在升级之前创建的任何基于高度的直方图都将继续使用。但是，如果刷新创建直方图的表上的统计信息，则数据库将替换该表上现有的高度平衡直方图(Height-Balanced Histograms)。替换直方图的类型既取决于不同值数量(NDV)，也取决于以下标准:

If the sampling percentage is AUTO_SAMPLE_SIZE, then the database creates either hybrid or frequency histograms.
如果采样百分比是自动采样(AUTO_SAMPLE_SIZE)，则数据库创建混合(Hybrid Histograms) 或频率直方图(Frequency Histograms)。
If the sampling percentage is not AUTO_SAMPLE_SIZE, then the database creates either height-balanced or frequency histograms.
如果采样百分比不是自动采样(AUTO_SAMPLE_SIZE)，则数据库创建高度平衡或频率直方图。

7.2 Generating a Height-Balanced Histogram

7.2 生成高度平衡直方图

This scenario shows how to generate a height-balanced histogram using the sample schemas.
以下场景展示了如何使用示例模式生成高度平衡直方图。

Assumptions
假设

This scenario assumes that you want to generate a height-balanced histogram on the sh.countries.country_subregion_id column. This table has 23 rows.
假设你想生成 sh.countries.country_subregion_id 列的高度平衡直方图。该表数据量为23。

The following query shows that the country_subregion_id column contains 8 distinct values (sample output included) that are unevenly distributed:
下面的查询显示 country_subregion_id 列包含8个分布不均匀的不同值。

SELECT country_subregion_id, count(*)
FROM   sh.countries
GROUP BY country_subregion_id
ORDER BY 1;

COUNTRY_SUBREGION_ID   COUNT(*)
-------------------- ----------
               52792          1
               52793          5
               52794          2
               52795          1
               52796          1
               52797          2
               52798          2
               52799          9

To generate a height-balanced histogram:

1. Gather statistics for sh.countries and the country_subregion_id column, specifying fewer buckets than distinct values.

✎ Note:
To simulate Oracle Database 11g behavior, which is necessary to create a height-balanced histogram, set estimate_percent to a nondefault value. If you specify a nondefault percentage, then the database creates frequency or height-balanced histograms.
- For example, enter the following command:
```
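BEGIN
  -- 7 buckets for 8 distinct values; a nondefault estimate_percent
  -- (anything other than AUTO_SAMPLE_SIZE) permits a height-balanced histogram.
  DBMS_STATS.GATHER_TABLE_STATS (
      ownname          => 'SH'
  ,   tabname          => 'COUNTRIES'
  ,   method_opt       => 'FOR COLUMNS COUNTRY_SUBREGION_ID SIZE 7'
  ,   estimate_percent => 100
  );
END;
/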
```
2. Query the histogram information for the country_subregion_id column.
For example, use the following query (sample output included):
```
SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM   USER_TAB_COL_STATISTICS
WHERE  TABLE_NAME='COUNTRIES'
AND    COLUMN_NAME='COUNTRY_SUBREGION_ID';

TABLE_NAME COLUMN_NAME          NUM_DISTINCT HISTOGRAM
---------- -------------------- ------------ ---------------
COUNTRIES  COUNTRY_SUBREGION_ID            8 HEIGHT BALANCED
```
The optimizer chooses a height-balanced histogram because the number of distinct values (8) is greater than the number of buckets (7), and the estimate_percent value is nondefault (the default is DBMS_STATS.AUTO_SAMPLE_SIZE).

3. Query the number of rows occupied by each distinct value.
For example, use the following query (sample output included):

SELECT COUNT(country_subregion_id) AS NUM_OF_ROWS, country_subregion_id
FROM   countries
GROUP BY country_subregion_id
ORDER BY 2;

NUM_OF_ROWS COUNTRY_SUBREGION_ID
----------- --------------------
          1                52792
          5                52793
          2                52794
          1                52795
          1                52796
          2                52797
          2                52798
          9                52799

4. Query the endpoint number and endpoint value for the country_subregion_id column.
For example, use the following query (sample output included):

SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE
FROM   USER_HISTOGRAMS
WHERE  TABLE_NAME='COUNTRIES'
AND    COLUMN_NAME='COUNTRY_SUBREGION_ID';

ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
              0          52792
              2          52793
              3          52795
              4          52798
              7          52799

Figure 11-4 is a graphical illustration of the height-balanced histogram. The values are represented in the diagram as coins.

Figure 11-4 Height-Balanced Histogram

Author's note: the figure in the official documentation omits the value 52796.

Description of “Figure 11-4 Height-Balanced Histogram”
The bucket number is identical to the endpoint number. The optimizer records the value of the last row in each bucket as the endpoint value, and then checks to ensure that the minimum value is the endpoint value of the first bucket, and the maximum value is the endpoint value of the last bucket. In this example, the optimizer adds bucket 0 so that the minimum value 52792 is the endpoint of a bucket.
The optimizer must evenly distribute 23 rows into the 7 specified histogram buckets, so each bucket contains approximately 3 rows. However, the optimizer compresses buckets with the same endpoint. So, instead of bucket 1 containing 2 instances of value 52793, and bucket 2 containing 3 instances of value 52793, the optimizer puts all 5 instances of value 52793 into bucket 2. Similarly, instead of having buckets 5, 6, and 7 contain 3 values each, with the endpoint of each bucket as 52799, the optimizer puts all 9 instances of value 52799 into bucket 7.
In this example, buckets 3 and 4 (endpoint values 52795 and 52798 in the query output of step 4) contain nonpopular values because the difference between the current endpoint number and previous endpoint number is 1. The optimizer calculates cardinality for these values based on density. The remaining buckets contain popular values. The optimizer calculates cardinality for these values based on endpoint numbers.
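The bucket assignment, bucket compression, and popular-value rule described above can be sketched in Python. This is a minimal simulation, not Oracle's internal algorithm: the function names are hypothetical, and the rule that bucket k ends at sorted row floor(k * N / n) is an assumption chosen because it happens to reproduce the sample output of step 4.

```python
def height_balanced_endpoints(values, num_buckets):
    """Return compressed (endpoint_number, endpoint_value) pairs.

    Assumption: bucket k ends at the sorted row floor(k * N / num_buckets).
    """
    rows = sorted(values)
    n = len(rows)
    compressed = {}
    for k in range(1, num_buckets + 1):
        last_row = max(1, (k * n) // num_buckets)  # 1-based position of the bucket's last row
        # Bucket compression: a later bucket with the same endpoint value
        # overwrites the earlier one, so only the highest bucket number survives.
        compressed[rows[last_row - 1]] = k
    endpoints = sorted((number, value) for value, number in compressed.items())
    # Add bucket 0 when the minimum value is not already an endpoint.
    if endpoints[0][1] != rows[0]:
        endpoints.insert(0, (0, rows[0]))
    return endpoints


def popular_values(endpoints):
    """A value is popular when its endpoint number is more than 1 above the previous one."""
    return [value
            for (prev, _), (cur, value) in zip(endpoints, endpoints[1:])
            if cur - prev > 1]


# Row counts per country_subregion_id, taken from the query output in step 3.
counts = {52792: 1, 52793: 5, 52794: 2, 52795: 1,
          52796: 1, 52797: 2, 52798: 2, 52799: 9}
rows = [value for value, count in counts.items() for _ in range(count)]

endpoints = height_balanced_endpoints(rows, 7)
popular = popular_values(endpoints)
print(endpoints)  # [(0, 52792), (2, 52793), (3, 52795), (4, 52798), (7, 52799)]
print(popular)    # [52793, 52799]
```

Buckets 3 and 4 each advance the endpoint number by 1, so their endpoint values are treated as nonpopular, while 52793 and 52799 span several original buckets and are popular.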

✎ See Also:
Oracle Database PL/SQL Packages and Types Reference to learn about the DBMS_STATS.GATHER_TABLE_STATS procedure
(Author's cross-reference: Oracle data dictionary notes, 142-45 SYS.DBMS_STATS.gather_table_stats, for more on GATHER_TABLE_STATS.)

Oracle Database Reference to learn about the USER_TAB_COL_STATISTICS view

Oracle Database Reference to learn about the USER_HISTOGRAMS view

八、Hybrid Histograms

A hybrid histogram combines characteristics of both height-balanced histograms and frequency histograms. This “best of both worlds” approach enables the optimizer to obtain better selectivity estimates in some situations.
The height-balanced histogram sometimes produces inaccurate estimates for values that are almost popular. For example, a value that occurs as an endpoint value of only one bucket but almost occupies two buckets is not considered popular.
To solve this problem, a hybrid histogram distributes values so that no value occupies more than one bucket, and then stores the endpoint repeat count value, which is the number of times the endpoint value is repeated, for each endpoint (bucket) in the histogram. By using the repeat count, the optimizer can obtain accurate estimates for almost popular values.

8.1 How Endpoint Repeat Counts Work

The analogy of coins distributed among buckets illustrates how endpoint repeat counts work.
The following figure illustrates a coins column that sorts values from low to high.

Figure 11-5 Coins

You gather statistics for this table, setting the method_opt argument of DBMS_STATS.GATHER_TABLE_STATS to FOR ALL COLUMNS SIZE 3. In this case, the optimizer initially groups the values in the coins column into three buckets, as shown in the following figure.

Figure 11-6 Initial Distribution of Values

If a bucket border splits a value so that some occurrences of the value are in one bucket and some in another (for example, value 5 is split when some of its occurrences fall in bucket 1 and the rest in bucket 2), then the optimizer shifts the bucket border (and all other following bucket borders) forward to include all occurrences of the value. For example, the optimizer shifts value 5 so that it is now wholly in the first bucket, and the value 25 is now wholly in the second bucket.

Figure 11-7 Redistribution of Values

The endpoint repeat count measures the number of times that the corresponding bucket endpoint, which is the value at the right bucket border, repeats itself. For example, in the first bucket, the value 5 is repeated 3 times, so the endpoint repeat count is 3.

Figure 11-8 Endpoint Repeat Count

Height-balanced histograms do not store as much information as hybrid histograms. By using repeat counts, the optimizer can determine exactly how many occurrences of an endpoint value exist. For example, the optimizer knows that the value 5 appears 3 times, the value 25 appears 4 times, and the value 100 appears 2 times. This frequency information helps the optimizer to generate better cardinality estimates.
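The border shifting and repeat counts can be illustrated with a short Python sketch. The figures are not reproduced in these notes, so the coin list below is an assumption chosen to agree with the counts stated in the text (5 appears 3 times, 25 appears 4 times, 100 appears 2 times). The greedy rule "close a bucket only at a value boundary" is a simplification for illustration; Oracle's real bucketing logic is internal and can place borders differently.

```python
def hybrid_buckets(sorted_values, num_buckets):
    """Split sorted values into buckets without splitting any single value."""
    target = len(sorted_values) / num_buckets  # ideal number of rows per bucket
    buckets, current = [], []
    for i, value in enumerate(sorted_values):
        current.append(value)
        is_last = i == len(sorted_values) - 1
        # Shift the border forward until the next value differs from this one.
        if is_last or (len(current) >= target and sorted_values[i + 1] != value):
            buckets.append(current)
            current = []
    return buckets


def endpoint_rows(buckets):
    """Return (endpoint_number, endpoint_value, endpoint_repeat_count) per bucket."""
    rows, cumulative = [], 0
    for bucket in buckets:
        cumulative += len(bucket)      # cumulative row count up to this bucket
        endpoint = bucket[-1]          # value at the right bucket border
        rows.append((cumulative, endpoint, bucket.count(endpoint)))
    return rows


# Assumed contents of the coins column (hypothetical, consistent with the text).
coins = [1, 1, 1, 5, 5, 5, 10, 10, 25, 25, 25, 25, 50, 100, 100]

buckets = hybrid_buckets(coins, 3)
histogram = endpoint_rows(buckets)
print(buckets)    # [[1, 1, 1, 5, 5, 5], [10, 10, 25, 25, 25, 25], [50, 100, 100]]
print(histogram)  # [(6, 5, 3), (12, 25, 4), (15, 100, 2)]
```

With three buckets of five rows each, the initial borders would split the values 5 and 25; shifting each border forward keeps every value in a single bucket.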

8.2 Criteria for Hybrid Histograms

The only differentiating criterion for hybrid histograms as compared to top frequency histograms is that the percentage of rows occupied by the top n frequent values is less than the internal threshold p.
As shown in the logic diagram in “How Oracle Database Chooses the Histogram Type”, the database creates a hybrid histogram when the following criteria are met:

Note 1: NDV = number of distinct values.
Note 2: n = number of histogram buckets (default 254).
Note 3: p = (1 - (1 / n)) * 100.

- NDV is greater than n, where n is the number of histogram buckets (default is 254).
- The criteria for top frequency histograms do not apply.
  This is another way of stating that the percentage of rows occupied by the top n frequent values is less than threshold p, where p is (1-(1/n))*100. See “Criteria For Top Frequency Histograms.”
- The estimate_percent parameter in the DBMS_STATS statistics gathering procedure is set to AUTO_SAMPLE_SIZE.
  If users specify their own percentage, then the database creates frequency or height-balanced histograms.
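The criteria in these notes can be collected into one small decision function. This is a hedged sketch of the rules as stated here, not Oracle code: the function name and parameters are hypothetical, and the returned strings follow the values shown in the HISTOGRAM column of USER_TAB_COL_STATISTICS. The threshold test uses integer arithmetic, which is equivalent to comparing the top-n percentage with p = (1-(1/n))*100 but avoids floating-point rounding.

```python
def choose_histogram_type(ndv, num_buckets, top_n_rows, total_rows,
                          auto_sample_size=True):
    """Pick a histogram type from the criteria described in these notes.

    top_n_rows: rows occupied by the num_buckets most frequent values.
    auto_sample_size: True when estimate_percent is AUTO_SAMPLE_SIZE.
    """
    if ndv <= num_buckets:
        return "FREQUENCY"            # every distinct value gets its own bucket
    if not auto_sample_size:
        return "HEIGHT BALANCED"      # legacy behavior with a user-specified percentage
    # top_n_rows / total_rows >= 1 - 1/n, rewritten without division.
    if top_n_rows * num_buckets >= total_rows * (num_buckets - 1):
        return "TOP-FREQUENCY"
    return "HYBRID"


# sh.countries example: 8 distinct values, 7 buckets, estimate_percent => 100.
print(choose_histogram_type(8, 7, 22, 23, auto_sample_size=False))  # HEIGHT BALANCED
# sh.products example: 22 distinct values, 10 buckets, top 10 values cover 54 of 72 rows.
print(choose_histogram_type(22, 10, 54, 72))                        # HYBRID
```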

✎ See Also:
“Height-Balanced Histograms (Legacy)”

8.3 Generating a Hybrid Histogram

This scenario shows how to generate a hybrid histogram using the sample schemas.

Assumptions

This scenario assumes that you want to generate a hybrid histogram on the sh.products.prod_subcategory_id column. This table has 72 rows. The prod_subcategory_id column contains 22 distinct values.

To generate a hybrid histogram:

1. Gather statistics for sh.products and the prod_subcategory_id column, specifying 10 buckets.
For example, enter the following command:
```
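BEGIN
  -- estimate_percent is omitted, so it keeps its default (AUTO_SAMPLE_SIZE),
  -- which is required for a hybrid histogram.
  DBMS_STATS.GATHER_TABLE_STATS (
      ownname     => 'SH'
  ,   tabname     => 'PRODUCTS'
  ,   method_opt  => 'FOR COLUMNS PROD_SUBCATEGORY_ID SIZE 10'
  );
END;
/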
```

2. Query the number of rows occupied by each distinct value.
For example, use the following queries (sample output included). The first orders by frequency, the second by value:

SELECT COUNT(prod_subcategory_id) AS NUM_OF_ROWS, prod_subcategory_id
FROM   products
GROUP BY prod_subcategory_id
ORDER BY 1 DESC;

NUM_OF_ROWS PROD_SUBCATEGORY_ID
----------- -------------------
          8                2014
          7                2055
          6                2032
          6                2054
          5                2056
          5                2031
          5                2042
          5                2051
          4                2036
          3                2043
          2                2033
          2                2034
          2                2013
          2                2012
          2                2053
          2                2035
          1                2022
          1                2041
          1                2044
          1                2011
          1                2021
          1                2052

22 rows selected.

SELECT COUNT(prod_subcategory_id) AS NUM_OF_ROWS, prod_subcategory_id
FROM   products
GROUP BY prod_subcategory_id
ORDER BY 2;

NUM_OF_ROWS PROD_SUBCATEGORY_ID
----------- -------------------
          1                2011
          2                2012
          2                2013
          8                2014
          1                2021
          1                2022
          5                2031
          6                2032
          2                2033
          2                2034
          2                2035
          4                2036
          1                2041
          5                2042
          3                2043
          1                2044
          5                2051
          1                2052
          2                2053
          6                2054
          7                2055
          5                2056

The column contains 22 distinct values. Because the number of buckets (10) is less than 22, the optimizer cannot create a frequency histogram. The optimizer considers both hybrid and top frequency histograms. To qualify for a top frequency histogram, the percentage of rows occupied by the top 10 most frequent values must be equal to or greater than threshold p, where p is (1-(1/10))*100, or 90%. However, in this case the top 10 most frequent values occupy 54 rows out of 72, which is only 75% of the total. Therefore, the optimizer chooses a hybrid histogram because the criteria for a top frequency histogram do not apply.
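The arithmetic behind this decision can be checked with a few lines of Python. This is a sketch for verification only; the helper name is hypothetical, and the counts are copied from the query output in step 2.

```python
def top_frequency_check(value_counts, num_buckets):
    """Return (rows covered by the top-n values, coverage in percent, threshold p)."""
    total_rows = sum(value_counts.values())
    top_rows = sum(sorted(value_counts.values(), reverse=True)[:num_buckets])
    coverage_pct = 100 * top_rows / total_rows
    threshold_p = 100 * (num_buckets - 1) / num_buckets  # same as (1 - 1/n) * 100
    return top_rows, coverage_pct, threshold_p


# Rows per prod_subcategory_id, from the sample output above.
prod_counts = {
    2011: 1, 2012: 2, 2013: 2, 2014: 8, 2021: 1, 2022: 1,
    2031: 5, 2032: 6, 2033: 2, 2034: 2, 2035: 2, 2036: 4,
    2041: 1, 2042: 5, 2043: 3, 2044: 1,
    2051: 5, 2052: 1, 2053: 2, 2054: 6, 2055: 7, 2056: 5,
}

top_rows, coverage_pct, threshold_p = top_frequency_check(prod_counts, 10)
print(top_rows, coverage_pct, threshold_p)  # 54 75.0 90.0
print(coverage_pct >= threshold_p)          # False, so top frequency does not qualify
```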

3. Query the histogram information for the prod_subcategory_id column.
For example, use the following query (sample output included):

SELECT TABLE_NAME, COLUMN_NAME, NUM_DISTINCT, HISTOGRAM
FROM   USER_TAB_COL_STATISTICS
WHERE  TABLE_NAME='PRODUCTS'
AND    COLUMN_NAME='PROD_SUBCATEGORY_ID';

TABLE_NAME COLUMN_NAME         NUM_DISTINCT HISTOGRAM
---------- ------------------- ------------ ---------
PRODUCTS   PROD_SUBCATEGORY_ID 22           HYBRID

4. Query the endpoint number, endpoint value, and endpoint repeat count for the prod_subcategory_id column.
For example, use the following query (sample output included):

SELECT ENDPOINT_NUMBER, ENDPOINT_VALUE, ENDPOINT_REPEAT_COUNT
FROM   USER_HISTOGRAMS
WHERE  TABLE_NAME='PRODUCTS'
AND    COLUMN_NAME='PROD_SUBCATEGORY_ID'
ORDER BY 1;

ENDPOINT_NUMBER ENDPOINT_VALUE ENDPOINT_REPEAT_COUNT
--------------- -------------- ---------------------
              1           2011                     1
             13           2014                     8
             26           2032                     6
             36           2036                     4
             45           2043                     3
             51           2051                     5
             52           2052                     1
             54           2053                     2
             60           2054                     6
             72           2056                     5

10 rows selected.

In a height-balanced histogram, the optimizer would evenly distribute 72 rows into the 10 specified histogram buckets, so that each bucket contains approximately 7 rows. Because this is a hybrid histogram, the optimizer distributes the values so that no value occupies more than one bucket. For example, the optimizer does not put some instances of value 2036 into one bucket and some instances of this value into another bucket: all 4 instances are in bucket 36.
The endpoint repeat count shows the number of times the highest value in the bucket is repeated. By using the endpoint number and repeat count for these values, the optimizer can estimate cardinality. For example, bucket 36 contains instances of values 2033, 2034, 2035, and 2036. The endpoint value 2036 has an endpoint repeat count of 4, so the optimizer knows that 4 instances of this value exist. For values such as 2033, which are not endpoints, the optimizer estimates cardinality using density.
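To see how the three columns fit together, the following Python sketch breaks each bucket of the sample output into rows that belong to the endpoint value and rows that belong to other values. The function name is hypothetical, and the sketch only does the bookkeeping; it does not reproduce Oracle's density-based estimate for non-endpoint values.

```python
def bucket_breakdown(histogram):
    """Map each endpoint value to the row counts of its bucket.

    histogram: (endpoint_number, endpoint_value, endpoint_repeat_count) tuples,
    ordered by endpoint_number.
    """
    breakdown = {}
    previous = 0
    for endpoint_number, endpoint_value, repeat_count in histogram:
        bucket_rows = endpoint_number - previous  # endpoint numbers are cumulative
        breakdown[endpoint_value] = {
            "bucket_rows": bucket_rows,
            "endpoint_rows": repeat_count,
            "other_rows": bucket_rows - repeat_count,
        }
        previous = endpoint_number
    return breakdown


# Sample output of the USER_HISTOGRAMS query above.
hybrid = [(1, 2011, 1), (13, 2014, 8), (26, 2032, 6), (36, 2036, 4),
          (45, 2043, 3), (51, 2051, 5), (52, 2052, 1), (54, 2053, 2),
          (60, 2054, 6), (72, 2056, 5)]

breakdown = bucket_breakdown(hybrid)
print(breakdown[2036])  # {'bucket_rows': 10, 'endpoint_rows': 4, 'other_rows': 6}
```

Bucket 36 holds 10 rows: 4 for the endpoint value 2036 and 6 for the values 2033, 2034, and 2035, which matches the 2 rows each reported for those values in step 2.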

✎ See Also:

Oracle Database PL/SQL Packages and Types Reference to learn about the DBMS_STATS.GATHER_TABLE_STATS procedure
(Author's cross-reference: Oracle data dictionary notes, 142-45 SYS.DBMS_STATS.gather_table_stats, for more on GATHER_TABLE_STATS.)
Oracle Database Reference to learn about the USER_TAB_COL_STATISTICS view

Oracle Database Reference to learn about the USER_HISTOGRAMS view

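The three endpoint values mean the following:

Endpoint Number: in frequency and hybrid histograms, a cumulative value, namely the total number of rows counted up to and including the last row of the current bucket. In a height-balanced histogram it is the bucket number.
Endpoint Value: the value of the last row in each bucket. The minimum value is the endpoint value of the first bucket, and the maximum value is the endpoint value of the last bucket.
Endpoint Repeat Count: the number of times the endpoint value is repeated.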

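九、References to Other Material

To be organized and studied further when time permits.

一、Meaning and Classification of Histograms
To keep uneven column distributions from seriously degrading SQL query performance, Oracle introduced histograms. A histogram is a special column statistic that describes the data distribution of a column in detail. It is stored in the HISTGRM$ base table and can be viewed through DBA_TAB_HISTOGRAMS, DBA_PART_HISTOGRAMS, and DBA_SUBPART_HISTOGRAMS. Oracle gathers histograms only for columns that are commonly used; by default it assumes that rarely used columns do not need them. The predicates used in SQL statements are recorded in the SYS.COL_USAGE$ base table. When the DBMS_STATS package runs, it first queries SYS.COL_USAGE$ and gathers histograms only for the columns recorded there.
An Oracle histogram describes the data distribution of the target column with buckets along two dimensions, ENDPOINT NUMBER and ENDPOINT VALUE. These correspond to the ENDPOINT_NUMBER/BUCKET_NUMBER and ENDPOINT_VALUE columns in DBA_TAB_HISTOGRAMS, DBA_PART_HISTOGRAMS, and DBA_SUBPART_HISTOGRAMS. The total number of buckets can be queried from DBA_TAB_COL_STATISTICS, DBA_PART_COL_STATISTICS, and DBA_SUBPART_COL_STATISTICS.
ENDPOINT NUMBER is the number of a bucket in the histogram. The number of buckets is controlled by the SIZE clause of the METHOD_OPT parameter in the DBMS_STATS package (default 254). In Oracle 11g this value cannot exceed 254, but in 12c it can be as high as 2048.
ENDPOINT VALUE is the end point of a histogram bucket. The CBO uses it to compute the selectivity of the target result set on the target column.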

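Histogram types:
Before Oracle 12c there were two histogram types, Frequency and Height Balanced. Oracle 12c introduced Top-Frequency and Hybrid histograms to address the performance problems caused by inaccurate Height Balanced histograms. In Oracle 11g, if the NUM_DISTINCT of the target column is less than or equal to the number of buckets, a Frequency histogram is used; if it is greater than the number of buckets, a Height Balanced histogram is used. From Oracle 12c onward, when the ESTIMATE_PERCENT parameter of DBMS_STATS is left at AUTO_SAMPLE_SIZE (the default) and NUM_DISTINCT is greater than the number of buckets, the database creates a Top-Frequency or Hybrid histogram instead, depending on the criteria described in sections 六 and 八.
Based on the symptoms of the problem discussed in the referenced article, the focus here is on Height-Balanced and Hybrid histograms.

1、Height Balanced Histograms
In Oracle Database before 12c, a Frequency histogram cannot have more than 254 buckets, so it applies only when the target column has at most 254 distinct values. If the target column has more than 254 distinct values, Oracle gathers a Height Balanced histogram for it. Oracle first sorts all rows of the table by the target column in ascending order, then divides the total number of rows by the number of buckets to decide how many sorted rows each bucket describes.
popular_value: put simply, a value that occurs frequently in the target column, such as 52793 in Figure 1 of the referenced article. nopopular_value: a value that occurs infrequently, such as 52792 in Figure 1. A Height Balanced histogram does not treat nonpopular values separately; it stores them together with popular values. Because NUM_DISTINCT is greater than the number of buckets, the buckets and data are laid out as shown in Figure 1, where a single popular value can be spread across several buckets. This layout affects how the CBO evaluates the buckets and can hurt performance. Selectivity and cardinality calculations for Height Balanced histograms are also more complex and vary between database versions.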

22/03/30

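NDV: number of distinct values. Oracle uses the NDV and the total number of rows to estimate cardinality.
↩︎
Distribution Of The Data: how the values in a column are distributed (compare the notion of a probability distribution from high-school mathematics).
↩︎
Dynamic Sampling: gathering statistics by sampling a proportion of the data.
Note that the original text here says "samples an internally predetermined number of rows", that is, the sample is a predetermined number of rows rather than a percentage.
When the cost-based optimizer (CBO) runs a SQL statement whose execution plan is not already in the shared pool (if it is, the statement is soft parsed), the statement is hard parsed and the execution plan is generated from the statistics of the objects involved. If Oracle finds that an object has no statistics, it uses dynamic sampling to analyze a portion of the data and builds the execution plan from these temporary statistics. These statistics are less accurate and are not stored with the object statistics, so the resulting execution plan can also be inaccurate.
↩︎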
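Data Skew: the values in a column are not evenly distributed.
For example, suppose column A takes the four values a, b, c, and d in a table of 10,000 rows, where a, b, and c each occur in 1,000 rows and d occurs in 7,000 rows. Column A is then skewed.
Large variations in the number of duplicate values in a column.
↩︎
Cardinality: the number of rows that an operation in an execution plan returns or is expected to return.
While generating an execution plan, Oracle first estimates the number of rows returned based on the predicates (WHERE conditions), where cardinality = number of rows in the table * selectivity. Understanding cardinality is therefore mostly a matter of understanding how selectivity is computed. See "Oracle 执行计划（2）-基数 cardinality".
The number of rows that is expected to be or is returned by an operation in an execution plan. Data has low cardinality when the number of distinct values in a column is low in relation to the total number of rows.
(Author's note: this last sentence is not yet clear to me.)
↩︎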
Join Predicates: conditions in the WHERE clause of a query. They correspond to the Predicate Information section shown with an execution plan.
↩︎
Full Table Scan:
A scan of table data in which the database sequentially reads all rows from a table and filters out those that do not meet the selection criteria. All data blocks under the high water mark are scanned.
(That is, every block below the high water mark (HWM) is scanned; blocks above it contain no data and need not be scanned.)
↩︎
Index Scan:
Table Access by ROWID or rowid lookup
The ROWID of a row (a pointer) identifies the data file and data block that contain the row and the position of the row within that block. Reading by ROWID locates the target data directly and is the fastest way for Oracle to access a single row.
↩︎
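External Table: available from Oracle 9i onward. Put simply, an external table is a table whose data is not stored in the database. By giving Oracle metadata that describes the external table, you can access an operating system file as if it were a read-only database table, just as if the data were stored in an ordinary table. External tables are an extension of database tables.
↩︎
Bulk Load: HBase also has a BulkLoad feature, whose rough principle is to speed up large data imports by loading the underlying storage files directly. For Oracle bulk loads, see the documentation section "Online Statistics Gathering for Bulk Loads".
↩︎
Endpoint_Numbers: the number that uniquely identifies a bucket in a histogram. In frequency and hybrid histograms, the endpoint number is the cumulative frequency of the endpoints. In height-balanced histograms, the endpoint number is the bucket number.
↩︎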
