Elasticsearch Field Options Norms

What the norms option does when defining an Elasticsearch field

This article explains what the norms parameter does for the two string field types in Elasticsearch, text and keyword.

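When creating an ES index, you usually specify two kinds of configuration: settings and mappings. settings relate to data storage (how many shards, how many replicas), while mappings are the data model, similar to a table definition in MySQL. The mapping specifies the type of each field. Elasticsearch supports many field datatypes, such as String, Numeric, and Date. String is further split into two types: keyword and text. When creating an index, you define the fields and assign a type to each one, as in this example: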

PUT my_index
{
  "settings": {"number_of_shards": 1, "number_of_replicas": 0},
  "mappings": {
    "_doc": {
      "_source": {"enabled": true},
      "properties": {
        "title": {"type": "text", "norms": false},
        "overview": {"type": "text", "norms": true},
        "body": {"type": "text"},
        "author": {"type": "keyword", "norms": true},
        "chapters": {"type": "keyword", "norms": false},
        "email": {"type": "keyword"}
      }
    }
  }
}

In the my_index index, the title field is of type text, while the author field is of type keyword.

norms are enabled by default for text fields and disabled by default for keyword fields:

Whether field-length should be taken into account when scoring queries. Accepts true (the default for the text field datatype) or false (the default for the keyword field datatype).
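The defaults described above can be sketched in a few lines of Python. This is only an illustration of the rule "text defaults to true, keyword defaults to false" applied to the properties of my_index; the helper name effective_norms and the DEFAULT_NORMS table are made up for this sketch and are not part of any Elasticsearch client.

```python
# Default value of the norms option per string field type, as described above.
DEFAULT_NORMS = {"text": True, "keyword": False}

# The "properties" section of the my_index mapping from the example above.
properties = {
    "title": {"type": "text", "norms": False},
    "overview": {"type": "text", "norms": True},
    "body": {"type": "text"},
    "author": {"type": "keyword", "norms": True},
    "chapters": {"type": "keyword", "norms": False},
    "email": {"type": "keyword"},
}


def effective_norms(field_def: dict) -> bool:
    """Return the explicit norms setting, or the type's default if it is unset.

    Hypothetical helper for illustration only.
    """
    return field_def.get("norms", DEFAULT_NORMS[field_def["type"]])


for name, field_def in properties.items():
    print(name, field_def["type"], effective_norms(field_def))
```

Under these assumptions, body (a text field with no explicit setting) ends up with norms enabled, and email (a keyword field with no explicit setting) ends up with norms disabled.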

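Why are norms disabled by default for keyword fields? A keyword string can be understood as "do index the field, but don't analyze the string value". In other words, a keyword field is not broken into individual terms by an Analyzer. It is a single-token field, so it has no need for normalization factors such as field length (fieldNorm) or tfNorm (term frequency norm).

A text field, by contrast, is processed by an Analyzer and produces a number of terms. Of two text fields, one may hold many terms (such as the body of an article) and the other only a few (such as its title). In a multi-field query, the scores need length normalization to be comparable, which is presumably why norms are enabled by default for text fields.

In addition, of the two scoring algorithms commonly used in Lucene, TF-IDF and BM25, TF-IDF tends to give higher scores to shorter fields. Why? Lucene's similarity score is built mainly from three parts: the IDF score, the TF score, and fieldNorms. In the classic TF-IDF formula, the IDF score is 1 + log(numDocs/(docFreq+1)), the TF score is sqrt(tf), and fieldNorms is 1/sqrt(length). So the shorter the field, the larger fieldNorms and the higher the score, which is why TF-IDF leans heavily toward short text.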
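To make the effect of fieldNorms concrete, here is a minimal Python sketch of the TF and fieldNorms parts described above. It is a simplification, not Lucene's implementation: the IDF value is passed in as a plain number, the one-byte quantization of the norm and query normalization are ignored, and the function names are invented for this example.

```python
import math


def field_norm(length: int) -> float:
    """Length normalization: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(length)


def tf_score(tf: int) -> float:
    """Term frequency part: sqrt(tf)."""
    return math.sqrt(tf)


def simplified_tfidf(tf: int, length: int, idf: float) -> float:
    """Simplified per-term score: TF score * IDF score * fieldNorms."""
    return tf_score(tf) * idf * field_norm(length)


# Same term frequency and same (assumed) IDF; only the field length differs.
short_field = simplified_tfidf(tf=1, length=4, idf=2.0)    # e.g. a title
long_field = simplified_tfidf(tf=1, length=400, idf=2.0)   # e.g. a body

print("short field:", short_field)
print("long field: ", long_field)
```

With identical tf and idf, the 4-term field scores ten times higher than the 400-term field in this sketch, because sqrt(400) / sqrt(4) = 10. That is the bias toward short text mentioned above.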

What do norms do?

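norms are an "adjustment factor" used when computing the score of a document or field. Both TF-IDF and BM25 use norms when scoring documents. For details, see the Lucene document scoring formula in this article.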

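A document in Elasticsearch contains multiple fields. The query parser (QueryParser) parses the user's query string into terms. In a multi-field search, each term is matched against every field and a score is computed per field. The per-field scores are then combined in some way (term-centric search vs. field-centric search) to produce the final score of the document.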

The official ES documentation explains norms as follows:

Norms store various normalization factors that are later used at query time in order to compute the score of a document relatively to a query.

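These normalization factors are used for boosting when document scores are computed at query time. For example, when a document is scored with the BM25 formula (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)), the fieldLength / avgFieldLength term is the normalization factor.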
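The BM25 expression above can be evaluated directly. The following Python sketch implements only that term-frequency part of the formula (the IDF part is left out); the function name is invented for this example, and k1 = 1.2 and b = 0.75 are the default values of Lucene's BM25 similarity.

```python
def bm25_tf_part(freq: float, field_length: float, avg_field_length: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """Term-frequency part of BM25, exactly as written in the formula above."""
    length_ratio = field_length / avg_field_length  # the normalization factor
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# One occurrence of the term, in fields of different lengths (average = 100).
short_field = bm25_tf_part(freq=1, field_length=10, avg_field_length=100)
average_field = bm25_tf_part(freq=1, field_length=100, avg_field_length=100)
long_field = bm25_tf_part(freq=1, field_length=1000, avg_field_length=100)

print("short:  ", short_field)
print("average:", average_field)
print("long:   ", long_field)

# With b = 0 the field length no longer influences the result.
print("b = 0:  ", bm25_tf_part(freq=1, field_length=1000,
                               avg_field_length=100, b=0.0))
```

A field shorter than average gets a higher value than an average-length field, and a longer one gets a lower value. Setting b to 0 removes the length normalization from this formula entirely.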

The cost of norms

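With norms enabled, every field of every document needs one byte to store its norm. Because norms are enabled by default for text fields, you can disable them on text fields that do not need scoring, which is a worthwhile tuning option.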

Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field
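As a rough back-of-the-envelope illustration of "one byte per document per field", the Python sketch below multiplies the numbers out. The document count and field count are hypothetical, the function name is invented for this example, and real on-disk usage depends on how Lucene encodes and compresses norms, so treat the result as an order-of-magnitude estimate only.

```python
def estimate_norms_bytes(num_docs: int, fields_with_norms: int,
                         bytes_per_norm: int = 1) -> int:
    """Rough estimate: one norm byte per document per field with norms enabled."""
    return num_docs * fields_with_norms * bytes_per_norm


# Hypothetical index: 100 million documents, 3 fields with norms enabled.
total_bytes = estimate_norms_bytes(num_docs=100_000_000, fields_with_norms=3)
print(f"{total_bytes} bytes, about {total_bytes / 1024 ** 2:.1f} MiB")

# Disabling norms on one of the three fields saves roughly a third of that.
saved = total_bytes - estimate_norms_bytes(100_000_000, 2)
print(f"saved by disabling norms on one field: {saved} bytes")
```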

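The norms factor is part of index-time boosting: when a document is indexed (written), all boosting factors are stored, and at query time they are read from memory and take part in score computation. See this passage from "Lucene in Action":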

During indexing, all sources of index-time boosts are combined into a single floating point number for each indexed field in the document. The document may have its own boost; each field may have a boost; and Lucene computes an automatic boost based on the number of tokens in the field (shorter fields have a higher boost). These boosts are combined and then compactly encoded (quantized) into a single byte, which is stored per field per document. During searching, norms for any field being searched are loaded into memory, decoded back into a floating-point number, and used when computing the relevance score.

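The other kind of boosting is search-time boosting: a boost factor is specified in the query, and the document score is computed dynamically. See "Relevant Search: With Applications for Solr and Elasticsearch" for details; this article does not cover it further. Note, however, that current ES versions no longer recommend index-time boosting and recommend search-time boosting instead. The official ES documentation gives the following reasons: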

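Boost factors stored at index time (with the norms option enabled) cannot be changed once they are stored. The only way to change them is to reindex.
Search-time boosting achieves the same effect as index-time boosting, and the boost factor can be set dynamically per query (probably at the cost of more CPU when scoring), so it is more flexible. Index-time boosting also needs extra storage.
Index-time boost factors are stored as part of the norm, which is only one byte. This reduces the resolution of field length normalization and can lead to lower quality relevance calculations.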
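For comparison, a search-time boost lives entirely in the query. The Python sketch below only builds a request body as a dictionary and prints it as JSON; it does not contact a cluster. It uses the multi_match query with the field^boost syntax. The helper name, the query text, and the boost values are made up for illustration.

```python
import json


def boosted_multi_match(query_text: str, field_boosts: dict) -> dict:
    """Build a multi_match query body with per-field search-time boosts.

    Hypothetical helper: field_boosts maps a field name to its boost factor.
    """
    fields = [f"{name}^{boost}" for name, boost in field_boosts.items()]
    return {"query": {"multi_match": {"query": query_text, "fields": fields}}}


# Matches in overview count twice as much as matches in body (assumed values).
body = boosted_multi_match("elasticsearch norms", {"overview": 2, "body": 1})
print(json.dumps(body, indent=2))
```

Changing a boost here means changing one number in the query and running it again. Nothing stored in the index has to be rewritten, which is the flexibility argument made above.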

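Appendix: the mapping of the my_index index. Note that norms does not appear in the response for fields where it equals the type's default (overview, chapters):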

GET my_index/_mapping

{
  "my_index": {
    "mappings": {
      "_doc": {
        "properties": {
          "author": {"type": "keyword", "norms": true},
          "body": {"type": "text"},
          "chapters": {"type": "keyword"},
          "email": {"type": "keyword"},
          "overview": {"type": "text"},
          "title": {"type": "text", "norms": false}
        }
      }
    }
  }
}
