Catalog

  • Term-based and full-text search
  • Structured search
  • Search relevance scoring
  • Query/filter contexts and multi-string queries
  • Single-string multi-field queries
  • Single-string multi-field queries: multi_match
  • Hands-on practice
  • Search templates and index aliases
  • Composite ranking: optimizing scores with Function Score Query
  • Term & Phrase Suggester
  • Autocomplete and context-based suggestions
  • Cross-cluster search
  • Cluster distributed model, master election, and split-brain
  • Shards and cluster failover
  • Distributed document storage
  • Shards and their lifecycle
  • Anatomy of distributed queries and relevance scoring
  • Sorting, doc values & fielddata
  • Pagination and traversal
  • Bucket & metric aggregations and nested aggregations
  • Pipeline aggregations
  • Aggregation scope and ordering
  • How aggregations work and their precision issues
  • Objects and nested objects
  • Parent-child document relationships
  • update_by_query and reindex APIs
  • Ingest pipelines and Painless scripts
  • Data modeling
  • Data modeling best practices
  • Summary

Term-based and full-text search

Term-based queries

At index time the desc field goes through an analyzer, so the stored token is the lowercase iphone. A term query does not analyze its input, so the final query term stays iPhone and matches nothing.
To get a hit with term, either query the post-analysis token or query the field's keyword sub-field; both are sketched after the sample below.

POST /products/_bulk
{ "index": { "_id": 1 }}
{ "productID" : "XHDK-A-1293-#fJ3","desc":"iPhone" }
{ "index": { "_id": 2 }}
{ "productID" : "KDKE-B-9947-#kL5","desc":"iPad" }
{ "index": { "_id": 3 }}
{ "productID" : "JODL-X-1937-#pV7","desc":"MBP" }POST products/_search
{"query": {"term": {"desc": {"value": "iPhone"}}},"profile": "true"
}
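A hedged sketch of both workarounds, assuming the default dynamic mapping created a desc.keyword sub-field:

# query the post-analysis token (lowercase)
POST /products/_search
{"query": {"term": {"desc": {"value": "iphone"}}}}

# query the keyword sub-field with the original text
POST /products/_search
{"query": {"term": {"desc.keyword": {"value": "iPhone"}}}}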

Analysis result (screenshot)

A term query still computes a score, even on a keyword field. Wrapping it in a constant_score filter skips scoring, reduces the performance cost, and leverages caching.

POST /products/_search
{"explain": true,"query": {"constant_score": {"filter": {"term": {"productID.keyword": "XHDK-A-1293-#fJ3"}}}}
}

Full-text queries

A match query analyzes the search input into terms, queries each term separately, and merges the results;
match_phrase treats the words as a whole and also cares about their positions, with slop allowing some positional deviation, as sketched below.
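A minimal sketch of slop against a hypothetical content field: with slop 1, "quick fox" tolerates one position of deviation and can still match "quick brown fox":

POST /my_index/_search
{"query": {"match_phrase": {"content": {"query": "quick fox","slop": 1}}}}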

Structured search

Searching boolean values

POST products/_search
{"query": {"term": {"avaliable": {"value": "true"}}}
}

POST products/_search
{"query": {"constant_score": {"filter": {"term": {"avaliable": true}}}}
}

Numbers

"query": {"constant_score": {"filter": {"range": {"price": {"gte": 10,"lte": 20}}}}}

Searching dates
now-4y is the current time minus 4 years, i.e., everything within the last 4 years

"query": {"constant_score": {"filter": {"range": {"date": {"gte" : "now-4y"}}}}}

Querying non-null fields

"query": {"constant_score": {"filter": {"exists": {"field": "date"}}}}

Querying multi-valued fields
This matches documents whose genre contains Comedy, not only those that are exactly Comedy

POST movies/_search
{"query": {"constant_score": {"filter": {"term": {"genre.keyword": "Comedy"}}}}
}

Search relevance scoring

TF (term frequency): how often the term occurs in the document. For example, in 我是中国人, 生在中国 the term 中国 occurs twice.
DF (document frequency): how many documents contain the term; IDF is its inverse. For example, if 中国 appears in 200 of 1,000 documents, IDF = log(1000/200).

IDF measures how distinctive the term is across the corpus.
Lucene originally scored with TF-IDF (TF weighted by IDF) and later switched to BM25, which fixes the problem of the score growing without bound as TF grows. ES lets you specify the scoring method when creating an index.

explain shows the scoring details.
Both documents contain the target term, but doc 2 is shorter, so its TF-related score is higher

PUT testscore/_bulk
{ "index": { "_id": 1 }}
{ "content":"we use Elasticsearch to power the search" }
{ "index": { "_id": 2 }}
{ "content":"we like elasticsearch" }
{ "index": { "_id": 3 }}
{ "content":"The scoring of documents is caculated by the scoring formula" }
{ "index": { "_id": 4 }}
{ "content":"you know, for search" }POST testscore/_search
{"query": {"match": {"content": "elasticsearch"}},"explain": true
}

Use boosting to shape the score: here negative demotes documents that contain like.

POST testscore/_search
{"query": {"boosting" : {"positive" : {"term" : {"content" : "elasticsearch"}},"negative" : {"term" : {"content" : "like"}},"negative_boost" : 0.2}}
}

Query/filter contexts and multi-string queries

A bool query combines conditions on multiple fields.
must and should contribute to the score; filter and must_not do not

# Basic syntax
POST /products/_search
{"query": {"bool" : {"must" : {"term" : { "price" : "30" }},"filter": {"term" : { "avaliable" : "true" }},"must_not" : {"range" : {"price" : { "lte" : 10 }}},"should" : [{ "term" : { "productID.keyword" : "JODL-X-1937-#pV7" } },{ "term" : { "productID.keyword" : "XHDK-A-1293-#fJ3" } }],"minimum_should_match" :1}}
}

Single-string multi-field queries

Use dis_max (disjunction max) to query several fields, compare the per-field scores, and keep the highest one


PUT /blogs/_doc/1
{"title": "Quick brown rabbits","body":  "Brown rabbits are commonly seen."
}

PUT /blogs/_doc/2
{"title": "Keeping pets healthy","body":  "My quick brown fox eats rabbits on a regular basis."
}
POST blogs/_search
{"query": {"dis_max": {"queries": [{"match": {"title": "Brown fox"}},{"match": {"body": "Brown fox"}}]}},"explain": true
}

Although doc 1 contains brown in both fields, dis_max keeps only the larger of the two field scores. For doc 2, the brown and fox scores within the body field are added together; fox appears in only one document, so it is rarer and deserves a higher score. Doc 2 therefore fits the query better, scores higher, and ranks first.

Searching only Quick pets scores both documents the same, because each document matches the same words.
tie_breaker balances the scores: instead of keeping only the best field, it adds the other fields' scores weighted by the tie_breaker factor.

POST blogs/_search
{"query": {"dis_max": {"queries": [{ "match": { "title": "Quick pets" }},{ "match": { "body":  "Quick pets" }}],"tie_breaker": 0.2}}
}

If you actually want every field's score to count, arguably a plain bool should query does the job.

POST /blogs/_search
{"query": {"bool": {"should": [{ "match": { "title": "Quick pets" }},{ "match": { "body":  "Quick pets" }}]}},"explain": true
}

Single-string multi-field queries: multi_match

Example: searching barking dogs against title alone scores doc 1 higher because the document is shorter, but doc 2 actually matches the intent better. To lift doc 2, add a title.std sub-field and query both fields.
multi_match is just a more compact syntax than dis_max; its default type is best_fields, i.e. disjunction max.

DELETE /titles
PUT /titles
{"mappings": {"properties": {"title": {"type": "text","analyzer": "english","fields": {"std": {"type": "text","analyzer": "standard"}}}}}
}

POST titles/_bulk
{ "index": { "_id": 1 }}
{ "title": "My dog barks" }
{ "index": { "_id": 2 }}
{ "title": "I see a lot of barking dogs on the road " }GET titles/_search
{"query": {"match": {"title": "barking dogs"}}
}

GET titles/_search
{"query": {"multi_match": {"type": "most_fields", // try "best_fields" to compare
"query": "barking dogs","fields": ["title","title.std"]}}
}

Hands-on practice

Import the TMDB dataset into ES with title mapped to the english analyzer; a multi_match for "basketball with cartoon aliens" then finds Space Jam.
If the index had used the default standard analyzer, the search would return nothing.
multi_match defaults to best_fields, which keeps only the best-scoring field's score.

"mappings": {"properties": {"overview": {"type": "text","analyzer": "english","fields": {"std": {"type": "text","analyzer": "standard"}}},"popularity": {"type": "float"},"title": {"type": "text","analyzer": "english","fields": {"keyword": {"type": "keyword","ignore_above": 256}}}}"query": {"multi_match": {"query": "basketball with cartoon aliens","fields": ["title^10","overview"]}}

Installing pyenv on Windows

pip install pyenv-win --target %USERPROFILE%/.pyenv

How to install multiple Python versions on Windows 10 with pyenv
My environment variables were misconfigured, so I just opened cmd in the pyenv directory

pyenv install 2.7.15
pyenv versions
python -V
pyenv global 2.7.15
pyenv global

Default mapping, default query (screenshot)

Mapping with the english analyzer, most_fields mode (screenshot)

Default mapping, default query: for Space Jam only basketball hits; the document's alien does not hit because the analyzer kept the query token as aliens

Search templates and index aliases

Search templates pre-register a query script that later searches reference by id; templates can take variables, and editing the template changes the query behavior without touching callers.


POST tmdb/_search
{"_source": ["title","overview"],"size":20,"query": {"multi_match": {"type": "most_fields", "query": "basketball with cartoon aliens","fields": ["title","overview"]}},"explain": true
}

POST _scripts/tmdb
{"script": {"lang": "mustache","source": {"_source": ["title","overview"],"size": 20,"query": {"multi_match": {"query": "{{q}}","fields": ["title","overview"]}}}}
}

POST tmdb/_search/template
{"id":"tmdb","params": {"q": "basketball with cartoon aliens"}
}

Index aliases
Create an alias for an index and query through the alias instead of the index name; an alias can also carry an extra filter.

PUT movies-2019/_doc/1
{"name":"the matrix","rating":5
}

PUT movies-2019/_doc/2
{"name":"Speed","rating":3
}

// create the alias
POST _aliases
{"actions": [{"add": {"index": "movies-2019","alias": "movies-latest"}}]
}
// query via the alias: two hits
POST movies-latest/_search
{"query": {"match_all": {}}
}

// create an alias with a filter
POST _aliases
{"actions": [{"add": {"index": "movies-2019","alias": "movies-lastest-highrate","filter": {"range": {"rating": {"gte": 4}}}}}]
}
// only one hit
POST movies-lastest-highrate/_search
{"query": {"match_all": {}}
}

Composite ranking: optimizing scores with Function Score Query

Query the documents below: their content is identical, and we weight the score by the number of votes, changing the scoring logic.
function_score can freely reshape the score after the query matches, e.g. with a script or custom logic.
field_value_factor by default multiplies the query score by the value of the chosen field.


DELETE blogs
PUT /blogs/_doc/1
{"title":   "About popularity","content": "In this post we will talk about...","votes":   0
}

PUT /blogs/_doc/2
{"title":   "About popularity","content": "In this post we will talk about...","votes":   100
}

PUT /blogs/_doc/3
{"title":   "About popularity","content": "In this post we will talk about...","votes":   1000000
}

POST /blogs/_search
{"query": {"function_score": {"query": {"multi_match": {"query":"popularity","fields": ["title", "content"]}},
"field_value_factor": {
"field": "votes", // the scoring field
"modifier": "log1p" // the modifier function; optionally add "factor": 0.1
},
"boost_mode": "sum", // multiply by default; changed to sum here
"max_boost": 3 // cap the function's contribution at 3
}}
}

Random seed
Example: keep the ad ordering stable for a given user while it varies across users by seeding with the user's session or id.
This can improve ad exposure.

POST /blogs/_search
{"query": {"function_score": {"random_score": {"seed": 911119,"field": "_seq_no"}}}
}

Term & Phrase Suggester

Term suggester: when a misspelled term is entered, ES checks whether it exists in the index; if not, it matches and returns similar terms.
Example: for the input lucen rock, a suggest query on the same field and keywords returns the suggestions lucene and rocks.
The default suggest_mode is missing: suggest only if the term is absent. Suggestion scores are based on the character difference between the input token and the candidate token.


DELETE articles
POST articles/_bulk
{ "index" : { } }
{ "body": "lucene is very cool"}
{ "index" : { } }
{ "body": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "body": "Elasticsearch rocks"}
{ "index" : { } }
{ "body": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "body": "Elk stack rocks"}
{ "index" : {} }POST articles/_search
{"size": 1, "query": {"match": {"body": "lucen rock"}},"suggest": {"term-suggestion": {"text": "lucen rock","term": {"field": "body","suggest_mode":"missing"//词项匹配容忍的前缀, 输入hock也会推荐rock,"prefix_length":0}}}
}

The Phrase Suggester adds more parameters; confidence sets the threshold for returned results: only candidates scoring above it are returned.

POST /articles/_search
{"suggest": {"my-suggestion": {"text": "lucne and elasticsear rock hello world ","phrase": {"field": "body","max_errors":2,"confidence":2,"direct_generator":[{"field":"body","suggest_mode":"always"}],"highlight": {"pre_tag": "<em>","post_tag": "</em>"}}}}
}

Autocomplete and context-based suggestions

Type a keyword and ES returns completions for it.
This is implemented not with the inverted index but with an FST, a compact in-memory, map-like structure that is well suited to prefix matching.

The completed field must be mapped with type completion.
The query below returns completions for the prefix elk.


DELETE articles
PUT articles
{"mappings": {"properties": {"title_completion":{"type": "completion"}}}
}

POST articles/_bulk
{ "index" : { } }
{ "title_completion": "lucene is very cool"}
{ "index" : { } }
{ "title_completion": "Elasticsearch builds on top of lucene"}
{ "index" : { } }
{ "title_completion": "Elasticsearch rocks"}
{ "index" : { } }
{ "title_completion": "elastic is the company behind ELK stack"}
{ "index" : { } }
{ "title_completion": "Elk stack rocks"}
{ "index" : {} }POST articles/_search
{"size": 0, "suggest": {"article-suggester": {"prefix":"elk","completion": {"field": "title_completion"}}}
}

Beyond plain completion there is context-based suggestion (suggest context).
Add a context to the completion mapping, tag each document with a context value at index time, and pass the context at query time: effectively an extra query condition.


DELETE comments
PUT comments
PUT comments/_mapping
{"properties":{"comment_autocomplete":{"type":"completion","contexts":[{"type":"category","name":"comment_category"}]}}
}

POST comments/_doc
{"comment":"I love the star war movies","comment_autocomplete":{"input":["star wars"],"contexts":{"comment_category":"movies"}}
}

POST comments/_doc
{"comment":"Where can I find a Starbucks","comment_autocomplete":{"input":["starbucks"],"contexts":{"comment_category":"coffee"}}
}

POST comments/_search
{"suggest": {"MY_SUGGESTION": {"prefix": "sta","completion":{"field":"comment_autocomplete","contexts":{"comment_category":"movies"}}}}
}

Which query type suits which scenario

Cross-cluster search

With a single cluster, the master carries a heavy load and becomes the bottleneck; nodes cannot be scaled out indefinitely.
Early ES supported cross-cluster queries through tribe node: it had to join each cluster, queries went through it, it restarted slowly, and index names collided across clusters.
Since 5.3, cross cluster search needs no client node joining the clusters; any node can serve as the coordinating node for the query.

A demo on Windows

Start three clusters

bin/elasticsearch -E node.name=cluster0node -E cluster.name=cluster0 -E path.data=cluster0_data -E discovery.type=single-node -E http.port=9200 -E transport.port=9300
bin/elasticsearch -E node.name=cluster1node -E cluster.name=cluster1 -E path.data=cluster1_data -E discovery.type=single-node -E http.port=9201 -E transport.port=9301
bin/elasticsearch -E node.name=cluster2node -E cluster.name=cluster2 -E path.data=cluster2_data -E discovery.type=single-node -E http.port=9202 -E transport.port=9302

Send the requests with Postman (curl equivalents shown below)

curl -XPUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'curl -XPUT "http://localhost:9201/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'curl -XPUT "http://localhost:9202/_cluster/settings" -H 'Content-Type: application/json' -d'
{"persistent":{"cluster":{"remote":{"cluster0":{"seeds":["127.0.0.1:9300"],"transport.ping_schedule":"30s"},"cluster1":{"seeds":["127.0.0.1:9301"],"transport.compress":true,"skip_unavailable":true},"cluster2":{"seeds":["127.0.0.1:9302"]}}}}}'#创建测试数据, 每个集群的数据不同
curl -XPOST "http://localhost:9200/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user1","age":10}'curl -XPOST "http://localhost:9201/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user2","age":20}'curl -XPOST "http://localhost:9202/users/_doc" -H 'Content-Type: application/json' -d'
{"name":"user3","age":30}'

Test a query that spans several clusters


GET /users,cluster1:users,cluster2:users/_search
{"query": {"range": {"age": {"gte": 10,"lte": 40}}}
}

Cluster distributed model, master election, and split-brain

Every ES node is a Java process; the cluster can be named, and so can each node.
Node types:

  1. Coordinating node: every node by default; handles and routes requests. In production, pin roles to dedicated nodes.
  2. Data node: every node by default; holds shards and scales data capacity.
  3. Master node: maintains indices and cluster state and tracks shard locations.
  4. Master-eligible node: every node by default; participates in the election when the master fails.

Split-brain
When the network flaps and a cluster is split into two regions, the region without a master elects its own and keeps serving. After the network heals and the cluster rejoins, the master that loses the re-election discards the data it accepted during the partition.

Older versions avoided this with a quorum threshold: an election proceeds only if the number of reachable master-eligible nodes exceeds the configured value. 7.0 removed the setting; ES now manages this itself.
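For reference, a sketch of the pre-7.0 quorum setting (removed in 7.0); the conventional value is (master-eligible nodes / 2) + 1:

# elasticsearch.yml, pre-7.0 only
discovery.zen.minimum_master_nodes: 2  # quorum for 3 master-eligible nodes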

Shards and cluster failover

A shard is a Lucene index; the number of primary shards cannot change once the index is created.
Replica shards provide high availability and can be adjusted on a live index; replicas also serve queries, so adding replicas adds read throughput, as sketched below.
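A sketch of adjusting replicas on a live index (index name and numbers illustrative); the primary count is fixed at creation, the replica count is not:

PUT my_index
{"settings": {"number_of_shards": 3,"number_of_replicas": 1}}

PUT my_index/_settings
{"number_of_replicas": 2}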

Choosing shard counts
Too few shards make data hard to scale out; too many hurt performance.
More replicas mean more synchronization work, which slows writes.

Example: a cluster with 3 primaries and 1 replica.
After the master node fails, a new master is elected first, and the affected shards are reallocated across the remaining nodes.

Distributed document storage

To spread data evenly and use the cluster's capacity, a document's shard is computed by default as the hash of its id modulo the number of primary shards, which is exactly why the primary shard count cannot change. You can also route on a chosen value so that related documents land on one specific shard; see the sketch below.
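The default routing formula, plus a custom-routing sketch (index, id, and routing value are illustrative):

shard = hash(_routing) % number_of_primary_shards  # _routing defaults to the document _id

PUT my_index/_doc/1?routing=user123
{"user": "user123","message": "route all of this user's docs to one shard"}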

Updating a document
The coordinating node hashes to the document's shard; the update deletes the old version, indexes the new one, then responds to the coordinating node.

Deleting a document
The coordinating node routes to the document's shard, deletes it on the primary, then on the replicas, then responds.

Shards and their lifecycle

Immutability of the inverted index
Benefits:
No concurrent-write concerns, which avoids locking problems;
as long as memory allows, the first read loads from the filesystem into cache and later reads hit the cache;
caches are easy to build and maintain, and the data compresses well.

Drawbacks:
Making a changed document searchable requires rebuilding the entire index.

Lucene index
A single Lucene inverted index is a segment; a commit point records the set of segments; deleted documents are recorded in .del files. A query walks all segments and filters out the deleted documents.

Refresh
Indexed documents first land in the index buffer; after 1 second (configurable, as sketched below) everything in the buffer becomes a new segment. The segment is not yet on disk; it lives in the filesystem cache. Turning the index buffer into a segment is called refresh.
The index buffer defaults to 10% of the heap; filling it also triggers a refresh.
Refresh does not fsync.
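A sketch of tuning the refresh interval on an illustrative index; "-1" disables automatic refresh entirely:

PUT my_index/_settings
{"index": {"refresh_interval": "30s"}}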

Transaction log
To avoid losing data that is still only in caches while already serving queries, every write is also appended to an on-disk transaction log; creating a segment does not clear it.
Each shard has its own transaction log;
the transaction log defaults to 500 MB.

Flush
Persists everything buffered to disk: it first runs a refresh, then fsyncs the cached segments to disk, then clears the transaction log.
It runs every 30 minutes by default, and also when the transaction log fills up.

Merge
As segments keep landing on disk their count grows; merge combines the scattered segments and purges the .del data.
ES manages merging automatically; it can also be triggered via the API, as sketched below.
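A sketch of triggering a merge by hand through the force merge API (index name illustrative):

POST my_index/_forcemerge?max_num_segments=1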

Anatomy of distributed queries and relevance scoring

How distributed search runs: query then fetch

At query time the coordinating node picks, for each shard, one copy at random among the primaries and replicas holding the data, and runs the query there.
Each shard returns from + size entries containing only document ids and sort values.
The coordinating node re-sorts them all, keeps from + size entries, then fetches the real document bodies from the owning shards with a multi-get style request.
Finally it responds to the client.

The drawback: the coordinating node must receive n * (from + size) entries, and scoring is done independently on each shard, so skewed storage makes scores inaccurate.

How to mitigate
With small data volumes, set the shard count to 1; there is no distributed search and no coordinating-node merge.
Otherwise distribute the data evenly to avoid score skew, or use dfs query then fetch, which ships the detailed term statistics back to the coordinating node for a global calculation at a noticeably higher cost.

Example: 20 shards storing 3 documents:
"good"
"good morning"
"good morning everyone"

Under a plain query, each shard computes IDF from its own documents only, as if its single document were the whole corpus.

dfs query then fetch instead aggregates the statistics from all three shards before scoring.
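A sketch of opting into the global calculation via the search_type parameter (index and field names illustrative):

POST my_index/_search?search_type=dfs_query_then_fetch
{"query": {"match": {"content": "good"}}}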

Sorting, doc values & fielddata

Sorting on a field skips score computation; _score is null.

# Sorting on multiple fields
POST /kibana_sample_data_ecommerce/_search
{"size": 5,"query": {"match_all": {}},"sort": [{"order_date": {"order": "desc"}},{"_doc":{"order": "asc"}},{"_score":{ "order": "desc"}}]
}

By default text fields cannot be sorted; you must enable the fielddata setting.

PUT kibana_sample_data_ecommerce/_mapping
{"properties": {"customer_full_name" : {"type" : "text","fielddata": true,"fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}}}}
}

Fielddata is a forward index: document ids mapped to values, which makes sorting on analyzed text possible.
The default for other field types is doc values, a columnar store built along with the index: it reduces heap usage at the cost of extra index maintenance.
Fielddata can be toggled at any time, but changing doc_values requires a reindex, as in the sketch below.
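A sketch of turning doc values off for a keyword field that is never sorted or aggregated (field name illustrative); turning them back on later means a reindex:

PUT my_index
{"mappings": {"properties": {"session_id": {"type": "keyword","doc_values": false}}}}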

Pagination and traversal

Because ES stores documents across shards, a from=100, size=10 query has to fetch 110 entries from every shard and merge 110 * number_of_shards entries on the coordinating node for re-sorting: the deep pagination problem.
By default ES limits a query to the first 10,000 documents and errors beyond that.

search_after

Give the search a sort field plus a unique tie-breaker field (usually _id). Each page then only needs size * number_of_shards entries returned to the coordinating node;
the next page simply passes in the previous page's last sort value and document id to move forward.
You cannot jump to a page number, only keep paging forward; the follow-up query is sketched after the sample below.

DELETE users
POST users/_doc
{"name":"user1","age":10}POST users/_doc
{"name":"user2","age":11}POST users/_doc
{"name":"user2","age":12}POST users/_doc
{"name":"user2","age":13}POST users/_search
{"query": {"match_all": {}},"size": 2,"sort": [{"age":  "desc"},{"_id":  "asc"}]
}
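A sketch of the follow-up page: search_after passes the previous page's last sort values (the second value below stands in for the auto-generated _id returned by the first query):

POST users/_search
{"query": {"match_all": {}},"size": 2,"search_after": [13, "placeholder_doc_id"],"sort": [{"age": "desc"},{"_id": "asc"}]
}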

scroll api
The first call creates a snapshot of the current result set; subsequent reads query the snapshot.
Each later call passes the scroll_id generated by the first call and pages forward.
The drawback is that documents added after the snapshot cannot be seen.
The "scroll": "1m" passed on later calls extends the snapshot's lifetime;
calling the scroll-creating API again generates a fresh snapshot id.


#Scroll API
DELETE users
POST users/_doc
{"name":"user1","age":10}POST users/_doc
{"name":"user2","age":20}POST users/_doc
{"name":"user3","age":30}POST users/_doc
{"name":"user4","age":40}POST users/_search?scroll=3m
{ "size":2,"query": {"match_all": {}}
}
POST users/_doc
{"name":"user5","age":50}
POST users/_doc
{"name":"user7","age":70}
POST _search/scroll
{"scroll":"1m","scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFmgzaWpSLWdfVFpLRktpRXRPdjdNRkEAAAAAAAAKhhZYcWtYWU92LVNTdXpmVjQtRjFnSDJn"
}

Which to use

For ordinary queries over the latest data, a plain query is enough.

For fetching or exporting all documents for processing, use the scroll api: no realtime requirement, and it saves resources.

For pagination, use from/size; for deep pagination, add search_after, which saves resources while keeping results realtime.

bucket&metric 聚合分析与嵌套聚合

ES aggregations resemble SQL's count and group by.

Prepare the sample data

DELETE /employees
PUT /employees/
{"mappings" : {"properties" : {"age" : {"type" : "integer"},"gender" : {"type" : "keyword"},"job" : {"type" : "text","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 50}}},"name" : {"type" : "keyword"},"salary" : {"type" : "integer"}}}
}

PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}

Metric aggregations return only the metrics, not the matching documents, saving resources.
Inside aggs you name each aggregation and specify the function and target field.


POST employees/_search
{"size": 0,"aggs": {"max_salary": {"max": {"field": "salary"}},"min_salary":{"min": {"field": "salary"}}}
}

Return several metrics at once

POST employees/_search
{"size": 0,"aggs": {"stats_salary": {"stats": {"field": "salary"}}}
}

Bucket aggregations group documents for further processing; the terms aggregation buckets by value (numeric and date fields work too), and size controls how many buckets come back.
On a keyword field it groups the distinct values and shows each bucket's document count.
On a text field you must enable fielddata first, and bucketing then runs over the analyzed tokens, which is rarely what you want.
To speed up terms aggregations, enable eager_global_ordinals in the mapping (example further below).


POST employees/_search
{"size": 0,"aggs": {"job_agg": {"terms": {"field": "job.keyword"}}}
}

The cardinality aggregation resembles count(distinct)

POST employees/_search
{"size": 0,"aggs": {"cardinate": {"cardinality": {"field": "job.keyword"}}}
}

Sub-aggregations run further analysis inside each bucket.
Example: bucket by job first, then use the top_hits aggregation to fetch the three oldest employees per job.

POST employees/_search
{"size": 0, "aggs": {"job_agg": {"terms": {"field": "job.keyword","size": 10},"aggs": {"age_agg": {"top_hits": {"sort": [{"age":{"order":"desc"}}], "size": 3}}}}}
}

eager_global_ordinals
eager_global_ordinals spends heap to speed up terms aggregations at the cost of slower document indexing; it can be toggled at any time.

PUT my-index-000001/_mapping
{"properties": {"tags": {"type": "keyword","eager_global_ordinals": true}}
}

The range aggregation

POST employees/_search
{"size": 0,"aggs": {"salary_agg": {"range": {"field": "salary","ranges": [{"to": 10000},{"from": 10000,"to": 20000},{"key": "大于10000", "from": 20000}]}}}
}

The histogram aggregation: 5,000-wide intervals counting how many people fall in each salary band

POST employees/_search
{"size": 0,"aggs": {"salary_agg_histogram": {"histogram": {"field": "salary","interval": 5000,"extended_bounds": {"min": 0,"max": 100000}}}}
}

Deeper nesting: income stats per gender within each job


POST employees/_search
{"size": 0,"aggs": {"job_agg": {"terms": {"field": "job.keyword"}, "aggs": {"gender_agg": {"terms": {"field": "gender"},"aggs": {"salary_agg": {"stats": {"field": "salary"}}}}}}}
}

Pipeline aggregations

ES pipelines
A pipeline aggregation consumes another aggregation's output; there are sibling pipelines and parent pipelines.
Example: which job has the lowest average salary?
This is a sibling pipeline: the min_salary_by_job result sits at the same level as job_agg.


POST employees/_search
{"size": 0,"aggs": {"job_agg": {"terms": {"field": "job.keyword","size": 20},"aggs": {"salary_avg": {"avg": {"field": "salary"}}}},"min_salary_by_job":{"min_bucket": {"buckets_path": "job_agg>salary_avg"}}}
}

Result (screenshot)

Average salary per age plus the difference between adjacent ages.
This is a parent pipeline: the derivative result is embedded inside each bucket next to avg_salary.

POST employees/_search
{"size": 0,"aggs": {"age_agg": {"histogram": {"min_doc_count": 1, "field": "age","interval": 1},"aggs": {"avg_salary": {"avg": {"field": "salary"}},"derivative_avg_salary":{"derivative": {"buckets_path": "avg_salary"}}}}}
}

Result (screenshot)

Aggregation scope and ordering

query and filter restrict the aggregation scope


POST employees/_search
{"size": 0,"query": {"range": {"age": {"gte": 40}}},"aggs": {"job_agg": {"terms": {"field": "job.keyword","size": 10}}}
}

POST employees/_search
{"size": 0,"aggs": {"age_agg": {"filter": {"range": {"age": {"gte": 35}}},"aggs": {"job_agg": {"terms": {"field": "job.keyword","size": 10}}}},"job_agg": {"terms": {"field": "job.keyword","size": 10}}}
}

post_filter filters the hits after aggregation, leaving the aggregation results untouched

POST employees/_search
{"aggs": {"job_agg": {"terms": {"field": "job.keyword","size": 10}}},"post_filter": {"term": {"job.keyword": "Dev Manager"}}
}

global changes the aggregation scope: the average below is computed over all documents, ignoring the query


POST employees/_search
{"size": 0,"query": {"range": {"age": {"gte": 40}}},"aggs": {"job_agg": {"terms": {"field": "job.keyword","size": 10}},"avg_salary":{"global": {},"aggs": {"avg_salary": {"avg": {"field": "salary"}}}}}
}

Ordering

Specify how the buckets are ordered

POST employees/_search
{"size": 0,"aggs": {"job_agg": {"terms": {"field": "job.keyword","order": [{"_count": "asc"}, {"_key": "desc"}]}}}
}

Order by an additional metric

POST employees/_search
{"size": 0,"aggs": {"job_agg": {"terms": {"field": "job.keyword","order": [{"job_avg_agg": "asc"}]},"aggs": {"job_avg_agg": {"avg": {"field": "salary"}}}}}
}

Order by a metric inside a sub-aggregation


POST employees/_search
{"size": 0,"aggs": {"job_agg": {"terms": {"field": "job.keyword","order": {"stats_agg.min":"desc"}},"aggs": {"stats_agg": {"stats": {"field": "salary"}}}}}
}

How aggregations work and their precision issues

Because storage is distributed, data can be skewed across shards, so bucket counts in terms aggregations may be inaccurate.


Because bucket c's count is uneven across the shards, bucket d is missing from the result.
Fixes: set the shard count to 1,
or raise the number of top buckets each shard computes locally; shard_size defaults to size * 1.5 + 10.
Turn on show_term_doc_count_error to see how accurate your result is.


GET kibana_sample_data_flights/_search
{"size": 0,"aggs": {"weather": {"terms": {"field":"OriginWeather","size":5,"show_term_doc_count_error":true}}}
}

Objects and nested objects

Unlike a relational database, ES flattens the data into a single document, which speeds up queries because no table joins are needed, but it is bad for documents that are updated frequently.

DELETE blog
# Set the blog mapping
PUT /blog
{"mappings": {"properties": {"content": {"type": "text"},"time": {"type": "date"},"user": {"properties": {"city": {"type": "text"},"userid": {"type": "long"},"username": {"type": "keyword"}}}}}
}
# Insert one blog
PUT blog/_doc/1
{"content":"I like Elasticsearch","time":"2019-01-01T00:00:00","user":{"userid":1,"username":"Jack","city":"Shanghai"}
}
# Query sub-object properties
POST blog/_search
{"query": {"bool": {"must": [{"match": {"content": "Elasticsearch"}},{"match": {"user.username": "Jack"}}]}}
}

actors is an object-typed field; its properties are flattened into parallel arrays at storage time, so the query below finds a hit even though no such person exists.

DELETE my_movies

# Movie mapping
PUT my_movies
{"mappings" : {"properties" : {"actors" : {"properties" : {"first_name" : {"type" : "keyword"},"last_name" : {"type" : "keyword"}}},"title" : {"type" : "text","fields" : {"keyword" : {"type" : "keyword","ignore_above" : 256}}}}}
}

# Index one movie
POST my_movies/_doc/1
{"title":"Speed","actors":[{"first_name":"Keanu","last_name":"Reeves"},{"first_name":"Dennis","last_name":"Hopper"}]
}

POST my_movies/_search
{"query": {"bool": {"must": [{"match": {"actors.first_name": "Keanu"}},{"match": {"actors.last_name": "Hopper"}}]}}
}

Remapping with nested
Each of the two actors objects gets its own hidden Lucene document, and queries join them at search time, so the mixed name below no longer matches.


DELETE my_movies
# Create the nested object mapping
PUT my_movies
{"mappings" : {"properties" : {"actors" : {"type": "nested","properties" : {"first_name" : {"type" : "keyword"},"last_name" : {"type" : "keyword"}}},"title" : {"type" : "text","fields" : {"keyword":{"type":"keyword","ignore_above":256}}}}}
}

POST my_movies/_doc/1
{"title":"Speed","actors":[{"first_name":"Keanu","last_name":"Reeves"},{"first_name":"Dennis","last_name":"Hopper"}]
}

POST my_movies/_search
{"query": {"nested": {"path": "actors","query": {"bool": {"must": [{"match": {"actors.first_name": "Keanu"}},{"match": {"actors.last_name": "Hopper"}}]}}}}
}

Aggregating on the nested type


POST my_movies/_search
{"size":0,"aggs": {"actors": {"nested": {"path": "actors"},"aggs": {"name_agg": {"terms": {"field": "actors.first_name","size": 10}}}}}
}

Parent-child document relationships

nested treats the parent and its children as one unit: changing a child reindexes the whole combined document, which suits children that are read often and written rarely.
ES parent-child documents are maintained independently: no whole-document reindex is needed, which suits children that are written more than they are read.
Queries join the two at search time.

The mapping declares the parent-child relation. When indexing, each document declares whether it is a parent or a child; a child declares its parent id (ids must stay unique) and must set the routing parameter so that parent and children live on the same shard.


DELETE my_blogs

# Set the parent/child mapping
PUT my_blogs
{"settings": {"number_of_shards": 2},"mappings": {"properties": {"blog_comments_relation": {"type": "join","relations": {"blog": "comment"}},"content": {"type": "text"},"title": {"type": "keyword"}}}
}

# Index a parent document
PUT my_blogs/_doc/blog1
{"title":"Learning Elasticsearch","content":"learning ELK @ geektime","blog_comments_relation":{"name":"blog"}
}

# Index a parent document
PUT my_blogs/_doc/blog2
{"title":"Learning Hadoop","content":"learning Hadoop","blog_comments_relation":{"name":"blog"}
}

# Index a child document
PUT my_blogs/_doc/comment1?routing=blog1
{"comment":"I am learning ELK","username":"Jack","blog_comments_relation":{"name":"comment","parent":"blog1"}
}

# Index a child document
PUT my_blogs/_doc/comment2?routing=blog2
{"comment":"I like Hadoop!!!!!","username":"Jack","blog_comments_relation":{"name":"comment","parent":"blog2"}
}

# Index a child document
PUT my_blogs/_doc/comment3?routing=blog2
{"comment":"Hello Hadoop","username":"Bob","blog_comments_relation":{"name":"comment","parent":"blog2"}
}

Fetching a parent document does not include its children; use a parent_id query to get the children.

# Get by parent document id
GET my_blogs/_doc/blog2
# parent_id query
POST my_blogs/_search
{ "query": {"parent_id": {"type": "comment","id": "blog2"}}
}
# Get a child document (routing required)
GET my_blogs/_doc/comment3?routing=blog2

Update a child, specifying its parent id; querying the child and parent afterwards shows the parent's version unchanged while the child's changed.

PUT my_blogs/_doc/comment3?routing=blog2
{"comment":"Hello Hadoop??","username":"Bob","blog_comments_relation":{"name":"comment","parent":"blog2"}
}

update_by_query and reindex APIs

Changing a mapping does not affect existing documents, only newly indexed ones. To fix the old ones, use update_by_query or the reindex API.

update_by_query: limited to additive changes such as new fields; it re-indexes the index's existing documents in place.

reindex: migrates the source index into a new index; you can change field types, increase the shard count, and even cross clusters.
The source index must have the _source field enabled; a query can migrate just a subset of documents.
Setting op_type to create migrates only the ids that do not conflict with existing ones in the destination.
A cross-cluster reindex requires whitelisting the remote in the ES config file.
It can run asynchronously, with progress available via GET _tasks?detailed=true&actions=*reindex. A combined sketch follows below.
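A hedged sketch combining these options, reusing the blogs/blogs_fix names from the later example (a remote source would additionally need reindex.remote.whitelist in elasticsearch.yml):

POST _reindex?wait_for_completion=false
{"source": {"index": "blogs"},"dest": {"index": "blogs_fix","op_type": "create"}}

# check the async task's progress
GET _tasks?detailed=true&actions=*reindex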

update_by_query
Example: after documents are already indexed, add a sub-field that uses the english analyzer.


DELETE blogs/

# Index a document
PUT blogs/_doc/1
{"content":"Hadoop is cool","keyword":"hadoop"
}

# Check the mapping
GET blogs/_mapping

# Update the mapping: add a sub-field that uses the english analyzer
PUT blogs/_mapping
{"properties": {"content": {"type": "text","fields": {"english": {"type": "text","analyzer": "english"}}}}
}

# Index another document
PUT blogs/_doc/2
{"content":"Elasticsearch rocks","keyword":"elasticsearch"
}

Querying the english sub-field finds doc 2 but not doc 1: doc 1 was indexed before the sub-field existed, so it has no content.english tokens to match.
explain shows the query runs against the english analyzer's lowercase, stemmed hadoop.

After re-indexing with update_by_query, doc 1 matches too.


# Query the newly written document
POST blogs/_search
{"query": {"match": {"content.english": "Elasticsearch"}},"explain": true
}

# Query the document indexed before the mapping change
POST blogs/_search
{"query": {"match": {"content.english": "Hadoop"}},"explain": true
}

POST blogs/_update_by_query

reindex

Create a new index whose mapping changes the original keyword field from text to the keyword type, reindex the data into it, and the new index can aggregate on that field.

DELETE blogs_fix

# Create the new index with the new mapping
PUT blogs_fix/
{"mappings": {"properties" : {"content" : {"type" : "text","fields" : {"english" : {"type" : "text","analyzer" : "english"}}},"keyword" : {"type" : "keyword"}}    }
}

# Reindex API
POST  _reindex
{"source": {"index": "blogs"},"dest": {"index": "blogs_fix"}
}

POST blogs_fix/_search
{"size": 0,"aggs":{"keyword_agg":{"terms": {"field": "keyword","size": 10}}}
}

Ingest pipelines and Painless scripts

An ingest pipeline processes documents before they are indexed, similar to Logstash: it can split field values, add new fields, format dates, and change case, removing the architectural complexity of deploying Logstash.

Ingest pipeline

DELETE tech_blogs

# Blog data: three fields; tags are comma-separated
PUT tech_blogs/_doc/1
{"title":"Introducing big data......","tags":"hadoop,elasticsearch,spark","content":"You konw, for big data"
}

# Simulate splitting tags
POST _ingest/pipeline/_simulate
{"pipeline": {"description": "split tags","processors": [{"split": {"field": "tags","separator": ","}}]},"docs": [{"_index": "index","_id": "id","_source": {"title": "Introducing big data......","tags": "hadoop,elasticsearch,spark","content": "You konw, for big data"}},{"_index": "index","_id": "idxx","_source": {"title": "Introducing cloud computering","tags": "openstack,k8s","content": "You konw, for cloud"}}]
}
# Also verify adding a field
POST _ingest/pipeline/_simulate
{"pipeline": {"description": "split and set","processors": [{"split": {"field": "tags","separator": ","},"set": {"field": "views","value": 0}}]},"docs": [{"_index": "index","_id": "id","_source": {"title": "Introducing big data......","tags": "hadoop,elasticsearch,spark","content": "You konw, for big data"}},{"_index": "index","_id": "idxx","_source": {"title": "Introducing cloud computering","tags": "openstack,k8s","content": "You konw, for cloud"}}]
}

# Register the pipeline
PUT _ingest/pipeline/blog_pipeline
{"description": "a blog pipeline","processors": [{"split": {"field": "tags","separator": ","}},{"set":{"field": "views","value": 0}}]
}

# View the pipeline
GET _ingest/pipeline/blog_pipeline

# Test the pipeline
POST _ingest/pipeline/blog_pipeline/_simulate
{"docs": [{"_source": {"title": "Introducing cloud computering","tags": "openstack,k8s","content": "You konw, for cloud"}}]
}
# Index one document without the pipeline
PUT tech_blogs/_doc/1
{"title":"Introducing big data......","tags":"hadoop,elasticsearch,spark","content":"You konw, for big data"
}

# Index one document with the pipeline
PUT tech_blogs/_doc/2?pipeline=blog_pipeline
{"title": "Introducing cloud computering","tags": "openstack,k8s","content": "You konw, for cloud"
}

# Query
POST tech_blogs/_search
{}

Because the pipeline turned tags into an array when doc 2 was indexed while the index mapping still says string, the update_by_query run must exclude the documents whose tags are already arrays (those that already have views).

# Add a condition to update_by_query
POST tech_blogs/_update_by_query?pipeline=blog_pipeline
{"query": {"bool": {"must_not": {"exists": {"field": "views"}}}}
}

Painless script
Scripts process document field values with Java-like syntax, e.g. String.contains(); Painless has been supported since 6.0.
Script compilation is expensive, so ES caches compiled scripts (100 by default), keeping performance high.


DELETE tech_blogs

PUT tech_blogs/_doc/1
{"title":"Introducing big data......","tags":"hadoop,elasticsearch,spark","content":"You konw, for big data","views":0
}
# Use a script in the update
POST tech_blogs/_update/1
{"script": {"source": "ctx._source.views += params.new_views","params": {"new_views":100}}
}

# Check the views count
POST tech_blogs/_search
{}

# Store the script in cluster state
POST _scripts/update_views
{"script":{"lang": "painless","source": "ctx._source.views += params.new_views"}
}

# Use the stored script
POST tech_blogs/_update/1
{"script": {"id": "update_views","params": {"new_views":1000}}
}

Fetch a field value and add a random number to it

GET tech_blogs/_search
{"script_fields": {"rnd_views": {"script": {"lang": "painless","source": """java.util.Random rnd = new Random();doc['views'].value+rnd.nextInt(1000);"""}}},"query": {"match_all": {}}
}

Data modeling

For every field, consider the following points when setting the mapping.
ES mapping parameters
Field type
Use text for fields needing full-text search;
use keyword for fields needing aggregations and filter queries;
add sub-fields when special analysis is required.

Structured data
For numbers, pick the narrowest type that fits: use byte rather than long when possible.
Enumerations should be keyword, even when the values look numeric.

Search
If a field needs no handling at all, set enabled to false;
if it only does not need search, set index to false;
if it is searched but used only for aggregation-style filtering, set norms to false to save disk: norms improve scoring precision at the cost of extra space.

Aggregation and sorting
Fields needing no search, aggregation, or sorting: set enabled to false;
fields never sorted or aggregated, even keyword ones: set doc_values or fielddata to false;
frequently updated keyword fields used in aggregations: set eager_global_ordinals to true (leverages caching to speed up terms aggregations). A combined sketch follows this list.
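A combined mapping sketch for the rules above (index and field names are illustrative):

PUT my_index
{"mappings": {"properties": {
"meta": {"type": "object","enabled": false}, // never searched, aggregated, or sorted; kept only in _source
"cover_url": {"type": "keyword","index": false}, // not searchable, still aggregatable via doc values
"title": {"type": "text","norms": false}, // searchable, but no length norms kept for scoring
"category": {"type": "keyword","eager_global_ordinals": true} // frequently updated and aggregated
}}}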

Extra storage
With _source disabled, enabling store keeps individual fields, saving space and lowering IO; but without _source you cannot reindex or update. Usually prefer raising the compression ratio instead.

Example

The cover URL never needs to be searched: set index: false; it can still be aggregated.

# Index one book
PUT books/_doc/1
{"title":"Mastering ElasticSearch 5.0","description":"Master the searching, indexing, and aggregation features in ElasticSearch Improve users’ search experience with Elasticsearch’s functionalities and develop your own Elasticsearch plugins","author":"Bharvi Dixit","public_date":"2017","cover_url":"https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}

# Optimize the field type
"cover_url": {"index": false, "type" : "keyword"
}

When storing large text fields, disable _source and enable store on the remaining fields.

PUT books
{"mappings" : {"_source": {"enabled": false},"properties" : {"author" : {"type" : "keyword","store": true},...}}}
}

A plain search shows no field data because there is no _source; name the fields with stored_fields.

# Search: display data via stored fields and highlight the content field
POST books/_search
{"stored_fields": ["title","author","public_date"],"query": {"match": {"content": "searching"}},"highlight": {"fields": {"content":{}}}}

Related APIs
index template & dynamic template help create indices quickly

index alias points a name at another index, enabling in-place switchover

update_by_query / reindex

Data modeling best practices

object: denormalization;
nested: one-to-many field values that are queried frequently;
parent-child: one-to-many with more updates than queries, e.g. articles and their comments.

As of 7.0.1, Kibana's visualization support for nested and parent-child is poor.

Avoid huge numbers of fields
Field mappings live in the cluster state, which every node must sync, so they affect performance;
the default field limit is 1,000;
runaway field counts usually come from dynamic mapping being enabled.

nested & key value
使用这种方式解决以下场景中不断产生新字段的问题
解决了大量字段的问题 , 但是kibana对nested可视化展示不好, 也增加了查询复杂度

"person":{"name":"张三","age":15,"id":123,...
}
# becomes
"person":[{"keyName":"name","value":"张三"}
{"keyName":"age","value":15}
...
]
DELETE cookie_service
# Use nested objects as key/value pairs
PUT cookie_service
{"mappings": {"properties": {"cookies": {"type": "nested","properties": {"name": {"type": "keyword"},"dateValue": {"type": "date"},"keywordValue": {"type": "keyword"},"IntValue": {"type": "integer"}}},"url": {"type": "text","fields": {"keyword": {"type": "keyword","ignore_above": 256}}}}}
}

## Index data using a key plus a value field of the right type
PUT cookie_service/_doc/1
{"url": "www.google.com","cookies": [{"name": "username","keywordValue": "tom"},{"name": "age","intValue": 32}]
}

PUT cookie_service/_doc/2
{"url": "www.amazon.com","cookies": [{"name": "login","dateValue": "2019-01-01"},{"name": "email","IntValue": 32}]
}

POST cookie_service/_search
{"query": {"nested": {"path": "cookies","query": {"bool": {"filter": [{"term": {"cookies.name": "age"}},{"range": {"cookies.intValue": {"gte": 30}}}]}}}}
}

Avoid wildcards

Turn a fuzzy query on one field into exact queries on several fields.
Example: querying version numbers.
Querying software_version originally needed patterns like 7.1.*; change the mapping to an object storing each component of the version in its own field.

PUT softwares/_doc/1
{"software_version":"7.1.0"
}
DELETE softwares
# Optimization: use an inner object
PUT softwares/
{"mappings": {"_meta": {"software_version_mapping": "1.1"},"properties": {"version": {"properties": {"display_name": {"type": "keyword"},"hot_fix": {"type": "byte"},"marjor": {"type": "byte"},"minor": {"type": "byte"}}}}}
}

PUT softwares/_doc/1
{"version":{"display_name":"7.1.0","marjor":7,"minor":1,"hot_fix":0  }
}
PUT softwares/_doc/2
{"version":{"display_name":"7.2.0","marjor":7,"minor":2,"hot_fix":0  }
}

PUT softwares/_doc/3
{"version":{"display_name":"7.2.1","marjor":7,"minor":2,"hot_fix":1  }
}

POST softwares/_search
{"query": {"bool": {"filter": [{"match": {"version.marjor": 7}},{"match": {"version.minor": 2}}]}}
}

Avoid aggregation skew caused by null values
Give empty values a default.

In the example the computed average is 5.0, while the intended average is 5/2 = 2.5; the null skewed the result.

PUT ratings/_doc/1
{"rating":5
}
PUT ratings/_doc/2
{"rating":null
}
POST ratings/_search
{"size": 0,"aggs": {"avg": {"avg": {"field": "rating"}}}
}

Handle null in the mapping: the aggregation then comes out right, (5+1)/2 = 3.0, even though doc 2's stored value is still null.

# null_value fixes the aggregation problem
DELETE ratings
PUT ratings
{"mappings": {"properties": {"rating": {"type": "float","null_value": 1.0}}}
}

Manage mapping metadata externally
Mappings keep evolving; record a version number for the mapping file.

# Add meta information to the mapping for easier management
PUT softwares/
{"mappings": {"_meta": {"software_version_mapping": "1.0"}}
}

Summary

Structured search
term queries with keyword fields give exact matches;
match is full-text search and analyzes the input.

Query context vs filter context

Filters skip scoring and leverage caches;
in a bool query, filter and must_not both run as filters.

Relevance scoring
TF/IDF; field boosting shapes scores, e.g. negative demotes documents containing like.

Single-string multi-field queries
best_fields returns the single highest field score
most_fields combines the field scores
cross_fields

Search relevance
Multilingual content: give a field several sub-fields with different analyzers.
search template separates code logic from the search DSL, swapping query logic without touching client code.

Aggregations
bucket / metric / pipeline

Pagination
from/size
Use the scroll api for exports and avoid deep pagination

Distributed storage
Documents are routed by hashing the id; the primary shard count cannot change.

Shard internals
segment / transaction log / refresh / merge

Internals of distributed queries and aggregations
query then fetch: IDF is per shard, not global, so with small data volumes specify shard size to let more per-shard data join the calculation.
Data modeling
How ES handles relationships; common modeling steps; modeling practices.

Modeling tools
index template / dynamic template / ingest node / update by query / reindex / index alias

Best practices
Avoid too many fields; avoid wildcard fuzzy queries.

Understanding term vs match
The searches below, whether they hit, and why.

DELETE test

# With the default analyzer, stored as the tokens hello and world
PUT test/_doc/1
{"content":"Hello World"
}

# standard analyzer
GET _analyze
{"analyzer": "standard","text": "Hello World"
}

#1 match (input analyzed): hit
POST test/_search
{"profile": "true","query": {"match": {"content": "Hello World"}}
}
#2 match (input analyzed): hit
POST test/_search
{"profile": "true","query": {"match": {"content": "hello world"}}
}
#3 match on the keyword sub-field: hit, because keyword stores the original text
POST test/_search
{"profile": "true","query": {"match": {"content.keyword": "Hello World"}}
}
#4 exact match against the keyword sub-field: no hit
POST test/_search
{"profile": "true","query": {  "match": {"content.keyword": "hello world"}}
}

#5 term (input not analyzed): no hit; the stored tokens are lowercase, so the uppercase input cannot match
POST test/_search
{"profile": "true","query": {"term": {"content": "Hello World"}}
}
#6 term (input not analyzed): no hit for the lowercase phrase, since the index holds single-word tokens; the following term on the keyword sub-field with the original text does hit
POST test/_search
{"profile": "true","query": {"term": {"content": "hello world"}}
}

POST test/_search
{"profile": "true","query": {"term": {"content.keyword": "Hello World"}}
}

# standard analyzer demo
GET _analyze
{"analyzer": "standard","text": "2 running Quick brown-foxes leap over lazy dogs in the summer evening."
}

Quiz notes
Use aliases in production;
with more than one shard, specify shard_size to improve terms-aggregation accuracy;
use the cardinality aggregation to count distinct categories.
