Elasticsearch Basics: Field Collapsing

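Field collapsing lets you collapse search results based on the values of a field. Collapsing works by keeping only the top sorted document for each collapse key. For example, the query below retrieves the best tweet for each user and sorts them by number of likes.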

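GET /twitter/_search
{
  "query": {
    "match": {
      "message": "elasticsearch"
    }
  },
  "collapse": {
    "field": "user"
  },
  "sort": ["likes"],
  "from": 10
}

	collapse.field: collapse the result set using the "user" field
	sort: sort the top docs by number of likes
	from: define the offset of the first collapsed result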

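The total number of hits in the response is the number of matching documents without collapsing. The total number of distinct groups is unknown.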

The field used for collapsing must be a single-valued keyword or numeric field with doc_values enabled.

Collapsing is applied to the top hits only and does not affect aggregations.
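
The collapsing rules above can be mimicked in a few lines of plain Python to make the semantics concrete. This is only a sketch of the idea, not how Elasticsearch implements it: the `collapse` function and the sample tweets are invented for illustration, and the sort is by descending `likes` purely to keep the example readable.

```python
def collapse(hits, field, sort_key, offset=0, size=10):
    """Keep the top-sorted hit per distinct value of `field`, then page.

    Hypothetical in-memory illustration only; not Elasticsearch code.
    """
    ranked = sorted(hits, key=sort_key)      # best hit first
    seen = set()
    collapsed = []
    for hit in ranked:
        key = hit[field]
        if key not in seen:                  # the first hit seen per key is that group's top hit
            seen.add(key)
            collapsed.append(hit)
    return {
        "total": len(hits),                  # counts matching docs before collapsing
        "hits": collapsed[offset:offset + size],
    }


# Made-up sample data
tweets = [
    {"user": "alice", "likes": 7},
    {"user": "alice", "likes": 3},
    {"user": "bob", "likes": 5},
    {"user": "carol", "likes": 9},
    {"user": "bob", "likes": 1},
]

result = collapse(tweets, "user", sort_key=lambda h: -h["likes"], size=2)
print(result["total"], [h["user"] for h in result["hits"]])
```

Note that `total` still reports all five matching tweets even though only three distinct users exist, which mirrors the point that the number of groups is not part of the response.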

Expand collapse result

It is also possible to expand each collapsed top hit with the inner_hits option.

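GET /twitter/_search
{
  "query": {
    "match": {
      "message": "elasticsearch"
    }
  },
  "collapse": {
    "field": "user",
    "inner_hits": {
      "name": "last_tweets",
      "size": 5,
      "sort": [{ "date": "asc" }]
    },
    "max_concurrent_group_searches": 4
  },
  "sort": ["likes"]
}

	collapse.field: collapse the result set using the "user" field
	inner_hits.name: the name used for the inner hit section in the response
	inner_hits.size: the number of inner_hits to retrieve per collapse key
	inner_hits.sort: how to sort the documents inside each group
	max_concurrent_group_searches: the number of concurrent requests allowed to retrieve the inner_hits per group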

See the inner hits documentation for the complete list of supported options and the format of the response.

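It is also possible to request multiple inner_hits for each collapsed hit. This can be useful when you want several representations of the collapsed hits.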

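GET /twitter/_search
{
  "query": {
    "match": {
      "message": "elasticsearch"
    }
  },
  "collapse": {
    "field": "user",
    "inner_hits": [
      {
        "name": "most_liked",
        "size": 3,
        "sort": ["likes"]
      },
      {
        "name": "most_recent",
        "size": 3,
        "sort": [{ "date": "asc" }]
      }
    ]
  },
  "sort": ["likes"]
}

	collapse.field: collapse the result set using the "user" field
	most_liked: return the three most liked tweets for the user
	most_recent: return the three most recent tweets for the user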

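Groups are expanded by sending an additional query for each inner_hits request and for each collapsed hit returned in the response. This can slow things down significantly if you have too many groups and/or inner_hits requests.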

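The max_concurrent_group_searches request parameter can be used to control the maximum number of concurrent searches allowed in this phase. The default is based on the number of data nodes and the default search thread pool size.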
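
To get a feel for why expansion can be expensive, here is a small client-side analogy in Python. It is an assumption-laden sketch: the real expansion happens inside Elasticsearch, while `expand_groups` and `fake_search` below are hypothetical stand-ins that only show the "one follow-up search per collapsed hit per inner_hits request, with bounded concurrency" pattern.

```python
from concurrent.futures import ThreadPoolExecutor


def expand_groups(collapsed_keys, inner_hits_names, run_search,
                  max_concurrent_group_searches=4):
    """Run one extra search per (collapsed hit, inner_hits request) pair.

    Hypothetical illustration; the worker limit plays the role of
    max_concurrent_group_searches.
    """
    tasks = [(key, name) for key in collapsed_keys for name in inner_hits_names]
    with ThreadPoolExecutor(max_workers=max_concurrent_group_searches) as pool:
        # pool.map preserves the order of `tasks`
        results = list(pool.map(lambda task: run_search(*task), tasks))
    return dict(zip(tasks, results))


def fake_search(key, name):
    # Stand-in for the real follow-up query
    return f"{name} for {key}"


expanded = expand_groups(
    ["user1", "user2", "user3"],
    ["most_liked", "most_recent"],
    fake_search,
)
print(len(expanded))  # 3 collapsed hits x 2 inner_hits requests
```

Three collapsed hits with two inner_hits requests already mean six follow-up searches, so the count grows with both the page size and the number of inner_hits definitions.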

collapse cannot be used in conjunction with scroll, rescore or search after.

Second level of collapsing

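A second level of collapsing is also supported and is applied to inner_hits. For example, the following request finds the top scored tweet for each country, and within each country finds the top scored tweet for each user.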

The second level of collapsing does not allow inner_hits.

GET /twitter/_search
{
  "query": {
    "match": {
      "message": "elasticsearch"
    }
  },
  "collapse": {
    "field": "country",
    "inner_hits": {
      "name": "by_location",
      "collapse": { "field": "user" },
      "size": 3
    }
  }
}

Response:

{
  ...
  "hits": [
    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "9",
      "_score": ...,
      "_source": {...},
      "fields": { "country": ["UK"] },
      "inner_hits": {
        "by_location": {
          "hits": {
            ...,
            "hits": [
              {
                ...
                "fields": { "user": ["user124"] }
              },
              {
                ...
                "fields": { "user": ["user589"] }
              },
              {
                ...
                "fields": { "user": ["user001"] }
              }
            ]
          }
        }
      }
    },
    {
      "_index": "twitter",
      "_type": "_doc",
      "_id": "1",
      "_score": ...,
      "_source": {...},
      "fields": { "country": ["Canada"] },
      "inner_hits": {
        "by_location": {
          "hits": {
            ...,
            "hits": [
              {
                ...
                "fields": { "user": ["user444"] }
              },
              {
                ...
                "fields": { "user": ["user1111"] }
              },
              {
                ...
                "fields": { "user": ["user999"] }
              }
            ]
          }
        }
      }
    },
    ...
  ]
}
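
A two-level response like this is easier to consume once it is flattened. The following Python sketch walks such a response and maps each country to its users. It assumes the elided outer part follows the usual search response shape (`hits.hits`), and both the helper name `users_by_country` and the trimmed `sample` dict are made up for this illustration.

```python
def users_by_country(response, inner_hits_name="by_location"):
    """Map each first-level collapse key to its second-level collapse keys."""
    grouped = {}
    for hit in response["hits"]["hits"]:
        country = hit["fields"]["country"][0]      # collapse keys come back as arrays
        inner = hit["inner_hits"][inner_hits_name]["hits"]["hits"]
        grouped[country] = [h["fields"]["user"][0] for h in inner]
    return grouped


def _inner(users):
    # Build the nested inner_hits structure for the trimmed sample below
    return {"by_location": {"hits": {"hits": [{"fields": {"user": [u]}} for u in users]}}}


# Trimmed sample mirroring the response above; all other fields are omitted
sample = {
    "hits": {
        "hits": [
            {"fields": {"country": ["UK"]},
             "inner_hits": _inner(["user124", "user589", "user001"])},
            {"fields": {"country": ["Canada"]},
             "inner_hits": _inner(["user444", "user1111", "user999"])},
        ]
    }
}

print(users_by_country(sample))
```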

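Example

Let's walk through a concrete example to see how this works. It's simple to use.

First prepare the index and the data. We'll use recipes as the example: name is the dish name, type is the cuisine, and rating is the cumulative average user rating. The sample data stays in Chinese because the queries below match on the character 鱼 (fish); the cuisines are 湘菜 (Hunan), 粤菜 (Cantonese), 川菜 (Sichuan) and 西菜 (Western).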

PUT recipes

POST /recipes/type/_mapping
{
  "properties": {
    "name": { "type": "text" },
    "rating": { "type": "float" },
    "type": { "type": "keyword" }
  }
}

POST /recipes/_bulk
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"清蒸鱼头","rating":1,"type":"湘菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"剁椒鱼头","rating":2,"type":"湘菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"红烧鲫鱼","rating":3,"type":"湘菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"鲫鱼汤（辣）","rating":3,"type":"湘菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"鲫鱼汤（微辣）","rating":4,"type":"湘菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"鲫鱼汤（变态辣）","rating":5,"type":"湘菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"广式鲫鱼汤","rating":5,"type":"粤菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"鱼香肉丝","rating":2,"type":"川菜"}
{ "index":  { "_index": "recipes", "_type": "type"}}
{"name":"奶油鲍鱼汤","rating":2,"type":"西菜"}
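
If you would rather generate the bulk body than type it by hand, a small Python helper can build the same newline-delimited format. This is a minimal sketch: `build_bulk_body` is a made-up helper name, only two of the recipes are shown, and sending the body to the cluster is left out.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON bulk body: one action line followed by one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # ensure_ascii=False keeps the Chinese dish names readable
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk body has to end with a newline
    return "\n".join(lines) + "\n"


body = build_bulk_body("recipes", "type", [
    {"name": "清蒸鱼头", "rating": 1, "type": "湘菜"},
    {"name": "广式鲫鱼汤", "rating": 5, "type": "粤菜"},
])
print(body)
```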

Now let's see what a plain query gives us: search for dishes whose name contains 鱼 (fish) and return 3 results.

POST recipes/type/_search
{
  "query": {
    "match": {
      "name": "鱼"
    }
  },
  "size": 3
}

All Hunan dishes. Good grief, I've been staying away from spicy food lately, so this first page of results is garbage to me:

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9,
    "max_score": 0.26742277,
    "hits": [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYF_OA-dG63Txsd",
        "_score": 0.26742277,
        "_source": { "name": "鲫鱼汤（变态辣）", "rating": 5, "type": "湘菜" }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHXO_OA-dG63Txsa",
        "_score": 0.19100356,
        "_source": { "name": "红烧鲫鱼", "rating": 3, "type": "湘菜" }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHWy_OA-dG63TxsZ",
        "_score": 0.19100356,
        "_source": { "name": "剁椒鱼头", "rating": 2, "type": "湘菜" }
      }
    ]
  }
}

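Let's try again, this time sorting by rating to see what everyone else likes and whether there's anything I fancy. Run the query:

POST recipes/type/_search
{
  "query": {
    "match": {
      "name": "鱼"
    }
  },
  "sort": [
    { "rating": { "order": "desc" } }
  ],
  "size": 3
}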

The results are slightly better, but 2 of the 3 are still Hunan dishes, which still isn't quite right:

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits": [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYF_OA-dG63Txsd",
        "_score": null,
        "_source": { "name": "鲫鱼汤（变态辣）", "rating": 5, "type": "湘菜" },
        "sort": [5]
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYW_OA-dG63Txse",
        "_score": null,
        "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
        "sort": [5]
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHX7_OA-dG63Txsc",
        "_score": null,
        "_source": { "name": "鲫鱼汤（微辣）", "rating": 4, "type": "湘菜" },
        "sort": [4]
      }
    ]
  }
}

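Now I get it: I want to see other cuisines. Doesn't this place also have Western food, Cantonese food and so on? Come on, give me one dish per cuisine. For that, use a terms aggregation to get one bucket per unique term, then nest a top_hits aggregation that returns the single top hit by rating. It's a bit involved, but the query below makes it clear:

GET recipes/type/_search
{
  "query": {
    "match": {
      "name": "鱼"
    }
  },
  "sort": [
    { "rating": { "order": "desc" } }
  ],
  "aggs": {
    "type": {
      "terms": {
        "field": "type",
        "size": 10
      },
      "aggs": {
        "rated": {
          "top_hits": {
            "sort": [
              { "rating": { "order": "desc" } }
            ],
            "size": 1
          }
        }
      }
    }
  },
  "size": 0,
  "from": 0
}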

Look at the result below. The JSON structure is a bit complicated, but it's finally what we wanted: Hunan, Cantonese, Sichuan and Western all show up, one dish each, no repeats:

{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "type": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "湘菜",
          "doc_count": 6,
          "rated": {
            "hits": {
              "total": 6,
              "max_score": null,
              "hits": [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYF_OA-dG63Txsd",
                  "_score": null,
                  "_source": { "name": "鲫鱼汤（变态辣）", "rating": 5, "type": "湘菜" },
                  "sort": [5]
                }
              ]
            }
          }
        },
        {
          "key": "川菜",
          "doc_count": 1,
          "rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYr_OA-dG63Txsf",
                  "_score": null,
                  "_source": { "name": "鱼香肉丝", "rating": 2, "type": "川菜" },
                  "sort": [2]
                }
              ]
            }
          }
        },
        {
          "key": "粤菜",
          "doc_count": 1,
          "rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYW_OA-dG63Txse",
                  "_score": null,
                  "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
                  "sort": [5]
                }
              ]
            }
          }
        },
        {
          "key": "西菜",
          "doc_count": 1,
          "rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHY3_OA-dG63Txsg",
                  "_score": null,
                  "_source": { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" },
                  "sort": [2]
                }
              ]
            }
          }
        }
      ]
    }
  }
}
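
The nesting in that aggregation response is exactly what makes it awkward to consume. As a rough illustration, this Python sketch digs the single top hit out of every bucket. The helper name `top_hit_per_bucket` is made up, and `sample` is a hand-trimmed copy of the response above that keeps only the fields the helper reads.

```python
def top_hit_per_bucket(response, agg_name="type", top_hits_name="rated"):
    """Return (bucket key, dish name, rating) for the top hit of each terms bucket."""
    rows = []
    for bucket in response["aggregations"][agg_name]["buckets"]:
        # With top_hits size 1, every non-empty bucket carries exactly one hit
        top = bucket[top_hits_name]["hits"]["hits"][0]["_source"]
        rows.append((bucket["key"], top["name"], top["rating"]))
    return rows


def _bucket(key, name, rating):
    # Build one trimmed terms bucket with its nested top_hits result
    source = {"name": name, "rating": rating, "type": key}
    return {"key": key, "rated": {"hits": {"hits": [{"_source": source}]}}}


# Trimmed copy of the aggregation response shown above
sample = {
    "aggregations": {
        "type": {
            "buckets": [
                _bucket("湘菜", "鲫鱼汤（变态辣）", 5),
                _bucket("川菜", "鱼香肉丝", 2),
                _bucket("粤菜", "广式鲫鱼汤", 5),
                _bucket("西菜", "奶油鲍鱼汤", 2),
            ]
        }
    }
}

for row in top_hit_per_bucket(sample):
    print(row)
```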

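The approach above works, but it has limitations. So how does the new field collapsing handle it? The query is below: just add a collapse parameter and specify which field to deduplicate on, which here is of course the cuisine field type:

GET recipes/type/_search
{
  "query": {
    "match": {
      "name": "鱼"
    }
  },
  "collapse": {
    "field": "type"
  },
  "size": 3,
  "from": 0
}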

The result is just what we wanted, and the hits look reassuringly familiar (the same shape as a regular query result):

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits": [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRJ_OA-dG63TxpW",
        "_score": 0.018980097,
        "_source": { "name": "鲫鱼汤（微辣）", "rating": 4, "type": "湘菜" },
        "fields": { "type": ["湘菜"] }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRk_OA-dG63TxpZ",
        "_score": 0.013813315,
        "_source": { "name": "鱼香肉丝", "rating": 2, "type": "川菜" },
        "fields": { "type": ["川菜"] }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRb_OA-dG63TxpY",
        "_score": 0.0125863515,
        "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
        "fields": { "type": ["粤菜"] }
      }
    ]
  }
}

Now let me try paging. The previous query returned 3 results, so I change from to 3 and run the same query again. The response:

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits": [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoDNlRw_OA-dG63Txpa",
        "_score": 0.012546891,
        "_source": { "name": "奶油鲍鱼汤", "rating": 2, "type": "西菜" },
        "fields": { "type": ["西菜"] }
      }
    ]
  }
}

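Only one result this time: after deduplication there were only 4 results to begin with, so this works as expected. But one dish per cuisine? I'm not happy with that. Give me a few more dishes per cuisine so I have something to choose from. Add the inner_hits parameter to control how many are returned, here 2, sorted by rating as well. The new query looks like this:

GET recipes/type/_search
{
  "query": {
    "match": {
      "name": "鱼"
    }
  },
  "collapse": {
    "field": "type",
    "inner_hits": {
      "name": "top_rated",
      "size": 2,
      "sort": [
        { "rating": "desc" }
      ]
    }
  },
  "sort": [
    { "rating": { "order": "desc" } }
  ],
  "size": 2,
  "from": 0
}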

The results are as follows, perfect:

{
  "took": 1,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 9,
    "max_score": null,
    "hits": [
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYF_OA-dG63Txsd",
        "_score": null,
        "_source": { "name": "鲫鱼汤（变态辣）", "rating": 5, "type": "湘菜" },
        "fields": { "type": ["湘菜"] },
        "sort": [5],
        "inner_hits": {
          "top_rated": {
            "hits": {
              "total": 6,
              "max_score": null,
              "hits": [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYF_OA-dG63Txsd",
                  "_score": null,
                  "_source": { "name": "鲫鱼汤（变态辣）", "rating": 5, "type": "湘菜" },
                  "sort": [5]
                },
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHX7_OA-dG63Txsc",
                  "_score": null,
                  "_source": { "name": "鲫鱼汤（微辣）", "rating": 4, "type": "湘菜" },
                  "sort": [4]
                }
              ]
            }
          }
        }
      },
      {
        "_index": "recipes",
        "_type": "type",
        "_id": "AVoESHYW_OA-dG63Txse",
        "_score": null,
        "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
        "fields": { "type": ["粤菜"] },
        "sort": [5],
        "inner_hits": {
          "top_rated": {
            "hits": {
              "total": 1,
              "max_score": null,
              "hits": [
                {
                  "_index": "recipes",
                  "_type": "type",
                  "_id": "AVoESHYW_OA-dG63Txse",
                  "_score": null,
                  "_source": { "name": "广式鲫鱼汤", "rating": 5, "type": "粤菜" },
                  "sort": [5]
                }
              ]
            }
          }
        }
      }
    ]
  }
}
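
To double-check that response against the sample data, the collapse plus inner_hits behaviour can be reproduced in memory with a little Python. This is only a sketch under stated assumptions: `collapse_with_inner_hits` is a made-up function, every recipe is treated as a match for 鱼, and rating ties are broken by input order, which is a choice of this example and not something Elasticsearch guarantees.

```python
# (name, rating, cuisine) for the nine recipes indexed earlier
recipes = [
    ("清蒸鱼头", 1, "湘菜"), ("剁椒鱼头", 2, "湘菜"), ("红烧鲫鱼", 3, "湘菜"),
    ("鲫鱼汤（辣）", 3, "湘菜"), ("鲫鱼汤（微辣）", 4, "湘菜"), ("鲫鱼汤（变态辣）", 5, "湘菜"),
    ("广式鲫鱼汤", 5, "粤菜"), ("鱼香肉丝", 2, "川菜"), ("奶油鲍鱼汤", 2, "西菜"),
]


def collapse_with_inner_hits(docs, size, inner_size):
    """Group by cuisine, order groups by their best rating, keep top members per group."""
    ranked = sorted(docs, key=lambda d: -d[1])       # rating desc; Python's sort is stable
    groups = {}
    for doc in ranked:
        groups.setdefault(doc[2], []).append(doc)    # dicts keep first-seen (best-first) order
    page = list(groups.items())[:size]
    return [
        {
            "type": cuisine,
            "total": len(members),                   # like inner_hits.<name>.hits.total
            "top_rated": [name for name, _, _ in members[:inner_size]],
        }
        for cuisine, members in page
    ]


page = collapse_with_inner_hits(recipes, size=2, inner_size=2)
for group in page:
    print(group)
```

The two groups, their totals (6 and 1) and the dishes inside each group line up with the Elasticsearch response above.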

That wraps up this introduction to field collapsing.