Full-text search

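The two most important aspects of full-text search are:

  • Relevance: the ability to evaluate how well each result matches the query and to rank the results accordingly. The score can be computed with TF/IDF, geographic proximity, fuzzy similarity, or some other algorithm.
  • Analysis: the process of converting a block of text into distinct, normalized tokens, which are used both to build the inverted index and to query it.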
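To make the link between analysis and the inverted index concrete, here is a minimal sketch. The whitespace `analyze` function is only a stand-in for a real analyzer such as IK, and the sample documents and function names are made up for illustration.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (stand-in for a real analyzer)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}


def search(index, query):
    """Analyze the query with the same analyzer, then look up each token."""
    return {token: index.get(token, []) for token in analyze(query)}


# Hypothetical sample documents
docs = {1: "Badminton Football", 2: "football Swimming", 3: "swimming music"}
index = build_inverted_index(docs)

print(search(index, "Football"))  # {'football': [1, 2]}
```

The same analyzer runs at index time and at query time, which is why "Football" in the query finds documents indexed as "football".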

Preparing the data

PUT /test4
{"settings": {"index": {"number_of_shards": "1","number_of_replicas": "1"}},"mappings": {"properties": {"name": {"type": "text"},"age": {"type": "long"},"mail": {"type": "keyword"},"hobby": {"type": "text","analyzer":"ik_max_word"}}}
}

View the mapping:

GET /test4/_mapping

Insert the data:

POST /test4/_bulk
{"index":{}}
{"name":"张三","age": 20,"mail": "111@qq.com","hobby":"羽毛球、乒乓球、足球"}
{"index":{}}
{"name":"李四","age": 21,"mail": "222@qq.com","hobby":"羽毛球、乒乓球、足球、篮球"}
{"index":{}}
{"name":"王五","age": 22,"mail": "333@qq.com","hobby":"羽毛球、篮球、游泳、听音乐"}
{"index":{}}
{"name":"赵六","age": 23,"mail": "444@qq.com","hobby":"跑步、游泳、篮球"}
{"index":{}}
{"name":"孙七","age": 24,"mail": "555@qq.com","hobby":"听音乐、看电影、羽毛球"}

Single-term search:

POST /test4/_search
{
"query":{
"match":{
"hobby":"音乐"
}
},
"highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 691,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 0.816522,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 0.816522,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、篮球、游泳、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.816522,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听<em>音乐</em>、看电影、羽毛球"]}}]}
}

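How the query is executed:

  1. Check the field type.
    The hobby field is a text field (with the IK analyzer configured), which means the query string itself must also be analyzed.

  2. Analyze the query string.
    The query string "音乐" is passed to the IK analyzer, which outputs a single term, 音乐. Because there is only one term, the match query runs as a single low-level term query.

  3. Find matching documents.
    The term query looks up "音乐" in the inverted index and retrieves the set of documents that contain it. In this example those are documents 3 and 5 (王五 and 孙七).

  4. Score each document.
    The term query computes a relevance score _score for each document by combining term frequency (how often 音乐 appears in that document's hobby field; more often means more relevant), inverse document frequency (how often 音乐 appears in the hobby field across all documents; more often means less weight), and field length (shorter fields score higher). Elasticsearch 5.0 and later use BM25 by default, which is built from the same three factors.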
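The three scoring factors in step 4 can be sketched with the textbook BM25 formula. This is an illustration only: k1 = 1.2 and b = 0.75 are the commonly cited defaults, the function name is made up, and the exact numbers Elasticsearch returns depend on its version and on how the analyzer tokenizes each field.

```python
import math


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # Inverse document frequency: terms found in fewer documents weigh more.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Term frequency, normalized by field length: shorter fields score higher.
    tf = freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf


# Same term and frequency, in a short field and in a long field
short_field = bm25_term_score(freq=1, doc_len=5, avg_doc_len=10, doc_count=5, docs_with_term=2)
long_field = bm25_term_score(freq=1, doc_len=20, avg_doc_len=10, doc_count=5, docs_with_term=2)
print(short_field > long_field)  # True

# Same document, but the term is rarer across the index
rare_term = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, doc_count=5, docs_with_term=1)
common_term = bm25_term_score(freq=1, doc_len=10, avg_doc_len=10, doc_count=5, docs_with_term=4)
print(rare_term > common_term)  # True
```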

Multi-term search

POST /test4/_search
{
"query":{
"match":{
"hobby":"音乐 篮球"
}
},
"highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 5,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 4,"relation" : "eq"},"max_score" : 1.319227,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.319227,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.816522,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听<em>音乐</em>、看电影、羽毛球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 0.6987338,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、游泳、<em>篮球</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.502705,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["羽毛球、乒乓球、足球、<em>篮球</em>"]}}]}
}

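Every document containing "音乐" or "篮球" was returned.
That is not what we wanted, though: we were looking for users whose hobbies include both "音乐" and "篮球", and the results clearly use OR semantics (the default operator of a match query).
Elasticsearch lets you specify the logical relationship between the terms: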

POST /test4/_search
{
"query":{
"match":{
"hobby":{"query": "音乐 篮球","operator": "and"
}
}
},
"highlight": {"fields": {"hobby": {}}}
}

Result: only 王五 is returned, as expected.

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 1,"relation" : "eq"},"max_score" : 1.319227,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.319227,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、游泳、听<em>音乐</em>"]}}]}
}
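The difference between the two operators can be sketched in a few lines of Python. The token sets below are hand-written assumptions about what the analyzer produces for each hobby field (only the tokens relevant here are listed), and `match` is a hypothetical helper, not an Elasticsearch API.

```python
def match(query_terms, doc_terms, operator="or"):
    """Return True if the document satisfies the query under the given operator."""
    hits = [term in doc_terms for term in query_terms]
    return all(hits) if operator == "and" else any(hits)


# Assumed tokens per document (hand-written for illustration)
docs = {
    "张三": {"羽毛球", "乒乓球", "足球"},
    "李四": {"羽毛球", "乒乓球", "足球", "篮球"},
    "王五": {"羽毛球", "篮球", "游泳", "音乐"},
    "赵六": {"跑步", "游泳", "篮球"},
    "孙七": {"音乐", "电影", "羽毛球"},
}
query = ["音乐", "篮球"]

or_hits = [name for name, terms in docs.items() if match(query, terms, "or")]
and_hits = [name for name, terms in docs.items() if match(query, terms, "and")]
print(or_hits)   # ['李四', '王五', '赵六', '孙七']
print(and_hits)  # ['王五']
```

The counts agree with the two results above: 4 hits with the default OR, 1 hit with AND.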

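OR and AND are the two extremes. In practice you usually want something in between: a document should be returned once it matches a sufficient share of the query terms. Elasticsearch supports this through minimum_should_match, which sets the required share as a percentage such as 70%. The query below uses 80%: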

POST /test4/_search
{
"query":{
"match":{
"hobby":{"query": "游泳 羽毛球","minimum_should_match": "80%"
}
}
},
"highlight": {"fields": {"hobby": {}}}
}

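Result: with minimum_should_match set to 80%, 4 documents are returned. The threshold applies to the analyzed query terms, and the computed count is rounded down. The highlights (for example 乒乓<em>球</em>) suggest that ik_max_word also emits 球 as a sub-token of 羽毛球, so every document containing 羽毛球 matches several query terms and clears the threshold, while 赵六, who matches only 游泳, does not.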

{"took" : 4,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 4,"relation" : "eq"},"max_score" : 1.6214579,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.6214579,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["<em>羽毛球</em>、篮球、<em>游泳</em>、听音乐"]}},{"_index" : "test4","_type" : "_doc","_id" : "gD_2yXcBhFgDDNfpe9bx","_score" : 0.9608413,"_source" : {"name" : "张三","age" : 20,"mail" : "111@qq.com","hobby" : "羽毛球、乒乓球、足球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.9134824,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.80493593,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听音乐、看电影、<em>羽毛球</em>"]}}]}
}

Now try 40%:

POST /test4/_search
{
"query":{
"match":{
"hobby":{"query": "游泳 羽毛球","minimum_should_match": "40%"
}
}
},
"highlight": {"fields": {"hobby": {}}}
}

Result: with minimum_should_match set to 40%, all 5 documents are returned.

{"took" : 6,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 5,"relation" : "eq"},"max_score" : 1.6214579,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 1.6214579,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["<em>羽毛球</em>、篮球、<em>游泳</em>、听音乐"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.1349231,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、篮球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gD_2yXcBhFgDDNfpe9bx","_score" : 0.9608413,"_source" : {"name" : "张三","age" : 20,"mail" : "111@qq.com","hobby" : "羽毛球、乒乓球、足球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.9134824,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["<em>羽毛球</em>、乒乓<em>球</em>、足球、篮球"]}},{"_index" : "test4","_type" : "_doc","_id" : "hD_2yXcBhFgDDNfpe9bx","_score" : 0.80493593,"_source" : {"name" : "孙七","age" : 24,"mail" : "555@qq.com","hobby" : "听音乐、看电影、羽毛球"},"highlight" : {"hobby" : ["听音乐、看电影、<em>羽毛球</em>"]}}]}
}

Conclusion: the right threshold depends on the actual requirements and has to be found by repeated testing.
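The hit counts for 80% and 40% can be reproduced with a small sketch. It assumes that a percentage is converted to a term count by rounding down, and that the query "游泳 羽毛球" is analyzed into the four tokens 游泳, 羽毛球, 羽毛 and 球. That token list is inferred from the highlights in the results, not taken from the analyzer itself, and the helper names are hypothetical.

```python
def required_matches(term_count, percent):
    """Number of query terms that must match; the percentage is rounded down."""
    # Assumption: at least one term always has to match.
    return max(1, term_count * percent // 100)


def search(query_terms, docs, percent):
    """Return the names of documents that match enough query terms."""
    need = required_matches(len(query_terms), percent)
    return [name for name, tokens in docs.items()
            if sum(term in tokens for term in query_terms) >= need]


# Assumed tokens, inferred from the highlights (e.g. 乒乓<em>球</em>)
ball = {"羽毛球", "羽毛", "球"}
docs = {
    "张三": ball | {"乒乓球", "乒乓", "足球"},
    "李四": ball | {"乒乓球", "乒乓", "足球", "篮球"},
    "王五": ball | {"篮球", "游泳", "音乐"},
    "赵六": {"跑步", "游泳", "篮球"},
    "孙七": ball | {"音乐", "电影"},
}
query_terms = ["游泳", "羽毛球", "羽毛", "球"]

print(required_matches(4, 80), search(query_terms, docs, 80))
# 3 ['张三', '李四', '王五', '孙七']
print(required_matches(4, 40), search(query_terms, docs, 40))
# 1 ['张三', '李四', '王五', '赵六', '孙七']
```

Under these assumptions 80% of 4 terms rounds down to 3, which excludes 赵六 (1 matching term), and 40% rounds down to 1, which admits every document.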

Combined search

Full-text queries can also be combined with the bool query covered in the section on filters. For example:

POST /test4/_search
{
"query":{
"bool":{
"must":{
"match":{
"hobby":"篮球"
}
},
"must_not":{
"match":{
"hobby":"音乐"
}
},
"should":[
{
"match": {
"hobby":"游泳"
}
}
]
}
},
"highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 4,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 1.8336569,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.8336569,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、<em>篮球</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gT_2yXcBhFgDDNfpe9bx","_score" : 0.502705,"_source" : {"name" : "李四","age" : 21,"mail" : "222@qq.com","hobby" : "羽毛球、乒乓球、足球、篮球"},"highlight" : {"hobby" : ["羽毛球、乒乓球、足球、<em>篮球</em>"]}}]}
}

The query above means:
Results must contain 篮球 and must not contain 音乐; a document that also contains 游泳 gets a higher relevance score.

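How the score is computed:
The bool query computes a relevance score _score for each document by adding up the _score of every matching must and should clause. In the result above, 赵六 scores 1.8336569, which is 0.6987338 for 篮球 plus 1.1349231 for 游泳 (both values appear in the earlier results), while 李四 matches only 篮球 and keeps 0.502705. Older descriptions add a final step that divides the sum by the total number of must and should clauses; the scores shown here reflect no such division.
A must_not clause does not affect the score; it only excludes documents.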
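The scores in the result above are consistent with a plain sum of the matching clauses, which can be checked with a short sketch. The per-clause scores are taken from the earlier single-clause results in this article; `bool_score` is a hypothetical helper.

```python
import math


def bool_score(clause_scores):
    """Sum the scores of matching must/should clauses; None marks a clause that did not match."""
    return sum(score for score in clause_scores if score is not None)


# 赵六: must 篮球 (0.6987338) + should 游泳 (1.1349231)
zhao_liu = bool_score([0.6987338, 1.1349231])
# 李四: must 篮球 (0.502705), should 游泳 did not match
li_si = bool_score([0.502705, None])

print(math.isclose(zhao_liu, 1.8336569, abs_tol=1e-6))  # True
print(math.isclose(li_si, 0.502705, abs_tol=1e-6))      # True
```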

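By default, should clauses are optional. If the bool query has no must clause, however, at least one should clause has to match. This can be controlled with the minimum_should_match parameter, which accepts either a number or a percentage.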

Example:

POST /test4/_search
{
"query":{
"bool":{
"should":[
{
"match": {
"hobby":"游泳"
}
},
{
"match": {
"hobby":"篮球"
}
},
{
"match": {
"hobby":"音乐"
}
}
],
"minimum_should_match":2
}
},
"highlight": {"fields": {"hobby": {}}}
}

Result:

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 2.1357489,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 2.1357489,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.8336569,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、<em>篮球</em>"]}}]}
}

With minimum_should_match set to 2, a document must match at least 2 of the 3 should clauses.

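Boosting

Sometimes we want certain terms to carry more weight so that they influence a document's score. For example:
Search for "游泳篮球"; the clause matching "音乐" is boosted by 10, and the clause matching "跑步" is boosted by 2.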

POST /test4/_search
{"query": {"bool": {"must": {"match": {"hobby": {"query": "游泳篮球","operator": "and"}}},"should": [{"match": {"hobby": {"query": "音乐","boost": 10}}},{"match": {"hobby": {"query": "跑步","boost": 2}}}]}},"highlight": {"fields": {"hobby": {}}}
}

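Result (王五 ranks first; note that the scores and highlights below are identical to the output of the previous should query and 跑步 is not highlighted, so this output does not appear to reflect the boosts):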

{"took" : 3,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 2.1357489,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 2.1357489,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 1.8336569,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["跑步、<em>游泳</em>、<em>篮球</em>"]}}]}
}

Without the boosts, the same query returns:

{"took" : 2,"timed_out" : false,"_shards" : {"total" : 1,"successful" : 1,"skipped" : 0,"failed" : 0},"hits" : {"total" : {"value" : 2,"relation" : "eq"},"max_score" : 3.630794,"hits" : [{"_index" : "test4","_type" : "_doc","_id" : "gz_2yXcBhFgDDNfpe9bx","_score" : 3.630794,"_source" : {"name" : "赵六","age" : 23,"mail" : "444@qq.com","hobby" : "跑步、游泳、篮球"},"highlight" : {"hobby" : ["<em>跑步</em>、<em>游泳</em>、<em>篮球</em>"]}},{"_index" : "test4","_type" : "_doc","_id" : "gj_2yXcBhFgDDNfpe9bx","_score" : 2.1357489,"_source" : {"name" : "王五","age" : 22,"mail" : "333@qq.com","hobby" : "羽毛球、篮球、游泳、听音乐"},"highlight" : {"hobby" : ["羽毛球、<em>篮球</em>、<em>游泳</em>、听<em>音乐</em>"]}}]}
}
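As a rough cross-check on how boost changes the ranking, the per-clause scores can be recovered from the results shown in this article (by subtracting one result from another) and recombined by hand. The sketch below assumes that boost simply multiplies a clause's score; the resulting totals are estimates, not Elasticsearch output.

```python
import math

# Per-clause scores recovered from the results in this article
clause_scores = {
    "王五": {"游泳": 0.8165219, "篮球": 0.502705, "音乐": 0.816522},
    "赵六": {"游泳": 1.1349231, "篮球": 0.6987338, "跑步": 1.7971371},
}
boosts = {"音乐": 10, "跑步": 2}


def total(scores, boosts):
    """Sum clause scores, multiplying each by its boost (default 1)."""
    return sum(score * boosts.get(term, 1) for term, score in scores.items())


def ranking(boosts):
    """Document names ordered by estimated score, highest first."""
    return sorted(clause_scores, key=lambda name: total(clause_scores[name], boosts), reverse=True)


print(ranking({}))      # ['赵六', '王五']
print(ranking(boosts))  # ['王五', '赵六']
```

Without boosts the sums reproduce the unboosted scores above (3.630794 and 2.1357489). With boosts, the ×10 on 音乐 outweighs the ×2 on 跑步, so 王五 moves ahead of 赵六.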
