8.2.2 Elasticsearch built-in analyzers: whitespace and stop

Elasticsearch ships with eight built-in analyzers, each suited to different scenarios.

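1. whitespace analyzer

1.1 whitespace analyzer: type and tokenization result

The whitespace analyzer splits text into tokens wherever it encounters a whitespace character and applies no further processing (no lowercasing, no punctuation removal). It is rarely used for Chinese text, because Chinese words are not separated by spaces.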

POST _analyze
{
  "analyzer": "whitespace",
  "text": "How does this work?"
}

// Response
{
  "tokens": [
    {"token": "How", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0},
    {"token": "does", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1},
    {"token": "this", "start_offset": 9, "end_offset": 13, "type": "word", "position": 2},
    {"token": "work?", "start_offset": 14, "end_offset": 19, "type": "word", "position": 3}
  ]
}

After analysis, the sentence above yields the following tokens:
[How, does, this, work?]
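
To make the offsets and positions in the response easier to read, here is a minimal Python sketch that imitates the whitespace analyzer outside Elasticsearch. The function name `whitespace_analyze` is made up for illustration, and the sketch ignores details of the real tokenizer such as its maximum token length.

```python
import re


def whitespace_analyze(text):
    """Imitate the token list that _analyze returns for the whitespace analyzer."""
    tokens = []
    # Each run of non-whitespace characters becomes one token, unchanged.
    for position, match in enumerate(re.finditer(r"\S+", text)):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": position,
        })
    return tokens


result = whitespace_analyze("How does this work?")
print([t["token"] for t in result])  # ['How', 'does', 'this', 'work?']
```

Note that the question mark stays attached to the last token and the capital H is preserved, because nothing other than splitting happens.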

1.2 Components of the whitespace analyzer

No.	Component	Description
1	Tokenizer	whitespace tokenizer

To build a custom analyzer that behaves like whitespace, define an analyzer whose tokenizer is whitespace, then add char filters and token filters as needed, as in the following example:

// Define the custom analyzer
PUT custom_rebuild_whitespace_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_whitespace_analyzer": {"tokenizer": "whitespace", "filter": []}
      }
    }
  }
}

// The analyzer must be named explicitly, otherwise standard is used; the result is the same as above
POST custom_rebuild_whitespace_analyzer_index/_analyze
{
  "analyzer": "rebuild_whitespace_analyzer",
  "text": "How does this work?"
}
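
The way a custom analyzer chains char filters, one tokenizer, and token filters can be pictured with a small Python sketch. This is only a conceptual model, not how Elasticsearch is implemented; `build_analyzer`, `whitespace_tokenizer`, and `lowercase_filter` are hypothetical names introduced here.

```python
def build_analyzer(tokenizer, char_filters=(), token_filters=()):
    """Compose an analyzer: char filters -> tokenizer -> token filters."""
    def analyze(text):
        for char_filter in char_filters:
            text = char_filter(text)          # rewrite the raw string
        tokens = tokenizer(text)              # split the string into tokens
        for token_filter in token_filters:
            tokens = token_filter(tokens)     # transform or drop tokens
        return tokens
    return analyze


def whitespace_tokenizer(text):
    # str.split() with no arguments splits on runs of whitespace.
    return text.split()


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


# Same shape as the index setting above: whitespace tokenizer, empty filter list.
rebuilt = build_analyzer(whitespace_tokenizer)
# Adding one token filter changes the output without touching the tokenizer.
with_lowercase = build_analyzer(whitespace_tokenizer, token_filters=[lowercase_filter])

print(rebuilt("How does this work?"))         # ['How', 'does', 'this', 'work?']
print(with_lowercase("How does this work?"))  # ['how', 'does', 'this', 'work?']
```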

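2. stop analyzer

2.1 stop analyzer: type and tokenization result

The stop analyzer behaves like the simple analyzer (it splits on non-letter characters and lowercases every token) and additionally removes stop words. By default it uses the _english_ stop word list.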

POST _analyze
{
  "analyzer": "stop",
  "text": "How does this work?"
}

After analysis, the sentence above yields the following tokens (lowercased, with the stop word "this" removed):
[how, does, work]
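
The behavior of the stop analyzer can be approximated in a few lines of Python. This is a rough sketch under two assumptions: the word set below is what I understand Lucene's default English stop list to contain (check it against your Elasticsearch version), and `stop_analyze` is a made-up helper, not an Elasticsearch API.

```python
import re

# Assumption: mirrors the default _english_ stop word list.
ENGLISH_STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for",
    "if", "in", "into", "is", "it", "no", "not", "of", "on", "or",
    "such", "that", "the", "their", "then", "there", "these", "they",
    "this", "to", "was", "will", "with",
})


def stop_analyze(text, stopwords=ENGLISH_STOP_WORDS):
    """Lowercase tokenizer followed by a stop word filter."""
    # Split on anything that is not a letter, then lowercase each token.
    words = [word.lower() for word in re.findall(r"[^\W\d_]+", text)]
    # Drop every token that appears in the stop word set.
    return [word for word in words if word not in stopwords]


print(stop_analyze("How does this work?"))  # ['how', 'does', 'work']
```

Passing a different set as `stopwords` changes which tokens are dropped; for example `stop_analyze("How does this work?", stopwords={"the", "work"})` keeps "this" and removes "work".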

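2.2 Configurable parameters of the stop analyzer

No.	Parameter	Description
1	stopwords	A predefined stop word list such as _english_, or an array of stop words. Defaults to _english_.
2	stopwords_path	Path to a file that contains stop words.

Example of configuring custom stopwords: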

// Configure the analyzer parameters
PUT custom_rebuild_stop_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_stop_analyzer": {"type": "stop", "stopwords": ["the", "work"]}
      }
    }
  }
}

// Test the analyzer
POST custom_rebuild_stop_analyzer_index/_analyze
{
  "analyzer": "rebuild_stop_analyzer",
  "text": "How does this work?"
}

// Response
{
  "tokens": [
    {"token": "how", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0},
    {"token": "does", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1},
    {"token": "this", "start_offset": 9, "end_offset": 13, "type": "word", "position": 2}
  ]
}

After analysis, the sentence above yields the following tokens ("work" is now a stop word, while "this" is kept):
[how, does, this]

2.3 Components of the stop analyzer

No.	Component	Description
1	Tokenizer	lowercase tokenizer
2	Token filters	stop token filter

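To build a custom analyzer that behaves like stop, define an analyzer that combines the lowercase tokenizer with a stop token filter, then add further char filters and token filters as needed, as in the following example: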

// Define the custom analyzer
PUT custom_stop_analyzer_conf_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_stop_analyzer": {"tokenizer": "lowercase", "filter": ["english_stop"]}
      },
      "filter": {
        "english_stop": {"type": "stop", "stopwords": "_english_"}
      }
    }
  }
}

// Test the analyzer
POST custom_stop_analyzer_conf_index/_analyze
{
  "analyzer": "rebuild_stop_analyzer",
  "text": "How does this work?"
}

// Response
{
  "tokens": [
    {"token": "how", "start_offset": 0, "end_offset": 3, "type": "word", "position": 0},
    {"token": "does", "start_offset": 4, "end_offset": 8, "type": "word", "position": 1},
    {"token": "work", "start_offset": 14, "end_offset": 18, "type": "word", "position": 3}
  ]
}

After analysis, the sentence above yields the following tokens, the same as the built-in stop analyzer:
[how, does, work]
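
In the response above, position jumps from 1 to 3: the removed stop word still consumes its position, which matters for phrase queries. The following Python sketch illustrates that bookkeeping. It is an illustration only; `stop_filter_with_positions` is a hypothetical helper, and the stop set is reduced to the single word "this" for brevity.

```python
import re


def stop_filter_with_positions(text, stopwords):
    """Lowercase-tokenize text and drop stop words, keeping original positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[^\W\d_]+", text)):
        word = match.group().lower()
        if word in stopwords:
            # The stop word is skipped, but its position number is not reused,
            # so a gap is left in the sequence.
            continue
        tokens.append({
            "token": word,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


kept = stop_filter_with_positions("How does this work?", {"this"})
print([(t["token"], t["position"]) for t in kept])
# [('how', 0), ('does', 1), ('work', 3)]
```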