NLP Basics: A Summary of Using Python's jieba for Word Segmentation (1)

import jieba

seg_listDef = jieba.cut("我在学习自然语言处理")
seg_listAll = jieba.cut("我在学习自然语言处理", cut_all=True)
print("Default mode:"+" ".join(seg_listDef))
print("All mode:"+" ".join(seg_listAll))

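jieba's cut function performs word segmentation. It has three commonly used parameters: cut(sentence, cut_all=False, HMM=True). The first parameter is the string to segment. The second selects the segmentation mode: the default, False, is accurate mode, which returns the single most likely segmentation; True is full mode, which outputs every dictionary word that can be found in the sentence. The third parameter controls whether the Hidden Markov Model (HMM) is used; it will be covered in a later article.

The code above prints the following (the log lines come from jieba loading its dictionary):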

Building prefix dict from the default dictionary ...
Loading model from cache
Default mode:我 在 学习 自然语言 处理
All mode:我 在 学习 自然 自然语言 语言 处理
Loading model cost 0.578 seconds.
Prefix dict has been built successfully.
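
To make full mode less abstract, here is a minimal sketch of the idea: scan the sentence and emit every dictionary word found at each position, falling back to a single character where nothing matches. The TOY_DICT dictionary and the full_mode helper are made up for illustration; jieba's real implementation works on its own prefix dictionary and is not reproduced here.

```python
# Toy dictionary (an assumption for this sketch; jieba ships a far larger one).
TOY_DICT = {"学习", "自然", "自然语言", "语言", "处理"}

def full_mode(sentence, dictionary):
    """Emit every dictionary word found in the sentence, left to right."""
    words = []
    covered = 0  # index just past the furthest character inside an emitted word
    for start in range(len(sentence)):
        # All dictionary words (2+ characters) beginning at this position.
        matches = [sentence[start:end]
                   for end in range(start + 2, len(sentence) + 1)
                   if sentence[start:end] in dictionary]
        if matches:
            words.extend(matches)
            covered = max(covered, start + len(matches[-1]))
        elif start >= covered:
            # No word starts here and no earlier word covers this character.
            words.append(sentence[start])
            covered = start + 1
    return words

result = full_mode("我在学习自然语言处理", TOY_DICT)
print(" ".join(result))
```

With this toy dictionary the sketch yields the same tokens as the "All mode" line in the output above.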

import jieba

seg_list = jieba.cut_for_search("我在学习自然语言处理")
print(" ".join(seg_list))

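cut_for_search(sentence, HMM=True) takes two parameters: the string to segment and the HMM switch. It is the search engine mode, which segments more finely by splitting long words again on top of accurate mode. Both cut and cut_for_search return generators, so the results can also be printed one at a time with next():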
import jieba
import sys

seg_list = jieba.cut_for_search("我在学习自然语言处理")
while True:
    try:
        print(next(seg_list))
    except StopIteration:
        sys.exit()

jieba.lcut and jieba.lcut_for_search, by contrast, return lists, which can be printed directly.
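
One practical consequence of the generator/list difference is that a generator can be consumed only once. The sketch below shows this with fake_cut, a hypothetical stand-in that yields whitespace-separated tokens; it is not jieba's API and only imitates the "returns a generator" behaviour.

```python
def fake_cut(sentence):
    """Hypothetical stand-in for a generator-returning segmenter."""
    for token in sentence.split():
        yield token

tokens = fake_cut("我 在 学习 自然语言 处理")
first_pass = list(tokens)   # consumes the generator
second_pass = list(tokens)  # nothing left to yield

print(first_pass)
print(second_pass)
```

If the tokens are needed more than once, converting to a list first (or using the list-returning functions) avoids the empty second pass.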

import jieba

seg_list = jieba.lcut_for_search("如果放到旧字典中将会出错")
print(seg_list)

Building prefix dict from the default dictionary ...
Loading model from cache
['如果', '放到', '旧', '字典', '中将', '会', '出错']
Loading model cost 0.590 seconds.
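Prefix dict has been built successfully.

Here the segmenter confuses the words: "中" and "将" are actually two separate words, but they were merged into "中将". To correct this, use the suggest_freq(segment, tune=False) method.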
Parameter:
- segment : The segments that the word is expected to be cut into. If the word should be treated as a whole, use a str.
- tune : If True, tune the word frequency.
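When tune is set to True, jieba adjusts the word frequency in its dictionary so that the suggested segmentation is produced. Passing a tuple, as below, lowers the frequency of the joined word "中将" enough for it to be split.

import jieba

jieba.suggest_freq(('中', '将'), True)
seg_list = jieba.lcut_for_search("如果放到旧字典中将会出错")
print(seg_list)

Building prefix dict from the default dictionary ...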
Loading model from cache
Loading model cost 0.569 seconds.
Prefix dict has been built successfully.
['如果', '放到', '旧', '字典', '中', '将', '会', '出错']
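
Why a frequency change flips the result can be sketched with a small unigram model: pick the segmentation whose word frequencies give the highest probability. The frequencies in toy_freq and the segment helper are invented for illustration, and the adjustment step only imitates the idea of lowering the joined word's count; none of this is jieba's actual code or data.

```python
import math

def segment(sentence, freq):
    """Return the max-probability segmentation under a unigram model."""
    total = sum(freq.values())
    n = len(sentence)
    # best[i] = (log probability of the best split of sentence[i:], end of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, n + 1):
            word = sentence[i:j]
            # Unknown single characters count as 1; unknown longer strings are skipped.
            count = freq.get(word, 1 if j == i + 1 else 0)
            if count > 0:
                candidates.append((math.log(count / total) + best[j][0], j))
        best[i] = max(candidates)
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

# Invented counts, chosen so that "中将" initially wins over "中" + "将".
toy_freq = {"字典": 50, "中": 300, "将": 200, "中将": 100, "会": 400, "出错": 30}
before = segment("字典中将会出错", toy_freq)

# Lower the joined word's count to what independent "中" and "将" would predict.
total = sum(toy_freq.values())
toy_freq["中将"] = min(int(toy_freq["中"] / total * toy_freq["将"] / total * total),
                      toy_freq["中将"])
after = segment("字典中将会出错", toy_freq)

print(before)
print(after)
```

In this toy setup the first call keeps "中将" together and the second splits it, mirroring the before/after outputs shown above.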

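jieba.analyse.extract_tags(sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False) extracts keywords. topK sets how many keywords to return: the topK words with the highest TF-IDF weight, which is not the same as the most frequent words. allowPOS restricts the result to the given parts of speech; for example 'ns', 'n', 'vn', 'v' and 'nr' stand for place name, noun, verbal noun, verb and person name respectively (there are many more tags, which are not listed here). withWeight controls whether each word's weight is returned; if True, the result is a list of (word, weight) tuples.

import jieba.analyse as analyse

lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines,topK=20,withWeight=False,allowPOS=())
print(seg_list)

Building prefix dict from the default dictionary ...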
Loading model from cache
Loading model cost 0.657 seconds.
Prefix dict has been built successfully.
['笹垣', '桐原', '雪穗', '今枝', '友彦', '利子', '什么', '没有', '典子', '知道', '男子', '唐泽雪穗', '警察', '菊池', '筱冢', '一成', '这么', '松浦', '不是', '千都']

import jieba.analyse as analyse

lines = open("白夜行.txt").read()
seg_list = analyse.extract_tags(lines,topK=20,withWeight=True,allowPOS=())
print(seg_list)

Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.759 seconds.
Prefix dict has been built successfully.
[('笹垣', 0.0822914672578027),
('桐原', 0.07917333917389914),
('雪穗', 0.07619078187625225),
('今枝', 0.060328999884221086),
('友彦', 0.05992228752545106),
('利子', 0.041188814819915855),
('什么', 0.028355297812861044),
('没有', 0.0282050104733996),
('典子', 0.025758449388768555),
('知道', 0.021181785348317664),
('男子', 0.021159435305523867),
('唐泽雪穗', 0.01992890557973146),
('警察', 0.018198774503613253),
('菊池', 0.01816537645564464),
('筱冢', 0.01803091457213799),
('一成', 0.01796642218172475),
('这么', 0.016991657412780303),
('松浦', 0.016132923564544516),
('不是', 0.015944699687736586),
('千都', 0.015726211205774485)]
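
The weights above come from TF-IDF scoring. As a rough sketch of that idea (not jieba's code), each word's count is multiplied by an inverse document frequency and divided by the total number of counted words. The token list, the idf values and the tfidf_keywords helper are all hypothetical.

```python
from collections import Counter

def tfidf_keywords(tokens, idf, default_idf, top_k=3):
    """Rank words by term frequency times inverse document frequency."""
    counts = Counter(t for t in tokens if len(t) >= 2)  # skip single characters
    total = sum(counts.values())
    weights = {w: c * idf.get(w, default_idf) / total for w, c in counts.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical pre-segmented text and IDF table.
tokens = ["警察", "什么", "桐原", "什么", "调查", "桐原", "什么", "桐原"]
idf = {"什么": 3.0, "警察": 6.0, "桐原": 10.0, "调查": 5.0}

keywords = tfidf_keywords(tokens, idf, default_idf=5.0)
print(keywords)
```

Even with a low IDF, a very frequent common word such as "什么" can still rank highly, which matches its appearance in the list above.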

We can also supply stop words so that the extracted keywords exclude uninformative words such as "的", "了", "什么", "这么", "是" and "不是".

import jieba.analyse as analyse

lines = open("白夜行.txt").read()
analyse.set_stop_words("stopwords.txt")
seg_list = analyse.extract_tags(lines,topK=20,withWeight=False,allowPOS=())
print(seg_list)

Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.743 seconds.
Prefix dict has been built successfully.
['笹垣', '桐原', '雪穗', '今枝', '友彦', '利子',
'典子', '唐泽雪穗', '警察', '菊池', '筱冢', '一成',
'松浦', '千都', '高宫', '奈美江', '正晴', '美佳',
'雄一', '康晴']

set_stop_words(stop_words_path) takes the path to the stop word file. The final result clearly no longer contains non-key words such as "什么", "没有", "知道" and "不是".
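
For reference, a stop word file is plain text with one word per line. The sketch below writes a small temporary file in that format and filters a keyword list with it. The helper names and file contents are assumptions for illustration. jieba itself skips stop words while counting, so lower-ranked words move up into the top 20, whereas this sketch simply filters a list that is already ranked.

```python
import os
import tempfile

def load_stop_words(path):
    """Read one stop word per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def drop_stop_words(keywords, stop_words):
    """Keep only the keywords that are not stop words, preserving order."""
    return [w for w in keywords if w not in stop_words]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("什么\n没有\n知道\n不是\n")
    stop_words = load_stop_words(path)

ranked = ["笹垣", "桐原", "什么", "没有", "典子", "知道"]
filtered = drop_stop_words(ranked, stop_words)
print(filtered)
```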

jieba.analyse.textrank(sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False)
takes the same parameters as extract_tags above. The basic idea of the TextRank algorithm is:
(1) Segment the text into words.
(2) Build an undirected weighted graph from the relationships between words, namely co-occurrence within a fixed-size window (a graph theory concept).
(3) Compute the PageRank of each node in the graph.

import jieba.analyse as analyse

lines = open("白夜行.txt").read()
analyse.set_stop_words("stopwords.txt")
seg_list = analyse.textrank(lines,topK=20,withWeight=True,allowPOS=('n','ns','vn','v'))
for each in seg_list:
    print("Key words:"+each[0]+"\t"+"weights:"+str(each[1]))

Building prefix dict from the default dictionary ...
Loading model from cache
Loading model cost 0.586 seconds.
Prefix dict has been built successfully.
Key words:桐原    weights:1.0
Key words:可能    weights:0.42444673809670785
Key words:利子    weights:0.36847047979727526
Key words:应该    weights:0.3365721550231226
Key words:时候    weights:0.3278732402186668
Key words:警察    weights:0.3120355440367427
Key words:东西    weights:0.3068897401798211
Key words:开始    weights:0.3015519959941887
Key words:调查    weights:0.29838940592155194
Key words:典子    weights:0.29588671242198666
Key words:公司    weights:0.2939855813808517
Key words:电话    weights:0.27845709742538927
Key words:不会    weights:0.27350278630982
Key words:看到    weights:0.27028179300492206
Key words:发现    weights:0.2681890733271942
Key words:房间    weights:0.2672507877051219
Key words:工作    weights:0.2661521099652389
Key words:声音    weights:0.24633054809460497
Key words:露出    weights:0.22866657032979934
Key words:认为    weights:0.21634682021764443
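
The three TextRank steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not jieba's implementation: the token list is hypothetical and assumed to be already segmented, the window size and damping factor are arbitrary choices, and scores are normalized by dividing by the maximum so the top word gets 1.0.

```python
from collections import defaultdict

def textrank(tokens, window=5, damping=0.85, iterations=50):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    # Step 2: edge weight = number of co-occurrences within the window.
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    # Step 3: iterate the weighted PageRank update.
    scores = {w: 1.0 / len(graph) for w in graph}
    for _ in range(iterations):
        new_scores = {}
        for w in graph:
            incoming = sum(graph[n][w] / sum(graph[n].values()) * scores[n]
                           for n in graph[w])
            new_scores[w] = (1 - damping) + damping * incoming
        scores = new_scores
    top = max(scores.values())
    return sorted(((w, s / top) for w, s in scores.items()),
                  key=lambda kv: kv[1], reverse=True)

# Step 1 is assumed done: a hypothetical, already segmented token list.
tokens = ["警察", "调查", "桐原", "公司", "桐原", "电话", "桐原", "房间"]
ranked = textrank(tokens, window=2)
for word, weight in ranked:
    print("Key words:" + word + "\t" + "weights:" + str(weight))
```

In this toy input "桐原" is connected to the most neighbours, so it comes out on top with weight 1.0, much like the first line of the output above.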
