Word-Frequency Analysis of Shakespeare's Collected Works in Python

Conclusion: reading ten thousand books is no match for one palm-sized dictionary!

The code is simple:

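import collections
import re
import webbrowser

# This import style matches the pyecharts 0.5.x API; pyecharts 1.0+ moved the chart classes to pyecharts.charts
from pyecharts import Pie, Bar, WordCloud, Page

# Read the text of Shakespeare's works
with open("shakespeare0.txt", "r", encoding="utf-8") as f:
    t = f.read()

# Expand contracted word endings back into full words
# Match 's after a pronoun; re.I makes the match case-insensitive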
pat_is = re.compile(r"(it|he|she|that|this|there|here)('s)", re.I)
# Match 's after any letter (possessives)
pat_s = re.compile(r"(?<=[a-zA-Z])'s")
# Match the bare ' (or 's) after a word ending in s
pat_s2 = re.compile(r"(?<=s)'s?")
# Match the contraction of "not"
pat_not = re.compile(r"(?<=[a-zA-Z])n't")
# Match the contraction of "would"
# Caveat: in Shakespeare's text 'd is often an elided -ed (e.g. banish'd), so this rule inflates the count of "would"
pat_would = re.compile(r"(?<=[a-zA-Z])'d")
# Match the contraction of "will"
pat_will = re.compile(r"(?<=[a-zA-Z])'ll")
# Match the contraction of "am" ([Ii] rather than [I|i], which also matched a literal "|")
pat_am = re.compile(r"(?<=[Ii])'m")
# Match the contraction of "are"
pat_are = re.compile(r"(?<=[a-zA-Z])'re")
# Match the contraction of "have"
pat_ve = re.compile(r"(?<=[a-zA-Z])'ve")

t = pat_is.sub(r"\1 is", t)
t = pat_s.sub("", t)
t = pat_s2.sub("", t)
t = pat_not.sub(" not", t)
t = pat_would.sub(" would", t)
t = pat_will.sub(" will", t)
t = pat_am.sub(" am", t)
t = pat_are.sub(" are", t)
t = pat_ve.sub(" have", t)
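t = t.replace("'", " ")

# Convert all words to lowercase
t = t.lower()
# Replace every run of non-word characters with a single space
pattern = re.compile(r"\W+")
t = pattern.sub(" ", t)

# Split into words on whitespace; split() with no argument also drops
# the empty strings that split(" ") would leave at the start and end
ts = t.split()

# Total number of words
word_count = len(ts)
print("total words:", word_count)

# Count word frequencies with Counter from the built-in collections module
tc = collections.Counter(ts)
# Number of distinct words
word_count_unique = len(tc)
# Share of distinct words, in percent
word_unique_percent = word_count_unique / word_count * 100
other_percent = 100 - word_unique_percent
other_word_count = word_count - word_count_unique
print("unique word count:", word_count_unique)

# The n most frequent words
most_common_words = tc.most_common(100)
print("most common words:", most_common_words)

# Data visualization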
page = Page()
html_filename = "shakespeare_word_count.html"
# Bar chart
bar = Bar("Word counts", "Shakespeare's works")
bar.add("Number of words", ["Total words (with repeats)", "Total words (distinct)"], [word_count, word_count_unique], is_label_show=True)
page.add_chart(bar)
# Pie chart
pie = Pie("Number of words")
pie.add("", ["", "Distinct words"], [other_word_count, word_count_unique], is_label_show=False)
# pie.print_echarts_options()
pie._option["color"] = ["lightgreen", "red"]
page.add_chart(pie)

# Split the (word, count) pairs into separate axis lists
x = []
y = []
for wc in most_common_words:
    x.append(wc[0])
    y.append(wc[1])

# Word-frequency bar chart
bar0 = Bar("Word frequencies", "Top 20", width=2000)
bar0.add("Frequency", x[:20], y[:20], is_label_show=True)
page.add_chart(bar0)

# Word-frequency bar chart
bar1 = Bar("Word frequencies", "Ranks 21-40", width=2000)
bar1.add("Frequency", x[20:40], y[20:40], is_label_show=True)
page.add_chart(bar1)

# Word cloud
word_cloud = WordCloud(width=1000, height=1000)
word_cloud.add("High-frequency words", x, y, word_size_range=[10, 200])
page.add_chart(word_cloud)
# Render the pyecharts page to an HTML file and open it in the browser
page.render(html_filename)
webbrowser.open(html_filename)
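
One way to check the conclusion against actual numbers is to measure how much of the text the n most frequent words cover. What follows is a minimal sketch and not part of the original script: the helper names `tokenize` and `coverage` and the sample sentence are illustrative assumptions, and the tokenizer is deliberately simpler than the contraction handling in the script.

```python
import collections
import re


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters (apostrophes become spaces)."""
    return re.findall(r"[a-z]+", text.lower().replace("'", " "))


def coverage(counter, n):
    """Return the fraction (0..1) of all word occurrences covered by the n most common words."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counter.most_common(n))
    return top / total


# Hypothetical sample; in practice this would be the full text read from the file
sample = (
    "To be, or not to be, that is the question: "
    "Whether 'tis nobler in the mind to suffer"
)
counts = collections.Counter(tokenize(sample))
for n in (1, 3, 5):
    print(f"top {n} words cover {coverage(counts, n):.0%} of the sample")
```

Running `coverage(tc, n)` on the script's own counter for n = 100, 1000, and so on would show how small a word list is needed to cover most of the text.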
