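Development environment: Windows 7/8/10, Python 3+

Python modules: requests, bs4, matplotlib, jieba, wordcloud, PIL, numpy, random

Features and approach:

(1) Open the Douban comment section for 《狂暴巨兽》 (Rampage) and capture three pieces of information based on the HTML structure:

First, each account's rating: 5, 4, 3, 2, or 1 stars;

Second, each account's comment text;

Third, the HTTP link to the next page of comments.

(2) Once all the information has been collected, process it:

First, count the total for each star level and how many accounts submitted a rating overall;

Second, concatenate all the comment text and clean up spaces and other irregular formatting in it.

(3) Draw a pie chart of the rating shares with matplotlib, segment the text into words with jieba, and generate a word cloud with wordcloud.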
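
The processing in step (2) can be sketched with the standard library alone. This is a minimal illustration under assumptions: the sample class names and comment string are made up, and `tally_stars` and `clean_comment` are hypothetical helpers, not functions from the script in this article.

```python
import re
from collections import Counter

# Mapping from the CSS class on the rating <span> to a star level.
STAR_CLASSES = {
    'allstar50': 5, 'allstar40': 4, 'allstar30': 3,
    'allstar20': 2, 'allstar10': 1,
}

def tally_stars(span_classes):
    """Count accounts per star level; unknown classes are skipped."""
    counts = Counter({stars: 0 for stars in STAR_CLASSES.values()})
    for cls in span_classes:
        stars = STAR_CLASSES.get(cls)
        if stars is not None:
            counts[stars] += 1
    return counts

def clean_comment(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample input: one entry ('comment-time') is not a rating class.
classes = ['allstar50', 'allstar40', 'allstar50', 'comment-time', 'allstar10']
counts = tally_stars(classes)
total = sum(counts.values())  # number of accounts that actually rated
print(counts[5], total)
print(clean_comment('  good\n   movie \t'))
```

Skipping unknown classes means that accounts which commented without rating are left out of the total instead of being miscounted.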

Final result:

Implementation steps and code:

import requests
from bs4 import BeautifulSoup
import random
import matplotlib.pyplot as plt
import jieba
from wordcloud import WordCloud
import PIL.Image
import numpy as np

agents = ["Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0; Baiduspider-ads) Gecko/17.0 Firefox/17.0",
    "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9b4) Gecko/2008030317 Firefox/3.0b4",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
    "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
    "Mozilla/5.0 (Windows; U; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; BIDUBrowser 7.6)",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
    "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.99 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64; Trident/7.0; Touch; LCJB; rv:11.0) like Gecko",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
    "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    "Mozilla/4.8 [en] (Windows NT 5.1; U)",
    "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
    "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522  (KHTML, like Gecko) Safari/419.3",
    "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10  (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
    "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
    "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5  (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
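]  # anti-scraping measure: list of browser User-Agent strings
heads = {  # headers used when fetching the proxy IP pool
    'User-Agent': random.choice(agents),
    'ue': 'utf-8'  # per the original: enable Chinese support
}
list_all = []  # holds the information collected from every comment page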

# Fetch the proxy IP pool
def get_ip_list():
    urlip = 'http://www.xicidaili.com/nn/'
    html = requests.get(urlip, headers=heads).text
    soup = BeautifulSoup(html, 'html.parser')
    ips = soup.find_all('tr')
    ip_list = []
    for i in range(1, len(ips)):  # start at 1 to skip the table header row
        ip_info = ips[i]
        tds = ip_info.find_all('td')
        ip_list.append(tds[1].text + ':' + tds[2].text)
    return ip_list

# Pick one random IP from the proxy pool and return it
def get_random_ip():
    ip_list = get_ip_list()
    proxy_list = []
    for ip in ip_list:
        proxy_list.append('http://' + ip)
    proxy_ip = random.choice(proxy_list)
    proxies = {'http': proxy_ip}
    return proxies

# Headers used when fetching the comment pages
heade = {
    'User-Agent': random.choice(agents),
    'ue': 'utf-8'  # per the original: enable Chinese support
}
# Anti-scraping measure: a random IP from the pool, chosen once at startup
proxies = get_random_ip()

# Fetch the full HTML text of one comment page
def get_html(url):
    # headers and proxies go in their own keyword arguments; passing them
    # as params= would append them to the URL query string instead
    response = requests.get(url, headers=heade, proxies=proxies)
    response.encoding = 'utf-8'
    html = response.text
    return html

# Get the URL of the next page of comments
def get_url(html):
    bs = BeautifulSoup(html, 'html.parser')
    url_list = bs.find_all('div', attrs={'id': "paginator"})
    if len(url_list) > 0:
        next_link = url_list[0].find('a', attrs={'class': "next"})
        if next_link is not None:  # guard against a missing "next" link
            url_next = 'https://movie.douban.com/subject/26430636/comments' + next_link['href']
            return url_next
    return None

# Collect the count for each star level and the comment text on one page
def get_star_and_conments(url_comment):
    comment_html = get_html(url_comment)
    bs = BeautifulSoup(comment_html, 'html.parser')
    comment_list = bs.find_all('div', attrs={'class': "comment-item"})
    comment_all = ''
    five_star = 0
    four_star = 0
    three_star = 0
    two_star = 0
    one_star = 0

    for comment in comment_list:
        comments = (comment.find('p')).text
        comment_all += comments
        spans = comment.find_all('span')
        # the rating class is read from the fifth <span>; items without one count as unrated
        span = spans[4].get('class', ['']) if len(spans) > 4 else ['']
        if span[0] == 'allstar50':
            five_star += 1
        elif span[0] == 'allstar40':
            four_star += 1
        elif span[0] == 'allstar30':
            three_star += 1
        elif span[0] == 'allstar20':
            two_star += 1
        elif span[0] == 'allstar10':
            one_star += 1
    return [comment_all.strip().replace('\n', '').replace('         ', ';'), [five_star, four_star, three_star, two_star, one_star]]

# Recursive driver: after one page has been processed, automatically fetch the next
def get_all(url):
    all_list = get_star_and_conments(url)
    comment_url = get_url(get_html(url))
    list_all.append(all_list)
    if comment_url is not None:
        get_all(comment_url)

# Main entry point
if __name__ == '__main__':
    url = 'https://movie.douban.com/subject/26430636/comments?start=0&limit=20&sort=new_score&status=P&percent_type='
    get_all(url)
    comment = ''
    five_star = 0
    four_star = 0
    three_star = 0
    two_star = 0
    one_star = 0

    for i in list_all:
        comment += i[0]
        five_star += i[1][0]
        four_star += i[1][1]
        three_star += i[1][2]
        two_star += i[1][3]
        one_star += i[1][4]
    # named total rather than all, which would shadow the built-in all()
    total = five_star + four_star + three_star + two_star + one_star
    # print(comment, one_star, two_star, three_star, four_star, five_star)
    with open('狂暴巨兽影评.txt', 'w', encoding='utf-8') as f_obj:  # write the comment text to a file
        f_obj.write(comment)

    # Draw the pie chart with matplotlib and save it as a PNG image
    labels = 'one star', 'two star', 'three star', 'four star', 'five star'
    faces = [one_star / total, two_star / total, three_star / total, four_star / total, five_star / total]
    explode = [0, 0, 0, 0, 0.1]  # pull the five-star slice out slightly
    colors = ['red', 'yellow', 'blue', 'green', 'orange']
    patches, l_text, p_text = plt.pie(x=faces, labels=labels, explode=explode, colors=colors,
                                      autopct='%3.1f%%', shadow=True, labeldistance=1.1,
                                      startangle=90, pctdistance=0.6)
    for t in l_text:
        t.set_size(10)  # set_size is a method; assigning to it has no effect
    for t in p_text:
        t.set_size(30)
    plt.axis('equal')  # equal aspect ratio so the pie is drawn as a circle
    plt.title('the comment', fontsize=10)
    plt.savefig('狂暴巨兽-pie-comment.png')
    plt.show()

    # Generate the word cloud with wordcloud and save it as a JPG image
    path = r'F:\文档类\python36\图像处理\STXINWEI.TTF'  # a font that can render Chinese
    alien_mask = np.array(PIL.Image.open(r'F:\meizitu\54552016a-08-10\03.jpg'))
    wc = WordCloud(font_path=path, background_color='white', margin=5, width=1800, height=800,
                   mask=alien_mask, max_words=2000, max_font_size=60, random_state=42)
    a = []
    words = list(jieba.cut(comment))
    for word in words:
        if len(word) > 1:  # drop single-character tokens
            a.append(word)
    txt = r' '.join(a)
    wc = wc.generate(txt)
    wc.to_file('狂暴巨兽词云.jpg')
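
As a design note, the recursive `get_all` adds one stack frame per comment page. A loop avoids that and makes it easy to cap the number of pages. The sketch below is an assumption-laden illustration, not the article's method: `crawl_all` and `fetch_page` are hypothetical names, and a fake three-page site stands in for real HTTP requests.

```python
def crawl_all(start_url, fetch_page, max_pages=100):
    """Follow next-page links iteratively.

    fetch_page(url) is a hypothetical callable returning (page_data, next_url),
    where next_url is None on the last page.
    """
    results = []
    seen = set()  # guards against a page that links back to an earlier one
    url = start_url
    while url is not None and url not in seen and len(results) < max_pages:
        seen.add(url)
        page_data, url = fetch_page(url)
        results.append(page_data)
    return results

# A fake site: each key maps to (page data, next page key or None).
fake_site = {
    'p1': ('page one', 'p2'),
    'p2': ('page two', 'p3'),
    'p3': ('page three', None),
}
pages = crawl_all('p1', fake_site.__getitem__)
print(pages)
```

In the real script, `fetch_page` could fetch the HTML once and return both the parsed page data and the result of `get_url`, which would also avoid downloading every page twice.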

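Summary:

Consider the URL https://movie.douban.com/subject/26430636/comments?start=0&limit=20&sort=new_score&status=P&percent_type=

Here "26430636" is the ID that identifies the movie. Replacing it with another movie's ID lets the script read that movie's comments and produce the matplotlib rating chart and the wordcloud word cloud for it. Note that the same ID is also hard-coded in get_url, so it must be changed there as well.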
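
One way to make the movie ID a parameter is to build the comment URL from it. This is a minimal sketch: `build_comments_url` is a hypothetical helper, and the query parameters are simply copied from the URL above.

```python
from urllib.parse import urlencode

def build_comments_url(subject_id, start=0, limit=20):
    """Build the first comment-page URL for a Douban movie ID."""
    query = urlencode({
        'start': start,
        'limit': limit,
        'sort': 'new_score',
        'status': 'P',
        'percent_type': '',
    })
    return 'https://movie.douban.com/subject/{}/comments?{}'.format(subject_id, query)

url = build_comments_url('26430636')
print(url)
```

With the ID passed in like this, the base URL used in get_url could be derived from the same value instead of being written out a second time.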
