Scraping Sina Weibo with Scrapy: #陈情令 (The Untamed)

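1. Motivation

The drama 陈情令 (The Untamed) has been hugely popular over the past few days, and its actors, including #肖战 (Xiao Zhan) and #王一博 (Wang Yibo), have become household names. So I decided to use Scrapy to scrape an actor's Weibo posts and analyze them.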

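2. Goal

The target of this crawl is the public information of the account X玖少年团肖战DAYTOY, such as the nickname, avatar, following and follower lists, and published posts. The scraped data is saved to MySQL and then plotted as a chart.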

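3. Preparation

Make sure a proxy pool and a Cookies pool are already implemented and running, and install the Scrapy and PyMySQL libraries. I also registered four new Weibo accounts to reduce the risk of the crawler being banned.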

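4. Crawling approach

Requests + BeautifulSoup can scrape Weibo quite flexibly, but a simple Requests script sends its requests one after another, so concurrency is low and crawling is slow. Scrapy schedules requests asynchronously, so Scrapy + XPath crawls much more efficiently.

First we crawl X玖少年团肖战DAYTOY. Starting from the user's URL, we scrape the text, publish time, repost count, comment count, and like count of each post, write them to a MySQL database through an Item Pipeline, and then plot the data as a chart.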

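5. Crawl analysis

Sina Weibo has three sites: weibo.cn, m.weibo.cn, and weibo.com, in increasing order of scraping difficulty. To keep the difficulty down, we scrape https://weibo.cn, which is Weibo's mobile site. Opening this site redirects to the login page, because the home page requires login. We can get around this by sending the cookies of a logged-in account and opening a user's profile page directly. Here we open the page of X玖少年团肖战DAYTOY at https://weibo.cn/u/1792951112, which takes us to the profile page, as shown in the figure below.

The information we need to scrape from this page is each post's text, publish time, repost count, comment count, and like count.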
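The cookie-based access described in this section can be sketched with the standard library alone. This is a minimal illustration, not the crawler itself: the helper names `cookie_header_to_dict` and `build_profile_request` and the cookie names and values are placeholders I made up, and the request is only built, never sent.

```python
from urllib.request import Request


def cookie_header_to_dict(raw_cookie: str) -> dict:
    """Parse a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for pair in raw_cookie.split(";"):
        pair = pair.strip()
        if "=" not in pair:
            continue
        # Split on the first '=' only, because values may contain '='.
        key, value = pair.split("=", 1)
        cookies[key] = value
    return cookies


def build_profile_request(uid: str, page: int, raw_cookie: str) -> Request:
    """Build (but do not send) a request for one page of a user's posts."""
    url = f"https://weibo.cn/u/{uid}?page={page}"
    return Request(url, headers={"Cookie": raw_cookie})


# Placeholder cookie string; real names and values come from a logged-in session.
RAW_COOKIE = "name1=value1; name2=a=b"
request = build_profile_request("1792951112", 1, RAW_COOKIE)
print(request.full_url)
print(cookie_header_to_dict(RAW_COOKIE))
```

In Scrapy, a dict of this shape can be passed through the `cookies` argument of `scrapy.Request`, which is what the `transform` helper in the spider is meant to prepare.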

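6. Creating the crawler

Next we implement the crawl with Scrapy. First create a project with the following command:

scrapy startproject weibo

Enter the project directory and create a Spider named weibocn with the following command:

scrapy genspider weibocn weibo.cn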

7. Creating the Items

import scrapy


class WeiboItem(scrapy.Item):
    text_fanart = scrapy.Field()     # text of an original post
    text_transfrom = scrapy.Field()  # text of a reposted post
    comment = scrapy.Field()         # comment count
    like = scrapy.Field()            # like count
    transmit = scrapy.Field()        # repost count
    time = scrapy.Field()            # publish time
    _from = scrapy.Field()           # client the post was sent from

8. Writing the spider

import scrapy

from weibo.items import WeiboItem


class WeibocnSpider(scrapy.Spider):
    name = 'weibocn'
    allowed_domains = ['weibo.cn']
    url = 'https://weibo.cn/u/1792951112?page={page}'

    def transform(self, cookies):
        """Convert a raw 'k1=v1; k2=v2' cookie string into a dict."""
        cookie_dict = {}
        for pair in cookies.replace(' ', '').split(';'):
            if '=' not in pair:
                continue
            # Split on the first '=' only, because cookie values may contain '='.
            key, value = pair.split('=', 1)
            cookie_dict[key] = value
        return cookie_dict

    # Override start_requests(); each Request plays the role of requests.get().
    def start_requests(self):
        # Pages 1 to 64 of the account's post list.
        for page in range(1, 65):
            yield scrapy.Request(url=self.url.format(page=page), callback=self.parse)

    # `response` here corresponds to response = requests.get(...).
    def parse(self, response):
        for p in response.xpath("//div[@class='c' and @id]"):
            try:
                # Create a fresh item per post so yielded items do not share state.
                item = WeiboItem()
                # Keep only the Chinese characters of the repost text.
                item['text_transfrom'] = "".join(p.xpath("./div/text()").re(r'[\u4e00-\u9fa5]'))
                item['text_fanart'] = "".join(p.xpath("./div/span[@class='ctt']/text()").extract())
                item['like'] = "".join(p.xpath("./div/a").re(r'赞\[[0-9]*?\]')).replace('赞[', '').replace(']', '')
                item['transmit'] = "".join(p.xpath("./div/a").re(r'转发\[[0-9]*?\]')).replace('转发[', '').replace(']', '')
                item['comment'] = "".join(p.xpath("./div/a").re(r'评论\[[0-9]*?\]')).replace('评论[', '').replace(']', '')
                # The 'ct' span holds "<time>\xa0来自<client>".
                time_from = "".join(p.xpath("./div/span[@class='ct']/text()").extract()).split("\xa0来自")
                item['time'] = time_from[0]
                item['_from'] = time_from[1]
                yield item
            except Exception as e:
                # Skip posts that do not match the expected layout.
                print(e)
                continue
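The count extraction in `parse` matches strings such as 赞[123] and then strips the label with `replace`. The same idea can be written with a capture group, which also makes the "label missing" case explicit. The sketch below uses only the standard library; the sample strings are made-up stand-ins for the page text, and `extract_count` and `split_time_and_client` are hypothetical helper names, not part of the project.

```python
import re

# Hypothetical text of the action links under one post.
SAMPLE_LINKS = "赞[1024] 转发[56] 评论[789] 收藏"


def extract_count(label: str, text: str) -> int:
    """Return the number inside 'label[...]', or 0 when the label is absent."""
    match = re.search(re.escape(label) + r"\[(\d+)\]", text)
    return int(match.group(1)) if match else 0


def split_time_and_client(ct_text: str) -> tuple:
    """Split '<time><nbsp>来自<client>' into (time, client); client may be empty."""
    time_part, _, client = ct_text.partition("\xa0来自")
    return time_part.strip(), client.strip()


print(extract_count("赞", SAMPLE_LINKS))
print(extract_count("转发", SAMPLE_LINKS))
print(extract_count("评论", SAMPLE_LINKS))
print(split_time_and_client("07月26日 21:15\xa0来自iPhone客户端"))
```

Using `partition` instead of `split(...)[1]` means a post without a "来自" suffix yields an empty client string rather than raising `IndexError`.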

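9. Data cleaning

Some posts do not show a standard timestamp. The time may appear as "just now" (刚刚), "a few minutes ago" (几分钟前), "a few hours ago" (几小时前), "yesterday" (昨天), and so on. We need to convert these into one format. The method below handles 刚刚, N分钟前, 今天 HH:MM, and MM月DD日 HH:MM, and truncates anything else to 16 characters:

    # Requires "from datetime import datetime, timedelta" at module level.
    def clear_date(self, publish_time):
        if "刚刚" in publish_time:
            # "just now": use the current time
            publish_time = datetime.now().strftime('%Y-%m-%d %H:%M')
        elif "分钟" in publish_time:
            # "N minutes ago": subtract N minutes from the current time
            minute = publish_time[:publish_time.find("分钟")]
            minute = timedelta(minutes=int(minute))
            publish_time = (datetime.now() - minute).strftime("%Y-%m-%d %H:%M")
        elif "今天" in publish_time:
            # "today HH:MM": prepend today's date
            today = datetime.now().strftime("%Y-%m-%d")
            clock = publish_time.replace('今天', '').strip()
            publish_time = today + " " + clock
        elif "月" in publish_time:
            # "MM月DD日 HH:MM": prepend the current year
            year = datetime.now().strftime("%Y")
            publish_time = str(publish_time)
            publish_time = year + "-" + publish_time.replace('月', '-').replace('日', '')
        else:
            # Already a full timestamp: keep "YYYY-MM-DD HH:MM"
            publish_time = publish_time[:16]
        return publish_time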
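To check the conversion rules without depending on the current clock, the same branches can be written as a standalone function that takes the reference time as a parameter. This is a sketch under that assumption: `normalize_time` and the fixed `NOW` value are illustrative names, and the sample inputs are my guesses at typical weibo.cn time strings.

```python
from datetime import datetime, timedelta


def normalize_time(publish_time: str, now: datetime) -> str:
    """Convert a relative Weibo time string to 'YYYY-MM-DD HH:MM'."""
    publish_time = publish_time.strip()
    if "刚刚" in publish_time:
        return now.strftime("%Y-%m-%d %H:%M")
    if "分钟" in publish_time:
        minutes = int(publish_time[:publish_time.find("分钟")])
        return (now - timedelta(minutes=minutes)).strftime("%Y-%m-%d %H:%M")
    if "今天" in publish_time:
        clock = publish_time.replace("今天", "").strip()
        return now.strftime("%Y-%m-%d") + " " + clock
    if "月" in publish_time:
        return f"{now.year}-" + publish_time.replace("月", "-").replace("日", "")
    return publish_time[:16]


# Fixed reference time so the output is reproducible.
NOW = datetime(2019, 8, 1, 12, 0)
for raw in ["刚刚", "5分钟前", "今天 09:30", "07月26日 21:15", "2018-12-31 23:59:59"]:
    print(raw, "->", normalize_time(raw, NOW))
```

Passing `now` in explicitly keeps the function pure, so each branch can be verified with a fixed date.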

10. Writing the pipelines

import pymysql


class WeiboPipeline(object):
    def __init__(self):
        # Open the connection; keyword arguments are used because recent
        # PyMySQL versions no longer accept positional ones.
        self.conn = pymysql.connect(host='localhost', user='root', password='root', database='sina')
        # Create a cursor
        self.cursor = self.conn.cursor()

    # Pipeline step: store each item in MySQL
    def process_item(self, item, spider):
        sql = "INSERT INTO weibocn(text_fanart,text_transfrom,comment,`like`,transmit,`time`,`from`) VALUES (%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(sql, (item['text_fanart'], item['text_transfrom'], item['comment'], item['like'], item['transmit'], item['time'], item['_from']))
        self.conn.commit()
        return item

    # Close the cursor and the connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
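The open / insert / close flow of the pipeline can be tried without a MySQL server by swapping in the standard-library `sqlite3` module. This is a stand-in for illustration only, not the project's pipeline: the class name `SqlitePipelineSketch`, the all-TEXT table schema, and the sample item are assumptions, and sqlite3 uses `?` placeholders where PyMySQL uses `%s`.

```python
import sqlite3


class SqlitePipelineSketch:
    """Same open/insert/close flow as WeiboPipeline, with sqlite3 standing in for MySQL."""

    FIELDS = ("text_fanart", "text_transfrom", "comment", "like", "transmit", "time", "_from")

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        # Hypothetical schema: every column stored as TEXT.
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS weibocn ('
            'text_fanart TEXT, text_transfrom TEXT, comment TEXT, '
            '"like" TEXT, transmit TEXT, "time" TEXT, "from" TEXT)'
        )

    def process_item(self, item, spider=None):
        # Parameterized insert: values are bound, never formatted into the SQL.
        sql = (
            'INSERT INTO weibocn (text_fanart, text_transfrom, comment, '
            '"like", transmit, "time", "from") VALUES (?, ?, ?, ?, ?, ?, ?)'
        )
        self.conn.execute(sql, tuple(item[name] for name in self.FIELDS))
        self.conn.commit()
        return item

    def close_spider(self, spider=None):
        self.conn.close()


sample_item = {
    "text_fanart": "原创内容示例",
    "text_transfrom": "",
    "comment": "789",
    "like": "1024",
    "transmit": "56",
    "time": "2019-07-26 21:15",
    "_from": "iPhone客户端",
}
pipeline = SqlitePipelineSketch()
pipeline.process_item(sample_item)
rows = pipeline.conn.execute('SELECT "like", "from" FROM weibocn').fetchall()
print(rows)
pipeline.close_spider()
```

The column names `like`, `time`, and `from` are quoted because they collide with SQL keywords, which is also why the MySQL statement wraps them in backticks.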

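11. Enabling the pipeline

Open settings.py and enable the pipeline. The number sets the order in which pipelines run; lower values run first.

ITEM_PIPELINES = {
    'weibo.pipelines.WeiboPipeline': 300,
}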

12. Running Scrapy

scrapy crawl weibocn

The result is shown in the figure below:

13. Database insert error: InternalError: (pymysql.err.InternalError) (1366, "Incorrect string value: '\\xE6\\xAD

Solution: https://blog.csdn.net/sinat_41721615/article/details/94979429
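As background for this error: the bytes \xE6\xAD in the message are the start of a UTF-8 encoded Chinese character, and error 1366 is typically raised when the column's character set cannot store such a character. The snippet below only illustrates that point, using Python codecs as a rough stand-in for MySQL character sets; `can_encode` is a hypothetical helper and 正 is just one character whose encoding begins with those two bytes.

```python
def can_encode(text: str, codec: str) -> bool:
    """Return True if `text` can be represented in the given Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


sample = "正"
encoded = sample.encode("utf-8")
print(encoded)                         # b'\xe6\xad\xa3'
print(can_encode(sample, "latin-1"))   # False
print(can_encode(sample, "utf-8"))     # True
```

A common remedy, which may or may not be the one in the linked post, is to use utf8mb4 on both sides: convert the table (for example `ALTER TABLE weibocn CONVERT TO CHARACTER SET utf8mb4`) and pass `charset='utf8mb4'` to `pymysql.connect`.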

14. Drawing the chart

The chart here is drawn with matplotlib. I have no eye for design, so it came out rather ugly.

import csv

import matplotlib.pyplot as plt

# Use a font that can render Chinese; set it before any text is drawn.
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

comment_list = []
like_list = []
transmit_list = []
time_list = []
# Each CSV row holds: comment count, like count, repost count, publish time.
with open('weibocn.csv', 'r', newline='') as file:
    for comment, like, transmit, time in csv.reader(file):
        # Convert counts to int so they are plotted as numbers, not categories;
        # an empty field counts as 0.
        comment_list.append(int(comment or 0))
        like_list.append(int(like or 0))
        transmit_list.append(int(transmit or 0))
        time_list.append(time)

x = time_list
labels = ['评论数', '点赞数', '转发数']  # comments, likes, reposts
colors = ['red', 'blue', 'green']
series = [comment_list, like_list, transmit_list]
for i in range(3):
    plt.plot(x, series[i], c=colors[i], label=labels[i])

# Axis labels and title ("Xiao Zhan's Weibo traffic")
plt.xlabel('Publish time')
plt.ylabel('Count')
plt.title('肖战微博流量情况')
# Show the legend for all three series
plt.legend()
# Display the figure
plt.show()

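Drawing the chart was the biggest headache. If anyone who is good at plotting can improve my chart, please leave a comment. I would be very grateful!

GitHub project: https://github.com/jlysh/weibocn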
