Scraping Sina Weibo post lists and post content with Scrapy

Code: GitHub

Reference: blog post

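This project uses the Scrapy framework to crawl the profile information and posts of specified Weibo accounts.

Weibo accounts ranked by follower count as of January 15, 2019:

Crawling method: call the JSON API endpoints used by the mobile web version of Weibo (m.weibo.cn).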

1. Override the start_requests method

    def start_requests(self):
        weibo_id = [1195354434, ]
        for wid in weibo_id:
            url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(wid)
            print(url)
            yield Request(url, callback=self.parse_userInfo, dont_filter=True,
                          meta={'uid': str(wid)})

2. Parse the profile information and obtain the containerid
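A minimal sketch of this step, assuming the user-info response lists its tabs under `data.tabsInfo.tabs` and marks the post tab with `tab_type == 'weibo'`. That layout, the helper names `extract_containerid` and `first_page_url`, and the sample payload are assumptions for illustration, not code from the original spider.

```python
import json
from typing import Optional

API = 'https://m.weibo.cn/api/container/getIndex?type=uid&value='


def extract_containerid(body: str, tab_type: str = 'weibo') -> Optional[str]:
    """Return the containerid of the tab with the given tab_type, or None."""
    data = json.loads(body).get('data') or {}
    tabs = (data.get('tabsInfo') or {}).get('tabs') or []
    # Assumption: tabs is normally a list, but tolerate a dict keyed by index
    if isinstance(tabs, dict):
        tabs = list(tabs.values())
    for tab in tabs:
        if tab.get('tab_type') == tab_type:
            return tab.get('containerid')
    return None


def first_page_url(uid: str, containerid: str) -> str:
    """Build the URL of page 1 of a user's post list."""
    return API + uid + '&containerid=' + containerid + '&page=1'


# Made-up payload that follows the assumed layout
sample = json.dumps({
    'data': {
        'userInfo': {'id': 1195354434},
        'tabsInfo': {'tabs': [
            {'tab_type': 'profile', 'containerid': '2302831195354434'},
            {'tab_type': 'weibo', 'containerid': '1076031195354434'},
        ]},
    }
})

cid = extract_containerid(sample)
print(first_page_url('1195354434', cid))
```

Inside the spider, `parse_userInfo` could then yield `Request(first_page_url(uid, cid), callback=self.parse_weibo_list, meta={'uid': uid, 'containerid': cid, 'page': 1})`, which supplies the three `meta` keys that `parse_weibo_list` reads.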

3. Crawl the account's posts and the accounts it follows

    # Parse the list of posts
    # (needs `import json` and `from scrapy import Request` at module level)
    def parse_weibo_list(self, response):
        # Keep these values so that the next page can be requested
        next_page = str(int(response.meta['page']) + 1)
        uid = response.meta['uid']
        containerid = response.meta['containerid']
        data = response.text
        content = json.loads(data).get('data')
        cards = content.get('cards')
        if (len(cards) > 0):
            print("-----Crawling page %s-----" % str(response.meta['page']))
            for j in range(len(cards)):
                card_type = cards[j].get('card_type')
                # Post
                # if card_type == 9:
                #     mblog = cards[j].get('mblog')
                #     attitudes_count = mblog.get('attitudes_count')  # number of likes
                #     comments_count = mblog.get('comments_count')  # number of comments
                #     created_at = self.date_format(mblog.get('created_at'))  # publication time
                #     reposts_count = mblog.get('reposts_count')  # number of reposts
                #     scheme = cards[j].get('scheme')  # URL of the post
                #     # Replace <br /> with newlines, then extract the plain text
                #     text = etree.HTML(str(mblog.get('text')).replace('<br />', '\n')).xpath('string()')  # post text
                #     pictures = mblog.get('pics')  # images in the post body, returned as a list
                #     pic_urls = []  # image URLs
                #     if pictures:
                #         for picture in pictures:
                #             pic_url = picture.get('large').get('url')
                #             pic_urls.append(pic_url)
                #     uid = response.meta['uid']
                #     # Save the data
                #     sinaitem = SinaItem()
                #     sinaitem["uid"] = uid
                #     sinaitem["text"] = text
                #     sinaitem["scheme"] = scheme
                #     sinaitem["attitudes_count"] = attitudes_count
                #     sinaitem["comments_count"] = comments_count
                #     sinaitem["created_at"] = created_at
                #     sinaitem["reposts_count"] = reposts_count
                #     sinaitem["pictures"] = pic_urls
                #     yield sinaitem
                # Follow information
                if card_type == 11:
                    # Get the URL that lists the accounts this user follows.
                    # To see the underlying requests, inspect the network traffic of:
                    # https://m.weibo.cn/p/index?containerid=231051_-_followers_-_1195354434_-_1042015%3AtagCategory_050&luicode=10000011&lfid=1076031195354434
                    fllow_url = str(cards[j]['card_group'][0]['scheme']).replace('https://m.weibo.cn/p/index?', 'https://m.weibo.cn/api/container/getIndex?')
                    print(fllow_url, '----')
                    yield Request(url=fllow_url, callback=self.parse_fllow)
            # Link to the next page
            # weibo_list_url = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=' + uid + '&containerid=' + containerid + '&page=' + next_page
            # response.meta['page'] = next_page
            # yield Request(weibo_list_url, callback=self.parse_weibo_list, meta=response.meta)

4. Repeat the process for each followed account, using its ID

    # Collect the accounts that this user follows
    def parse_fllow(self, response):
        data = response.text
        content = json.loads(data).get('data')
        cards = content.get('cards')
        # if len(cards) > 0:
        for card in cards:
            # The card title is matched literally; it means "all accounts he follows"
            if card.get('title') == '他的全部关注':
                for tmp in card.get('card_group'):
                    user = tmp.get('user')
                    # ID of the followed account
                    uid = user.get('id')
                    yield Request('https://m.weibo.cn/api/container/getIndex?type=uid&value=' + str(uid),
                                  callback=self.parse_userInfo, dont_filter=True,
                                  meta={'uid': str(uid)})

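Because this process loops (every followed account leads to more followed accounts), the crawl only finishes if you add a stopping condition, assuming your IP is not banned first.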
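One possible stopping condition is sketched below: a small helper that remembers which uids have been scheduled and refuses new ones beyond a depth or a total count. The class name `CrawlBudget` and its limits are hypothetical. Manual de-duplication matters here because the requests above pass `dont_filter=True`, which bypasses Scrapy's built-in duplicate filter. Scrapy's `DEPTH_LIMIT` setting is a built-in alternative for the depth cap alone.

```python
class CrawlBudget:
    """Tracks scheduled uids and caps crawl depth and the total number of users."""

    def __init__(self, max_depth=2, max_users=1000):
        self.max_depth = max_depth
        self.max_users = max_users
        self.seen = set()

    def should_crawl(self, uid, depth):
        """Return True (and record the uid) only if it is new and within limits."""
        uid = str(uid)
        if depth > self.max_depth:
            return False
        if uid in self.seen or len(self.seen) >= self.max_users:
            return False
        self.seen.add(uid)
        return True


budget = CrawlBudget(max_depth=1, max_users=2)
print(budget.should_crawl(1195354434, depth=0))    # first visit to the seed account
print(budget.should_crawl('1195354434', depth=1))  # same uid again, rejected
```

In `parse_fllow`, the `yield Request(...)` could be wrapped in `if self.budget.should_crawl(uid, depth):`, with `depth` carried along in `meta` and incremented at each hop.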

You can first filter for the users you are interested in, and then crawl only their posts.
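A filter of this kind might look like the sketch below. It assumes each `user` dict returned by the API carries `followers_count` and `verified` fields; those field names, the threshold, and the function name `is_interesting` are assumptions, so check them against a real response before relying on them.

```python
def is_interesting(user, min_followers=100000, require_verified=False):
    """Decide whether a followed account is worth crawling."""
    try:
        followers = int(user.get('followers_count', 0))
    except (TypeError, ValueError):
        # The count is missing or not a plain number: skip the account
        return False
    if require_verified and not user.get('verified'):
        return False
    return followers >= min_followers


users = [
    {'id': 1, 'followers_count': 500000, 'verified': True},
    {'id': 2, 'followers_count': 10, 'verified': False},
]
print([u['id'] for u in users if is_interesting(u)])
```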

To avoid being banned, route requests through proxy IPs; this can be added in a downloader middleware.
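A minimal sketch of such a downloader middleware follows. It relies on the fact that Scrapy's built-in `HttpProxyMiddleware` uses `request.meta['proxy']` when it is set. The class name, the `PROXY_LIST` setting, and the proxy address are hypothetical.

```python
import random
from types import SimpleNamespace


class RandomProxyMiddleware:
    """Downloader middleware that assigns a random proxy to each request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting, e.g.
        # PROXY_LIST = ['http://127.0.0.1:8888']
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # Leave requests alone if a proxy was already chosen for them
        if self.proxies and 'proxy' not in request.meta:
            request.meta['proxy'] = random.choice(self.proxies)
        return None  # let the request continue through the middleware chain


# Stand-in for a scrapy Request, only to show the effect without running a crawl
demo_request = SimpleNamespace(meta={})
RandomProxyMiddleware(['http://127.0.0.1:8888']).process_request(demo_request, spider=None)
print(demo_request.meta['proxy'])
```

To enable it, register the class in `DOWNLOADER_MIDDLEWARES` in `settings.py` under your project's own module path, with a priority below 750 so that it runs before the built-in `HttpProxyMiddleware`.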
