Introduction to Scrapy

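Scrapy is a fast, high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It serves a wide range of purposes, from data mining to monitoring and automated testing. Earlier in this series we fetched pages with Requests and extracted element content with BS4 or another HTML parser. That approach suits small data volumes: a handful of pages parsed one at a time, with the results stitched together by hand. For large-scale, production-grade crawling it becomes cumbersome, because request scheduling, link following, and storage all have to be written by hand, so we should use Scrapy instead.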

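Scrapy is a complete crawling framework. Its architecture is built from the following components: the Engine, the Scheduler, the Downloader, the Spiders, and the Item Pipelines, with Downloader Middlewares and Spider Middlewares sitting between them.

Scrapy breaks a crawling job into well-defined modules. Its workflow is as follows: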

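Step 1: The Engine gets the initial Requests to crawl from the Spider;
Step 2: The Engine schedules the Requests in the Scheduler and asks for the next Request to crawl;
Step 3: The Scheduler returns the next Request to the Engine;
Step 4: The Engine sends the Request to the Downloader, passing through the Downloader Middlewares;
Step 5: Once the page finishes downloading (from the Internet), the Downloader generates a Response carrying that page and sends it to the Engine, passing through the Downloader Middlewares;
Step 6: The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware;
Step 7: The Spider processes the Response and returns the scraped items and new Requests to the Engine, passing through the Spider Middleware;
Step 8: The Engine sends the processed items to the Item Pipelines (which store the scraped data, for example in a database or in a JSON or other file), then sends the processed Requests to the Scheduler and asks for the next Request to crawl;
Step 9: The process repeats (from step 3) until there are no more Requests from the Scheduler.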
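The loop described by these steps can be sketched with plain Python. This is only a toy model under stated assumptions: FAKE_WEB, download, spider_parse, and run_engine are hypothetical names invented for the illustration, there are no middlewares, and nothing here is Scrapy's actual implementation.

```python
from collections import deque

# Hypothetical stand-in for the Internet: URL -> (items on the page, links on the page).
FAKE_WEB = {
    "page1": (["item-a", "item-b"], ["page2"]),
    "page2": (["item-c"], []),
}


def download(url):
    """Downloader stand-in: build a response-like dict for the URL (steps 4-5)."""
    items, links = FAKE_WEB[url]
    return {"url": url, "items": items, "links": links}


def spider_parse(response):
    """Spider stand-in: yield scraped items and follow-up requests (step 7)."""
    for item in response["items"]:
        yield ("item", item)
    for link in response["links"]:
        yield ("request", link)


def run_engine(start_urls):
    """Engine stand-in: drive the scheduler until it is empty."""
    scheduler = deque(start_urls)      # steps 1-2: initial requests are scheduled
    seen = set(start_urls)             # avoid scheduling the same URL twice
    pipeline_output = []               # stands in for the Item Pipelines
    while scheduler:                   # step 9: repeat until no requests remain
        url = scheduler.popleft()      # step 3: next request from the scheduler
        response = download(url)       # steps 4-5: download the page
        for kind, value in spider_parse(response):   # steps 6-7: spider processes it
            if kind == "item":
                pipeline_output.append(value)         # step 8: items go to the pipeline
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)               # step 8: new requests are scheduled
    return pipeline_output


print(run_engine(["page1"]))  # ['item-a', 'item-b', 'item-c']
```

The sketch runs everything sequentially for readability; the point is only how items and new requests take different paths out of the spider.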

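In this workflow, stages such as the Spider's parsing and the Downloader's fetching are decoupled, so they do not have to wait for each other. Scrapy exploits this not through multiple processes but through asynchronous, non-blocking I/O (it is built on the Twisted networking library): many downloads can be in flight while responses that have already arrived are being parsed. This efficiency follows directly from Scrapy splitting the crawl into independent subtasks.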

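In short, starting from the initial URLs, the Scheduler hands requests (via the Engine) to the Downloader. The downloaded pages are passed to the Spider for analysis, and the Spider produces two kinds of results:

Links that need further crawling, such as a "next page" link, which are sent back to the Scheduler;
Data that needs to be saved, which is sent to the Item Pipeline for post-processing (detailed analysis, filtering, storage, and so on).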

Scraping Quotes to Scrape with Scrapy

We start with a simple example: scraping the quotes site Quotes to Scrape. It runs quickly, because http://quotes.toscrape.com/tag/humor/ has only two pages:

import scrapy


# Scrape the quotes site
class QuotesSpider(scrapy.Spider):
    """
    The core of a crawler is how the Spider processes data.
    If we do not customize any other component, we only need to
    subclass Spider and implement our own parsing logic.
    """
    # Name of the spider
    name = "quotes"
    # allowed_domains filters the domains that may be crawled: requests to domains
    # outside this list are dropped. The pages in start_urls are never filtered;
    # the filter applies only to pages requested after them.
    # allowed_domains = ["quotes.toscrape.com"]
    # URLs to start crawling from; several start pages can be listed in start_urls
    start_urls = [
        'http://quotes.toscrape.com/tag/humor/',
    ]

    def parse(self, response):
        """
        Parse callback.
        :param response: the response returned by the Downloader;
            it can be thought of as one web page
        """
        # Locate elements with XPath selectors, guided by the browser's Inspect tool
        for quote in response.xpath('//div[@class="quote"]'):
            # extract() returns all matches as a list;
            # extract_first() returns the first match as a string (None if there is no match)
            text = quote.xpath('span[@class="text"]/text()').extract_first()
            author = quote.xpath('span/small[@class="author"]/text()').extract_first()
            print(text)
            print(author)
            # Yield one dict per quote
            yield {
                'text': text,
                'author': author,
            }
        # Get the URL of the next page and hand it to the Engine as a new Request
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page is not None:
            # urljoin uses the URL of the current response as the base to build the absolute URL
            next_page = response.urljoin(next_page)
            print('next page is', next_page)
            # The Request's callback is self.parse, so the next page is handled the
            # same way (extract this page's items, then request the following page)
            yield scrapy.Request(next_page, callback=self.parse)


# With the spider written, we run the whole crawler through the scrapy command.
# Command format: scrapy runspider spider_file.py -o output.json
# -o output.json writes the values yielded by parse to the target file;
# JSON is the usual choice, but other formats are supported
# -o appends to the file if it already exists
import os
import sys
from scrapy.cmdline import execute

# The guard keeps this launcher from running again when runspider imports this file
if __name__ == '__main__':
    # os.path.abspath(__file__) is the full path of this script;
    # os.path.dirname(file_path) strips the file name and returns the directory.
    # sys.path is the list of directories Python searches for modules.
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    # Note that execute takes its arguments as a list
    execute(['scrapy', 'runspider', 'quotes_spider.py', '-o', 'output.json'])

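While processing the current page, the Spider must locate the element that links to the next page (the a element inside the li element with class "next"), then extract the next page's URL and hand it to the Engine as a new request.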
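The URL-joining step can be tried in isolation. For a plain page, response.urljoin resolves the href against the URL of the response in the way the standard library's urljoin does, so the sketch below uses urllib.parse as a stand-in. The href values are assumed examples of what the "next" link might contain.

```python
from urllib.parse import urljoin

# URL of the page that was just parsed (the base for resolving links)
current_page = "http://quotes.toscrape.com/tag/humor/"

# Assumed href values, as they might appear in the "next" link
rooted_href = "/tag/humor/page/2/"   # starts at the site root
relative_href = "page/2/"            # relative to the current page

next_from_rooted = urljoin(current_page, rooted_href)
next_from_relative = urljoin(current_page, relative_href)

print(next_from_rooted)    # http://quotes.toscrape.com/tag/humor/page/2/
print(next_from_relative)  # http://quotes.toscrape.com/tag/humor/page/2/
```

Both forms resolve to the same absolute URL here because the base URL ends with a slash.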

The resulting JSON file:

[
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen"},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"},
{"text": "\u201cAnyone who thinks sitting in church can make you a Christian must also think that sitting in a garage can make you a car.\u201d", "author": "Garrison Keillor"},
{"text": "\u201cBeauty is in the eye of the beholder and it may be necessary from time to time to give a stupid or misinformed beholder a black eye.\u201d", "author": "Jim Henson"},
{"text": "\u201cAll you need is love. But a little chocolate now and then doesn't hurt.\u201d", "author": "Charles M. Schulz"},
{"text": "\u201cRemember, we're madly in love, so it's all right to kiss me anytime you feel like it.\u201d", "author": "Suzanne Collins"},
{"text": "\u201cSome people never go crazy. What truly horrible lives they must lead.\u201d", "author": "Charles Bukowski"},
{"text": "\u201cThe trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.\u201d", "author": "Terry Pratchett"},
{"text": "\u201cThink left and think right and think low and think high. Oh, the thinks you can think up if only you try!\u201d", "author": "Dr. Seuss"},
{"text": "\u201cThe reason I talk to myself is because I\u2019m the only one whose answers I accept.\u201d", "author": "George Carlin"},
{"text": "\u201cI am free of all prejudice. I hate everyone equally. \u201d", "author": "W.C. Fields"},
{"text": "\u201cA lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.\u201d", "author": "Jane Austen"}
]
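The \u201c and \u201d sequences in this file are JSON escapes for the curly quotation marks, not corrupted text. A minimal sketch of reading such a record back with the standard json module follows; the sample string is one record copied from the output above.

```python
import json

# One record from output.json, kept as raw text so the \u escapes stay literal
raw = r'[{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin"}]'

records = json.loads(raw)          # the escapes are decoded into real characters
first = records[0]

print(first["author"])
print(first["text"])

# ensure_ascii=False writes the characters themselves instead of \u escapes
print(json.dumps(first, ensure_ascii=False))
```

If the escaped form is inconvenient in the exported file itself, Scrapy's FEED_EXPORT_ENCODING setting can be set to 'utf-8' in settings.py so that the JSON feed is written with the characters unescaped.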

The printed output:

“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Jane Austen
...
“The reason I talk to myself is because I’m the only one whose answers I accept.”
George Carlin
next page is http://quotes.toscrape.com/tag/humor/page/2/
“I am free of all prejudice. I hate everyone equally. ”
W.C. Fields
“A lady's imagination is very rapid; it jumps from admiration to love, from love to matrimony in a moment.”
Jane Austen

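Note that launching the command from a script, as above, is optional: the same command can be run directly from the command line, as long as you first cd into the correct working directory.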
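One thing to watch when the crawl is run more than once, from a script or from the command line: because -o appends, a second run adds a second JSON array to output.json, and the file stops being valid JSON. The sketch below reproduces the effect with the standard json module; the two tiny arrays are made-up stand-ins for the output of two runs.

```python
import json

# Made-up stand-ins for what two separate runs would each write
first_run = json.dumps([{"author": "Jane Austen"}])
second_run = json.dumps([{"author": "Steve Martin"}])

# Appending the second array to the first, as happens when the file already exists
appended = first_run + second_run


def is_valid_json(text):
    """Return True if the text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_valid_json(first_run))  # True
print(is_valid_json(appended))   # False
```

Deleting output.json before each run avoids the problem. Recent Scrapy versions also accept -O, which overwrites the file instead of appending, and an appendable format such as JSON Lines is another option.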

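Scraping cnblogs with Scrapy

Again we only modify the Spider, this time to scrape the first 10 pages of cnblogs. Unlike the quotes example, we do not follow the pages one after another; instead all 10 pages are listed as start pages and crawled together: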

import scrapy


class CnBlogSpider(scrapy.Spider):
    # Name of the spider
    name = "cnblogs"
    # allowed_domains filters the domains that may be crawled: requests to domains
    # outside this list are dropped. The pages in start_urls are never filtered;
    # the filter applies only to pages requested after them.
    allowed_domains = ["cnblogs.com"]
    # URLs to start crawling from; several start pages can be listed in start_urls.
    # Caution: "#p%s" is a URL fragment, which is not sent to the server, so these
    # ten URLs may all return the same first page; check the site's real pagination URL.
    start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)]

    def parse(self, response):
        # Locate elements with XPath selectors, guided by the browser's Inspect tool
        for article in response.xpath('//article[@class="post-item"]'):
            # extract() returns all matches as a list;
            # extract_first() returns the first match as a string
            title = article.xpath('section[@class="post-item-body"]/div/a/text()').extract_first().strip()
            # urljoin uses the URL of the current response as the base to build the absolute URL
            link = response.urljoin(article.xpath('section[@class="post-item-body"]/div/a/@href').extract_first()).strip()
            print(title)
            print(link)
            # Yield one dict per article
            yield {
                'title': title,
                'link': link,
            }


import os
import sys
from scrapy.cmdline import execute

# The guard keeps this launcher from running again when runspider imports this file
if __name__ == '__main__':
    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(['scrapy', 'runspider', 'cnblog_spider.py', '-o', 'output.json'])

The printed output:

写给程序员的机器学习入门 (九) - 对象识别 RCNN 与 Fast-RCNN
https://www.cnblogs.com/zkweb/p/14048685.html
...
协程到底是什么？看完这个故事明明白白！
https://www.cnblogs.com/xuanyuan/p/13824621.html
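One caveat about the start_urls rule used in this example: the page number is placed after a "#", which makes it a URL fragment. Fragments are handled by the browser and are not part of the HTTP request, so the server may receive the same URL ten times. The check below uses only the standard library. Whether the site exposes a separate URL for later pages is not stated here and would have to be found by inspecting the browser's network requests.

```python
from urllib.parse import urldefrag

# The same rule as in the spider above
start_urls = ['http://www.cnblogs.com/pick/#p%s' % p for p in range(1, 11)]

# Strip the fragment to see which URLs the server would actually be asked for
urls_sent = {urldefrag(url).url for url in start_urls}

print(len(start_urls))  # 10
print(len(urls_sent))   # 1
print(urls_sent)        # {'http://www.cnblogs.com/pick/'}
```

If the ten responses turn out to be identical, the rule needs to generate the URLs that the page itself requests when the user clicks a page number.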

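Creating a Scrapy Project

The examples so far are single Spider scripts, and they leave important settings unconfigured, such as request headers. Some websites detect and block a crawler that does not disguise itself, so we get back nothing at all, or even misleading content. To build a complete crawling system, we need to create a Scrapy project.

For example, to create a project named qqnews under TempStack: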

(env) ....\TempStack>scrapy startproject qqnews

The spider classes normally live in the spiders directory, and settings.py holds the crawler's configuration.
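As an illustration of the kind of configuration that goes into settings.py, here is a hedged fragment covering the request headers mentioned earlier. The setting names are standard Scrapy settings, but every value is an assumption chosen for the example and should be adapted to the target site.

```python
# settings.py (fragment): all values below are illustrative assumptions

# Browser-like User-Agent string sent with every request (example value)
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Headers added to every request unless an individual Request overrides them
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Respect robots.txt and pace the crawl so the site is not overloaded
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1.0                  # seconds to wait between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 4    # upper bound on parallel requests per domain
```

Because settings.py is an ordinary Python module, values can also be computed, for example by reading a proxy address from an environment variable.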

In practice, most of the effort in using Scrapy goes into writing correct XPath expressions or other selectors.
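Selector expressions can be practiced without running a crawl. The sketch below uses the standard library's ElementTree as a stand-in for Scrapy's selectors. It supports only a subset of XPath (there is no text() step, so the .text attribute is used instead) and it requires well-formed markup, which real HTML often is not. The markup is a made-up fragment shaped like the quotes page, and extract_quotes and next_page_href are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment shaped like the quotes page
PAGE = """
<html><body>
  <div class="quote">
    <span class="text">First quote</span>
    <span><small class="author">Author A</small></span>
  </div>
  <div class="quote">
    <span class="text">Second quote</span>
    <span><small class="author">Author B</small></span>
  </div>
  <ul><li class="next"><a href="/page/2/">Next</a></li></ul>
</body></html>
"""

root = ET.fromstring(PAGE)


def extract_quotes(tree):
    """Collect text and author from every div whose class is 'quote'."""
    quotes = []
    for quote in tree.findall(".//div[@class='quote']"):
        # Paths are relative to the current quote element, as in the spider
        text = quote.find("span[@class='text']").text
        author = quote.find("span/small[@class='author']").text
        quotes.append({"text": text, "author": author})
    return quotes


def next_page_href(tree):
    """Return the href of the 'next' link, or None on the last page."""
    link = tree.find(".//li[@class='next']/a")
    return None if link is None else link.get("href")


print(extract_quotes(root))
print(next_page_href(root))  # /page/2/
```

The structure mirrors the spider: one expression selects the repeating container, and shorter relative expressions pull individual fields out of each container.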

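Scrapy Components

The Spider Component

The component a developer works with most is the Spider, which has to be implemented for every crawl. A Spider is usually implemented along these lines: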

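Initialize the list of request URLs and specify the callback that processes each downloaded response.
In the parse function, parse the response and return dicts, Item objects, or Request objects.
In the callback, use selectors to parse the page content and generate the resulting Items.
Finally, the returned Items are usually persisted to a database (through an Item Pipeline) or saved to a file with Feed exports.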

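The callback logic can live in parse itself (parse is the default callback), or it can be implemented as a separate method.

When crawling, a Spider can discover URLs one after another by following links, or it can generate the initial URLs from a rule and crawl them all together.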

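The Item Component

In the two examples above, the Spider printed the elements or yielded them as plain dicts. Scrapy accepts dicts, but in a project the more disciplined approach is to declare an Item, which makes the expected fields explicit. Taking qqnews as the example, we rewrite items.py in the Scrapy project: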

import scrapy


class QqnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # title of the news article
    text = scrapy.Field()  # body of the news article

Then add the following to the spider script:

import scrapy
# TODO: newly added
from qqnews.items import QqnewsItem


class QQNewsSpider(scrapy.Spider):
    name = 'qqnews'
    start_urls = ['https://new.qq.com/ch2/hyrd']

    def parse(self, response):
        """
        Find the URL of each news article on the index page and request it;
        self.parse_question parses each article.
        """
        for href in response.xpath('/html/body/div/div/div/div/div/div/ul/li/a/@href'):
            full_url = response.urljoin(href.extract())
            print(full_url)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # TODO: newly added, write the scraped data into the item
        item = QqnewsItem()
        item['title'] = response.xpath('//div[@class="qq_article"]/div/h1/text()').extract_first()
        item['text'] = "\n".join(response.xpath('//div[@id="Cnt-Main-Article-QQ"]/p[@class="text"]/text()').extract())
        yield item

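The Item Pipeline Component

After an item has been scraped by a spider, it is sent to the Item Pipeline, where several components process it in sequence. Typical uses of an Item Pipeline are: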

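Cleaning HTML data;
Validating scraped data (checking that items contain certain fields);
Checking for duplicates (and dropping them);
Storing the scraped data in a database.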

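A pipeline component is a Python class that implements process_item(self, item, spider). The method returns a dict or an Item, or raises a DropItem exception to discard the item. Other commonly implemented methods are:

open_spider(self, spider): called when the spider is opened;
close_spider(self, spider): called when the spider is closed.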

For example, in pipelines.py we define:

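class QqnewsPipeline:
    def open_spider(self, spider):
        # Scrapy passes the spider object here, not a file name;
        # the output path below is an assumed placeholder
        self.out = open('qqnews_items.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Scrapy also passes the spider when closing
        self.out.close()

    def process_item(self, item, spider):
        return item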
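The pipeline above opens a file but never writes to it. A slightly fuller sketch, which stores each item as one JSON object per line, could look like the following. JsonLinesPipeline and the output file name are hypothetical; only the three method names and their signatures come from Scrapy's pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline: write every item as one JSON object per line."""

    # Assumed output file name; adjust to the project
    output_path = "qqnews_items.jl"

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.out = open(self.output_path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file
        self.out.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects alike
        line = json.dumps(dict(item), ensure_ascii=False)
        self.out.write(line + "\n")
        # Return the item so that later pipelines still receive it
        return item
```

Like any pipeline, the class only runs once it is registered in ITEM_PIPELINES in settings.py. One object per line also means that repeated runs can append to the file without invalidating it.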

Using QqnewsPipeline as the example Item Pipeline, we can see what settings.py is for: uncomment the default ITEM_PIPELINES block in settings.py and point it at QqnewsPipeline:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'qqnews.pipelines.QqnewsPipeline': 300,
}

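This registers QqnewsPipeline as a pipeline of the project. The integer 300 assigned to the class determines the order in which pipelines run: items pass through the classes from lower values to higher values, and the numbers are customarily kept in the 0-1000 range.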
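The effect of these integers is easiest to see with more than one pipeline registered. In the sketch below, DuplicatesPipeline and JsonLinesPipeline are hypothetical class paths added only for illustration, and the sorted call merely mimics the ordering rule; Scrapy does the ordering itself when it loads the setting.

```python
# Hypothetical registration of three pipelines; only QqnewsPipeline exists in the project above
ITEM_PIPELINES = {
    "qqnews.pipelines.JsonLinesPipeline": 800,
    "qqnews.pipelines.QqnewsPipeline": 300,
    "qqnews.pipelines.DuplicatesPipeline": 500,
}

# Items pass through the pipelines in ascending order of the assigned value,
# regardless of the order in which the entries are written in the dict
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in run_order:
    print(ITEM_PIPELINES[path], path)
# 300 qqnews.pipelines.QqnewsPipeline
# 500 qqnews.pipelines.DuplicatesPipeline
# 800 qqnews.pipelines.JsonLinesPipeline
```

Leaving gaps between the numbers makes it easy to slot a new pipeline between two existing ones later.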

Through settings.py we can make a Scrapy project considerably more flexible.
