A simple Scrapy crawler for downloading a novel

1. Introduction to Scrapy

Scrapy project structure

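items.py: defines the models (items) that hold the data the spider scrapes
middlewares.py: holds the project's middlewares
pipelines.py: processes the scraped items, here by saving them to local disk
settings.py: configuration for this crawler (request headers, delay between requests, proxy IP pool, and so on)
scrapy.cfg: the project configuration file
spiders package: every spider you write goes here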

2. Goal

Crawl the novel that starts at https://www.xyyuedu.com/wgmz/dongyeguiwu/mimi/284492.html
and save it as a JSON file.

3. Code

First, open the page in a browser and create a Python file named mimi.py. The start URL is https://www.xyyuedu.com/wgmz/dongyeguiwu/mimi/284492.html:

    start_urls = ['https://www.xyyuedu.com/wgmz/dongyeguiwu/mimi/284492.html']

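Open the browser's developer tools and set mimidiv to the XPath used in the full listing further down. Then loop over it and locate the XPaths for the chapter title, the chapter content, and the "next chapter" link.
Next, go to the last chapter. There the "next chapter" link has no href, so the following code decides whether to issue another request with the same callback.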

        if not next_url:
            return
        else:
            yield scrapy.Request(next_url,callback=self.parse)
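One detail worth spelling out: on the last chapter the extracted href is None, so the check has to be made on the raw href before any base URL is attached to it. Below is a minimal sketch of that idea using only the standard library's urllib.parse.urljoin. The helper name resolve_next and the chapter path 284493.html are hypothetical and only serve as an illustration.

```python
from urllib.parse import urljoin

# Start URL from this article.
PAGE_URL = "https://www.xyyuedu.com/wgmz/dongyeguiwu/mimi/284492.html"


def resolve_next(page_url, href):
    """Return the absolute URL of the next chapter, or None on the last one.

    resolve_next is a hypothetical helper, not part of Scrapy.
    """
    if not href:  # None or empty string: no further chapter
        return None
    # urljoin handles both site-relative and page-relative hrefs.
    return urljoin(page_url, href)


# The href below is an assumed example, not taken from the real site.
print(resolve_next(PAGE_URL, "/wgmz/dongyeguiwu/mimi/284493.html"))
print(resolve_next(PAGE_URL, None))
```

Inside a spider, response.urljoin(href) does the same job, using the URL of the current response as the base.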

The complete mimi.py therefore looks like this:

# -*- coding: utf-8 -*-
import scrapy

from mimipassage.items import MimipassageItem


class MimiSpider(scrapy.Spider):
    name = 'mimi'
    allowed_domains = ['www.xyyuedu.com']
    start_urls = ['https://www.xyyuedu.com/wgmz/dongyeguiwu/mimi/284492.html']

    def parse(self, response):
        mimidiv = response.xpath("//div[@class='main-wrap']")
        for mimi in mimidiv:
            # Chapter title, relative to the current main-wrap div
            title = mimi.xpath("./div[@id='arcxs_title']/h1//text()").get(default="").strip()
            # Chapter body: join every text node inside the content div
            content = mimi.xpath(".//div[@class='onearcxsbd']//text()").getall()
            content = "".join(content).strip()
            item = MimipassageItem(title=title, content=content)
            yield item

        # Link to the next chapter; on the last chapter there is no href
        next_url1 = response.xpath(".//div[@class='mzpage']/b[2]/a[@class='prevPage'][last()]/@href").get()
        if not next_url1:
            return
        # Check the raw href first, then turn it into an absolute URL
        next_url = response.urljoin(next_url1)
        yield scrapy.Request(next_url, callback=self.parse)
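The selector logic can be tried offline before running a real crawl. The sketch below is a stand-in, not Scrapy itself: it uses the standard library's xml.etree.ElementTree, which supports only a small subset of XPath and needs well-formed markup. The HTML fragment is a hypothetical, simplified guess at the page layout that reuses the class and id names from the spider; the real page will differ, and the hrefs are made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the class/id names come from the spider.
SAMPLE = """
<html><body>
<div class="main-wrap">
  <div id="arcxs_title"><h1> Chapter 1 </h1></div>
  <div class="onearcxsbd"><p>First paragraph.</p><p>Second paragraph.</p></div>
  <div class="mzpage">
    <b><a class="prevPage" href="/wgmz/prev.html">Previous</a></b>
    <b><a class="prevPage" href="/wgmz/next.html">Next</a></b>
  </div>
</div>
</body></html>
"""


def extract(fragment):
    """Return (title, content, next_href) from a well-formed fragment."""
    root = ET.fromstring(fragment.strip())
    wrap = root.find(".//div[@class='main-wrap']")
    # itertext() walks every text node, similar to //text() in full XPath.
    title = "".join(wrap.find("./div[@id='arcxs_title']/h1").itertext()).strip()
    content = "".join(wrap.find(".//div[@class='onearcxsbd']").itertext()).strip()
    link = wrap.find(".//div[@class='mzpage']/b[2]/a[@class='prevPage']")
    # A link without an href attribute yields None, as on the last chapter.
    next_href = link.get("href") if link is not None else None
    return title, content, next_href


print(extract(SAMPLE))
```

For real pages, which are rarely well-formed XML, Scrapy's own selectors (or the interactive scrapy shell) are the better tool; this sketch only shows what each expression is meant to pick out.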

Next, set up items.py. Since we only scrape the title and the content, two fields are enough:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MimipassageItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()

In pipelines.py, we save the items in JSON format:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonItemExporter


class MimipassagePipeline:
    def __init__(self):
        self.fp = open("mimipp.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding="utf-8")
        # Writes the opening "[" of the JSON array
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # Writes the closing "]" and releases the file
        self.exporter.finish_exporting()
        self.fp.close()
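To make the pipeline lifecycle easier to see, here is a minimal sketch of an alternative pipeline that uses only the standard library's json module and writes one JSON object per line (JSON Lines) instead of a single JSON array. This is a swapped-in technique, not the approach used in this article: the class name JsonLinesPipeline and the file name mimipp.jsonl are hypothetical. The three method names are the hooks Scrapy calls on a pipeline.

```python
import json


class JsonLinesPipeline:
    """Hypothetical alternative pipeline: one JSON object per line."""

    def __init__(self, path="mimipp.jsonl"):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.fp.close()
```

Because every line is a complete JSON document, a crawl that is interrupted halfway still leaves a readable file, which is the main reason to consider this format.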

A few settings need to be changed in settings.py:

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
}

ITEM_PIPELINES = {
    'mimipassage.pipelines.MimipassagePipeline': 300,
}

Finally, you can either run the spider from the command line:

scrapy crawl mimi

or create a start.py file and run that instead:

from scrapy import cmdline
cmdline.execute("scrapy crawl mimi".split(" "))
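Once the crawl has finished, the output can be sanity-checked with a few lines of standard-library Python. This is a minimal sketch that assumes mimipp.json holds a valid JSON array of objects with title and content keys; the function name list_titles is hypothetical.

```python
import json
import os


def list_titles(path="mimipp.json"):
    """Return the chapter titles stored in the exported JSON file."""
    with open(path, encoding="utf-8") as fp:
        chapters = json.load(fp)  # expects a JSON array of objects
    return [chapter["title"] for chapter in chapters]


if __name__ == "__main__":
    # Only print something if a crawl has already produced the file.
    if os.path.exists("mimipp.json"):
        for number, title in enumerate(list_titles(), start=1):
            print(number, title)
```

If json.load raises an error here, the file is not a complete JSON array, which usually means the export was never properly started or finished.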

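That wraps up this simple example of crawling a novel with the Scrapy framework. It surely contains mistakes and shortcomings, since I am still learning and have not yet mastered the subject. Discussion on equal terms is very welcome; abusive comments are not.
Thank you very much for stopping by. If you liked it, I'm glad. Thanks!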