Scraping pexels.com with custom Scrapy pipelines

I followed the official documentation: https://scrapy-chs.readthedocs.io/zh_CN/latest/topics/images.html

The text-processing pipeline is based on this blog post: https://blog.csdn.net/killeri/article/details/80228089

items.py:

import scrapy


class XicidailispiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
    photographer = scrapy.Field()

The spider file:

import scrapy

# The import path assumes the project package is xicidailiSpider, as in settings.py
from xicidailiSpider.items import XicidailispiderItem


class PexelcrawlSpider(scrapy.Spider):
    name = 'pexelCrawl'
    allowed_domains = ['pexels.com']
    start_urls = ['https://www.pexels.com/']

    def parse(self, response):
        selectors = response.xpath('//div[@class="hide-featured-badge hide-favorite-badge"]//article')
        imageurls = []
        photographers = []
        for selector in selectors:
            imageurls.append(selector.xpath('./a[1]/img/@src').get())
            photographers.append(selector.xpath('./a[2]/span/text()').get())
        itemdict = XicidailispiderItem()
        itemdict['image_urls'] = imageurls  # the URLs must be stored as a list
        itemdict['photographer'] = photographers
        return itemdict

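The pipelines file contains two pipelines, one for images and one for text. The image pipeline subclasses ImagesPipeline and overrides two methods, get_media_requests and item_completed. Once the downloads finish, it collects the storage paths of the images and saves them in item['image_paths'].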

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem


class XicidailispiderPipeline(ImagesPipeline):
    # Take the image URLs and issue a download request for each one
    def get_media_requests(self, item, info):
        for imageurl in item['image_urls']:
            yield scrapy.Request(imageurl)

    # The pipeline processes these requests. Once the downloads finish, the results
    # are passed to item_completed as a list of 2-element tuples:
    # [(True, {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
    #          'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
    #          'url': 'http://www.example.com/images/product1.jpg'}),
    #  (True, {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
    #          'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
    #          'url': 'http://www.example.com/images/product2.jpg'}),
    #  (False, Failure(...))]
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        # for ok, x in results:
        #     if ok:
        #         image_paths.append(x['path'])
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item


class textPipeline(object):
    # Pipeline that handles the text
    def __init__(self):
        self.file = open('pictureinfo.csv', 'wb')

    # Store the text in the file
    def process_item(self, item, spider):
        self.file.write(bytes(str(item), encoding='utf-8'))
        return item

    # Close the file when the spider finishes (added so the handle is not left open)
    def close_spider(self, spider):
        self.file.close()
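To make the shape of the results argument concrete, here is a minimal sketch that applies the same list comprehension as item_completed to a hand-built list. It runs without Scrapy. FakeFailure is a hypothetical placeholder for the Failure object that arrives for a failed download, and the dictionaries reuse the example values from the comment in the pipeline.

```python
class FakeFailure:
    """Hypothetical placeholder for the Failure object of a failed download."""


# Same structure as the list handed to item_completed: (success, info_or_failure)
results = [
    (True, {'checksum': '2b00042f7481c7b056c4b410d28f33cf',
            'path': 'full/7d97e98f8af710c7e7fe703abc8f639e0ee507c4.jpg',
            'url': 'http://www.example.com/images/product1.jpg'}),
    (True, {'checksum': 'b9628c4ab9b595f72f280b90c4fd093d',
            'path': 'full/1ca5879492b8fd606df1964ea3c1e2f4520f076f.jpg',
            'url': 'http://www.example.com/images/product2.jpg'}),
    (False, FakeFailure()),
]

# Keep the storage path of every successful download and skip the failures
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)
```

The failed entry is skipped because its first element is False, so only two paths remain in the list.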

Enable both pipelines in settings.py:

ITEM_PIPELINES = {
    'xicidailiSpider.pipelines.XicidailispiderPipeline': 1,
    'xicidailiSpider.pipelines.textPipeline': 300,
}
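Two details of this configuration are worth spelling out. Pipelines run in ascending order of their number, so the image pipeline (1) handles an item before the text pipeline (300). ImagesPipeline also needs the IMAGES_STORE setting to know where to save files. The post does not show that line, so the directory name below is an assumption, not the author's value.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    'xicidailiSpider.pipelines.XicidailispiderPipeline': 1,
    'xicidailiSpider.pipelines.textPipeline': 300,
}

# Assumed directory name; any writable path should work
IMAGES_STORE = 'downloaded_images'

# Not needed by Scrapy itself: this only shows the resulting execution order
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```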

Running it hit a problem:

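The spider should have downloaded 60 images but only downloaded 40. The image_urls and photographer lists each held 61 values, and the 41st value was None. This means that when the pipeline turned the URLs into requests, it stopped at the 41st one with an error (the URL was empty). As a result image_paths was never assigned, and the later step that writes the CSV file never ran.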
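The failure mode can be reproduced without Scrapy. In the sketch below, fake_request is a hypothetical stand-in for scrapy.Request and assumes only that the real class rejects a URL that is not a string. The URLs are made up. The generator has the same shape as get_media_requests.

```python
def fake_request(url):
    """Hypothetical stand-in for scrapy.Request: rejects non-string URLs."""
    if not isinstance(url, str):
        raise TypeError(f"Request url must be str, got {type(url).__name__}")
    return {'url': url}


def get_media_requests(image_urls):
    # Same shape as the pipeline method: one request per URL, lazily
    for url in image_urls:
        yield fake_request(url)


# 61 entries, with None in the 41st position (index 40), as in the failed run
image_urls = [f'https://example.com/{i}.jpg' for i in range(40)]
image_urls.append(None)
image_urls += [f'https://example.com/{i}.jpg' for i in range(41, 61)]

scheduled = []
error = None
try:
    for request in get_media_requests(image_urls):
        scheduled.append(request)
except TypeError as exc:
    error = exc  # the generator dies here; the remaining 20 URLs are never reached

print(len(image_urls), len(scheduled), error)
```

Because the generator is consumed lazily, the 40 requests before the None are already handed over when the exception is raised, which matches the 40 downloaded images.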

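On the page, //article matches 61 elements while //article/a[1] matches only 60. One article was probably preloaded but not yet displayed, so there was no image src or other data to extract from it. The actual front end loads new images automatically when you scroll to the bottom.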
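The same mismatch can be shown on toy markup. The sketch below uses the standard library's xml.etree.ElementTree rather than Scrapy's selectors, and the markup is an invented stand-in for the real page: three article elements, the last one an empty placeholder.

```python
import xml.etree.ElementTree as ET

# Invented markup, not the real pexels.com page
doc = """
<div>
  <article><a><img src="https://example.com/1.jpg"/></a><a><span>Ann</span></a></article>
  <article><a><img src="https://example.com/2.jpg"/></a><a><span>Bob</span></a></article>
  <article></article>
</div>
"""
root = ET.fromstring(doc)

articles = root.findall('.//article')          # every article, including the placeholder
first_links = root.findall('.//article/a[1]')  # only articles that have a first link

# Extracting per article, as the spider does, yields None for the placeholder
srcs = []
for article in articles:
    img = article.find('a[1]/img')
    srcs.append(img.get('src') if img is not None else None)

print(len(articles), len(first_links), srcs)
```

Counting the articles and counting the links separately is a quick way to spot placeholder elements before they reach the pipeline.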

Solution:

        for selector in selectors:
            thisurl = selector.xpath('./a[1]/img/@src').get()
            if thisurl is not None:
                imageurls.append(thisurl)
            thisauthor = selector.xpath('./a[2]/span/text()').get()
            if thisauthor is not None:
                photographers.append(thisauthor)

Adding a non-empty check while building the lists keeps None values out of the item.
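One thing to watch with this fix: the two lists are filtered independently, so if an article ever had an image but no photographer name, the lists would fall out of step. A possible variant, sketched here on plain tuples rather than real selectors, filters on the URL only and keeps the pair together. The function name collect_pairs and the sample data are hypothetical.

```python
def collect_pairs(rows):
    """Keep (url, photographer) pairs aligned, dropping rows without a URL."""
    image_urls, photographers = [], []
    for url, name in rows:
        if url is None:
            continue  # placeholder article: skip the whole row
        image_urls.append(url)
        # Keep an empty string so both lists stay the same length
        photographers.append(name if name is not None else '')
    return image_urls, photographers


rows = [
    ('https://example.com/1.jpg', 'Ann'),
    (None, None),                       # the preloaded but empty article
    ('https://example.com/2.jpg', None),
]
image_urls, photographers = collect_pairs(rows)
print(image_urls, photographers)
```

With this shape, the photographer at index i always belongs to the image at index i.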
