Scrapy: Crawling Huaban Images with PhantomJS

Create a new Scrapy project

(python35) ubuntu@ubuntu:~/scrapy_project$ scrapy startproject huaban

Add a spider

(python35) ubuntu@ubuntu:~/scrapy_project/huaban/huaban/spiders$ scrapy genspider huaban_pets huaban.com

The directory structure is as follows:

(python35) ubuntu@ubuntu:~/scrapy_project/huaban$ tree -I *.pyc
.
├── huaban
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   ├── settings.py
│   └── spiders
│       ├── huaban_pets.py
│       ├── __init__.py
│       └── __pycache__
└── scrapy.cfg

Edit items.py

# -*- coding: utf-8 -*-
import scrapy


class HuabanItem(scrapy.Item):
    # URL of the full-size image
    img_url = scrapy.Field()

Edit huaban_pets.py

# -*- coding: utf-8 -*-
import scrapy

from huaban.items import HuabanItem


class HuabanPetsSpider(scrapy.Spider):
    name = 'huaban_pets'
    allowed_domains = ['huaban.com']
    start_urls = ['http://huaban.com/favorite/pets/']

    def parse(self, response):
        for img_src in response.xpath('//*[@id="waterfall"]/div/a/img/@src').extract():
            item = HuabanItem()
            # For example, img_src is //img.hb.aicdn.com/223816b7fee96e892d20932931b15f4c2f8d19b315735-wgi1w2_fw236
            # Dropping the trailing _fw236 gives the original image
            item['img_url'] = 'http:' + img_src[:-6]
            yield item
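The slice img_src[:-6] in the spider only works when the suffix is exactly six characters long, as in _fw236. A minimal sketch of a slightly more tolerant version follows. The helper name original_image_url is hypothetical, and the idea that other thumbnail sizes also end in _fw followed by digits is an assumption, not something the site documents.

```python
import re


def original_image_url(img_src, scheme='http'):
    """Turn a protocol-relative thumbnail src into a full image URL.

    Hypothetical helper: assumes thumbnails end in _fw<digits>.
    """
    # Drop a trailing size suffix such as _fw236, whatever the number is;
    # a src without such a suffix is left unchanged
    without_suffix = re.sub(r'_fw\d+$', '', img_src)
    # The src starts with // so only the scheme and a colon are prepended
    return scheme + ':' + without_suffix
```

In the spider, this could replace the slice as item['img_url'] = original_image_url(img_src).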

Write a middleware that uses PhantomJS to fetch the page source

Add the following to middlewares.py:

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


class JSPageMiddleware(object):

    def process_request(self, request, spider):
        if spider.name == 'huaban_pets':
            # cap[".page.setting.resourceTimeout"] = 180
            # cap["chrome.page.setting.loadImage"] = False
            dcap = dict(DesiredCapabilities.PHANTOMJS)
            # Skip loading images; pages are fetched much faster this way
            dcap["phantomjs.page.settings.loadImages"] = False
            browser = webdriver.PhantomJS(
                executable_path=r'/home/ubuntu/scrapy_project/huabanphantomjs',
                desired_capabilities=dcap)
            try:
                browser.get(request.url)
                # Returning a response here makes Scrapy skip its own download
                return HtmlResponse(url=browser.current_url, body=browser.page_source,
                                    encoding='utf-8', request=request)
            except Exception:
                print("get page failed!")
            finally:
                browser.quit()
        else:
            # Let Scrapy handle requests from other spiders as usual
            return None
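The load-then-always-quit pattern inside process_request can be tried out without a real browser. The sketch below is only an illustration: FakeBrowser and render_page are hypothetical names, and FakeBrowser imitates just the four members of a Selenium driver that the middleware touches (get, current_url, page_source, quit).

```python
class FakeBrowser:
    """Stand-in for a Selenium driver (hypothetical, for illustration only)."""

    def __init__(self, html):
        self._html = html
        self.current_url = None
        self.page_source = None
        self.closed = False

    def get(self, url):
        # A real driver would load and render the page here
        self.current_url = url
        self.page_source = self._html

    def quit(self):
        self.closed = True


def render_page(browser, url):
    """Load url and return (final_url, html), or None if loading fails.

    The browser is closed in every case, mirroring the try/finally
    block in the middleware.
    """
    try:
        browser.get(url)
        return browser.current_url, browser.page_source
    except Exception:
        return None
    finally:
        browser.quit()
```

Because quit() sits in the finally clause, the browser process is released whether the page loads or not, which matters when a new PhantomJS instance is started for every request.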

Add the following to pipelines.py to download the images:

# -*- coding: utf-8 -*-
import urllib.request


class HuabanmeinvPipeline(object):

    def process_item(self, item, spider):
        url = item['img_url']
        # Name the file after the last path segment of the URL
        urllib.request.urlretrieve(
            url,
            filename=r'/home/ubuntu/scrapy_project/huaban/image/%s.jpg' % url[url.rfind('/')+1:])
        return item
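urlretrieve does not create missing directories, so the pipeline fails if the image directory does not exist yet. A minimal sketch of a path helper that takes care of this is shown below; image_path is a hypothetical name, and the naming rule is the same as in the pipeline (last URL segment plus .jpg).

```python
import os


def image_path(url, directory):
    """Build the local file path for an image URL (hypothetical helper)."""
    # Create the target directory if it is missing; urlretrieve would
    # otherwise fail when it tries to open the output file
    os.makedirs(directory, exist_ok=True)
    # Same naming rule as the pipeline: last URL path segment plus .jpg
    name = url[url.rfind('/') + 1:]
    return os.path.join(directory, name + '.jpg')
```

In process_item, the result could be passed as the filename argument, for example filename=image_path(url, '/home/ubuntu/scrapy_project/huaban/image').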

Enable the new middleware and set the request headers in settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN',
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # 'huaban.middlewares.MyCustomDownloaderMiddleware': 543,
    'huaban.middlewares.JSPageMiddleware': 543,
}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'huaban.pipelines.HuabanmeinvPipeline': 300,
}

Start crawling

ubuntu@ubuntu:~/scrapy_project/huaban/huaban/spiders$ scrapy runspider huaban_pets.py

Once the crawl finishes, the downloaded images can be found in the /home/ubuntu/scrapy_project/huaban/image directory.
