Target site: Beijing Alice Gynecology Hospital (http://fuke.fuke120.com/)

First, let's go over setting up Splash.

1. Install the scrapy-splash library with pip

pip install scrapy-splash

2. Next we need another handy tool: Docker

Docker download: https://www.docker.com/community-edition#/windows

3. After installing Docker, start it and pull the Splash image

docker pull scrapinghub/splash

4. Run Splash with Docker

docker run -p 8050:8050 scrapinghub/splash

(Once the container is running, open http://192.168.99.100:8050 in your browser to check that Splash is responding; a small programmatic check is sketched below.)
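If you prefer to verify from code rather than the browser, here is a minimal sketch (not part of the original post) that asks Splash's render.html endpoint to render the target site. It assumes the requests library is installed and that Splash is reachable at http://192.168.99.100:8050; adjust the host if your Docker setup exposes it elsewhere.

import requests

# A 200 response containing HTML means the Splash container is up and rendering.
splash = "http://192.168.99.100:8050"   # assumption: Docker Toolbox default address
resp = requests.get(
    splash + "/render.html",
    params={"url": "http://fuke.fuke120.com/", "wait": 0.5},
    timeout=30,
)
print(resp.status_code)    # expect 200
print(resp.text[:200])     # beginning of the rendered HTML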

5. Configure settings.py

SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

SPLASH_URL is the most important setting and a big pitfall: the IP really is 192.168.99.100 (the Docker Toolbox VM's address), not your own machine's IP. I kept using my own IP and the spider never ran successfully.

ROBOTSTXT_OBEY = True  (note: some sites work fine with True, while for others you have to change it to False)
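Because the SPLASH_URL value is such an easy thing to get wrong, here is an optional sketch (again assuming the requests library; the two candidate addresses are simply the most common setups, Docker Toolbox and native Docker) that probes both and prints the one where Splash actually answers:

import requests

# 192.168.99.100 is the Docker Toolbox VM default; 127.0.0.1 applies to native Docker installs.
for base in ("http://192.168.99.100:8050", "http://127.0.0.1:8050"):
    try:
        if requests.get(base, timeout=3).status_code == 200:
            print("Splash answers at", base, "- use this value for SPLASH_URL")
            break
    except requests.RequestException:
        print("no Splash at", base)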

Now the spider files. First, 1.py:

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from scrapy.http import Request
# from urllib.request import Request
from bs4 import BeautifulSoup
from lxml import etree
import pymongo
import scrapy
from scrapy.selector import HtmlXPathSelector
import redis  # Redis client used to pass URLs on to the next spider

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Health
collection = db.Healthclass  # collection (table) name

r = redis.Redis(host='127.0.0.1', port=6379, db=0)
ii = 0


class healthcareClassSpider(scrapy.Spider):
    name = "HealthCare"
    allowed_domains = ["fuke120.com"]  # domains the spider may crawl
    start_urls = ["http://fuke.fuke120.com/"]

    # parse() is called back for every page that has been crawled
    def parse(self, response):
        global ii
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@id="allsort"]/div[@class="item"]/span/a')
        hx1 = hxs.select('//div[@id="allsort"]/div[@class="item born"]/span/a')
        # hx2 = hxs.select('//div[@id="allsort"]/div[@class="item"]/div[@class="i-mc"]/div[@class="i-mc01"]/ul[@class="w_ul01"]/li/a')
        for secItem in hx:
            ii += 1
            url = secItem.select("@href").extract()
            c = "http://fuke.fuke120.com" + url[0]
            name = secItem.select("text()").extract()
            print(c)
            print(name)
            classid = collection.insert({'healthclass': name, 'pid': None})
            healthurl = '%s,%s,%s' % (classid, c, ii)
            r.lpush('healthclassurl', healthurl)
        for secItem1 in hx1:
            url = secItem1.select("@href").extract()
            c1 = "http://fuke.fuke120.com" + url[0]
            name1 = secItem1.select("text()").extract()
            print(c1)
            print(name1)
            classid = collection.insert({'healthclass': name1, 'pid': None})
            healthurl = '%s,%s,%s' % (classid, c1, 0)
            r.lpush('healthclassurl', healthurl)
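1.py does not schedule any follow-up requests itself; it only stores the top-level categories in MongoDB and pushes "id,url,num" strings onto the Redis list healthclassurl, which 2.py reads in its __init__. A quick way to confirm the handoff worked, sketched here under the assumption that Redis runs locally on the default port:

import redis

# Print every entry 1.py pushed onto the healthclassurl queue.
r = redis.Redis(host='127.0.0.1', port=6379, db=0)
for item in r.lrange('healthclassurl', 0, -1):
    classid, url, num = item.decode().split(',')   # same "id,url,num" format the spider uses
    print(classid, url, num)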

Next, 2.py:

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
from lxml import etree
import pymongo
import scrapy
from scrapy.selector import HtmlXPathSelector
from bson.objectid import ObjectId
# from scrapy.http import Request
# from urllib.request import urlopen
from scrapy.http import Request
# from hello.items import ZhaopinItem
# from scrapy.spiders import CrawlSpider, Rule
# from scrapy.linkextractors import LinkExtractor
from urllib.request import Request,ProxyHandler
from urllib.request import build_opener
client = pymongo.MongoClient(host="127.0.0.1")
db = client.Health  # database name
collection = db.Diseaseclass  # collection (table) name

import redis  # Redis client

r = redis.Redis(host='192.168.60.112', port=6379, db=0, charset='utf-8')


class healthcareClassSpider(scrapy.Spider):
    name = "HealthCare1"
    allowed_domains = ["fuke120.com"]  # domains the spider may crawl
    dict = {}
    start_urls = []

    def __init__(self):
        # Load the category URLs that 1.py pushed into Redis and remember each URL's parent id.
        a = r.lrange('healthclassurl', 0, -1)
        for item in a:
            healthurl = bytes.decode(item)
            arr = healthurl.split(',')
            healthcareClassSpider.start_urls.append(arr[1])
            num = arr[2]
            pid = arr[0]
            url = arr[1]
            self.dict[url] = {"pid": pid, "num": num}

    def parse(self, response):
        nameInfo = self.dict[response.url]
        pid1 = nameInfo['pid']
        pid = ObjectId(pid1)
        num = nameInfo['num']
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="x_con02_2"]/div[@class="x_con02_3"]/ul/li/p/a')
        for secItem in hx:
            url = secItem.select("@href").extract()
            url = "http://fuke.fuke120.com" + url[0]
            name = secItem.select("text()").extract()
            print(url)
            print(name)
            classid = collection.insert({'Diseaseclass': name, 'pid': pid})
            diseaseclassurl = '%s,%s,%s' % (classid, url, pid)
            r.lpush('diseaseclassurl', diseaseclassurl)
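The one subtle step in 2.py is that the parent id comes back out of Redis as a plain string, so it has to be wrapped in ObjectId again before being stored as pid; otherwise the child documents would reference a string rather than the parent's _id. A tiny round-trip sketch, using the bson package that ships with pymongo:

from bson.objectid import ObjectId

classid = ObjectId()        # the kind of value collection.insert() returns
as_text = str(classid)      # what ends up inside the Redis "id,url,num" string
restored = ObjectId(as_text)
print(classid == restored)  # True - the parent reference survives the round trip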

Finally, 3.py:

# -*- coding: utf-8 -*-
import re
from urllib.request import urlopen
from urllib.request import Request
from bs4 import BeautifulSoup
from lxml import etree
import pymongo
import scrapy
from scrapy_splash import SplashMiddleware
from scrapy.http import Request, HtmlResponse
from scrapy_splash import SplashRequest
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from bson.objectid import ObjectId
# from diseaseHealth.diseaseHealth.spiders.SpiderJsDynamic import phantomjs1
# from scrapy.http import Request
# from urllib.request import urlopen
from scrapy.http import Request

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Health  # database name
collection = db.Treatclass  # collection (table) name

import redis  # Redis client

r = redis.Redis(host='192.168.60.112', port=6379, db=0, charset='utf-8')


class healthcareClassSpider(scrapy.Spider):
    name = "HealthCare2"
    allowed_domains = ["fuke120.com"]  # domains the spider may crawl
    dict = {}
    start_urls = []

    def __init__(self):
        # Load the disease-category URLs that 2.py pushed into Redis.
        a = r.lrange('diseaseclassurl', 0, -1)
        for item in a:
            healthurl = bytes.decode(item)
            arr = healthurl.split(',')
            healthcareClassSpider.start_urls.append(arr[1])
            num = arr[2]
            pid = arr[0]
            url = arr[1]
            self.dict[url] = {"pid": pid, "num": num}

    def start_requests(self):
        # Route every request through Splash so JavaScript-rendered content is returned.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # a = response.body.decode('utf-8')
        # print(a)
        nameInfo = self.dict[response.url]
        pid1 = nameInfo['pid']
        pid = ObjectId(pid1)
        num = nameInfo['num']
        print(num)
        print(pid)
        hxs = HtmlXPathSelector(response)
        hx = hxs.select('//div[@class="dh01"]/ul[@class="ul_bg01"]/li/a')
        for secItem in hx:
            url = secItem.select("@href").extract()
            c = "http://fuke.fuke120.com" + url[0]
            name = secItem.select("text()").extract()
            print(c)
            print(name)
            classid = collection.insert({'Treatclass': name, 'pid': pid})
            treatclassurl = '%s,%s,%s' % (classid, c, pid)
            r.lpush('treatclassurl', treatclassurl)
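Once all three spiders have run, the three collections form a small tree linked by pid. The sketch below (my own addition, assuming MongoDB on 127.0.0.1 and non-empty collections) walks one branch to confirm the hierarchy Healthclass -> Diseaseclass -> Treatclass:

import pymongo

client = pymongo.MongoClient(host="127.0.0.1")
db = client.Health

# Follow the pid links: top-level category -> disease category -> treatment page.
top = db.Healthclass.find_one({'pid': None})
print('top level:', top['healthclass'])
for disease in db.Diseaseclass.find({'pid': top['_id']}):
    print('  disease:', disease['Diseaseclass'])
    for treat in db.Treatclass.find({'pid': disease['_id']}):
        print('    treat:', treat['Treatclass'])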

And that's it. The main point of the exercise was simply to put scrapy-splash to use.

 

Reposted from: https://www.cnblogs.com/wangyuhangboke/p/8025067.html
