I. Project structure
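Judging from the modules listed below, the project follows the standard layout generated by scrapy startproject zufang, plus a main.py launcher at the top level (the exact file names for the spider and the launcher are inferred, not confirmed by the source):

zufang/
├── scrapy.cfg
├── main.py
└── zufang/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── dankespider.py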

II. Module breakdown

1. dankespider

import scrapy

from ..items import ZufangItem

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}


class Danke_Spider(scrapy.Spider):
    name = 'danke'

    def start_requests(self):
        url = 'https://www.danke.com/room/sh'
        yield scrapy.Request(url=url, headers=headers)

    def parse(self, response):
        # three parallel lists: listing title, price, and description
        house = response.xpath('//div[@class="r_lbx_cena"]/a/text()').getall()
        money = response.xpath('//div[@class="r_lbx_moneya"]/span/text()').getall()
        status = response.xpath('//div[@class="r_lbx_cenb"]').xpath('string(.)').getall()

        for i in range(len(house)):
            info = ZufangItem()
            # clean the data: strip the padding spaces
            house[i] = house[i].replace(' ', '')
            status[i] = status[i].replace(' ', '')
            info['house'] = house[i]
            info['money'] = money[i]
            info['status'] = status[i]
            yield info

        # pagination: the last <a> in the pager block is the "next page" link;
        # getall() returns a list, so test for emptiness rather than None
        next_links = response.xpath('/html/body/div[3]/div/div[6]/div[3]/a/@href').getall()
        if next_links:
            next_page = response.urljoin(next_links[-1])
            yield scrapy.Request(next_page, callback=self.parse)


Analysis: first grab the list of href values from every <a> tag under the pager div, then take the last one, which is the "next page" link:
response.xpath('/html/body/div[3]/div/div[6]/div[3]/a/@href').getall()[-1]
That href is relative, so response.urljoin() resolves it against the URL of the page currently being parsed before the follow-up request is yielded.
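response.urljoin(next_page) is essentially a shortcut for urllib.parse.urljoin(response.url, next_page) (for HTML responses Scrapy also honors a <base> tag if the page declares one). The snippet below shows what it computes; the relative paths are made-up examples, not ones captured from the site:

from urllib.parse import urljoin

# root-relative href: keeps the scheme and host, replaces the path
print(urljoin('https://www.danke.com/room/sh', '/room/sh/d2'))
# -> https://www.danke.com/room/sh/d2

# bare relative href: resolved against the base URL's directory
print(urljoin('https://www.danke.com/room/sh', 'd2'))
# -> https://www.danke.com/room/d2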

2. items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ZufangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    house = scrapy.Field()
    money = scrapy.Field()
    status = scrapy.Field()
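A ZufangItem behaves like a dict whose keys are locked to the three declared fields, which catches typos early. A quick illustration with invented sample values:

item = ZufangItem(house='ExampleRoom', money='2000', status='NearLine2')
print(dict(item))     # {'house': 'ExampleRoom', 'money': '2000', 'status': 'NearLine2'}
print(item['money'])  # fields read back dict-style: 2000
# item['area'] = '10'  # would raise KeyError: undeclared field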

3. pipelines
The scraped items are stored in a MongoDB database.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

conn = pymongo.MongoClient(host='localhost', port=27017)
mydb = conn.test        # database: test
myset = mydb.zufang     # collection: zufang


class ZufangPipeline(object):
    def process_item(self, item, spider):
        # build the document and write it to MongoDB
        informations = {
            'house': item['house'],
            'money': item['money'],
            'status': item['status'],
        }
        # insert() was removed from modern pymongo; insert_one() is the
        # supported equivalent
        myset.insert_one(informations)
        return item
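One design note: creating the MongoClient at import time works, but the more common Scrapy pattern is to open and close the connection in the pipeline's lifecycle hooks, so nothing connects until the spider actually starts. A minimal sketch of that alternative, using the same test database and zufang collection (the class name here is hypothetical):

import pymongo


class ZufangMongoPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.conn = pymongo.MongoClient(host='localhost', port=27017)
        self.myset = self.conn.test.zufang

    def close_spider(self, spider):
        # called once when the spider finishes
        self.conn.close()

    def process_item(self, item, spider):
        self.myset.insert_one(dict(item))
        return item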

4. settings

# -*- coding: utf-8 -*-

# Scrapy settings for zufang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zufang'

SPIDER_MODULES = ['zufang.spiders']
NEWSPIDER_MODULE = 'zufang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zufang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zufang.middlewares.ZufangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zufang.middlewares.ZufangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zufang.pipelines.ZufangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
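With ROBOTSTXT_OBEY = False and no download delay, the crawler requests pages as fast as it can. If the site starts rejecting requests, the commented-out throttling knobs above are the first things to enable; the values below are illustrative, not from the original project:

DOWNLOAD_DELAY = 1                   # wait 1 second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # halve the default per-domain concurrency
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency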

5. main

from scrapy import cmdline

# equivalent to running "scrapy crawl danke" from the project root
cmdline.execute(['scrapy', 'crawl', 'danke'])
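Run this file from the directory containing scrapy.cfg so the project settings are picked up. An equivalent programmatic launch, if you would rather avoid cmdline.execute (which exits the interpreter once the crawl finishes), is Scrapy's CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# loads settings.py through scrapy.cfg, then runs the spider named 'danke'
process = CrawlerProcess(get_project_settings())
process.crawl('danke')
process.start()  # blocks until the crawl finishes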
