I. Project structure
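Judging from the modules listed below, the project follows the standard layout generated by scrapy startproject zufang, plus a main.py launcher at the top level (the exact file names for the spider and the launcher are inferred, not confirmed by the source):

zufang/
├── scrapy.cfg
├── main.py
└── zufang/
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        ├── __init__.py
        └── dankespider.py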

II. Module breakdown

1. dankespider

import scrapy

from ..items import ZufangItem

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
}


class Danke_Spider(scrapy.Spider):
    name = 'danke'

    def start_requests(self):
        url = 'https://www.danke.com/room/sh'
        yield scrapy.Request(url=url, headers=headers)

    def parse(self, response):
        # three parallel lists: listing title, price, and description
        house = response.xpath('//div[@class="r_lbx_cena"]/a/text()').getall()
        money = response.xpath('//div[@class="r_lbx_moneya"]/span/text()').getall()
        status = response.xpath('//div[@class="r_lbx_cenb"]').xpath('string(.)').getall()

        for i in range(len(house)):
            info = ZufangItem()
            # clean the data: strip the padding spaces
            house[i] = house[i].replace(' ', '')
            status[i] = status[i].replace(' ', '')
            info['house'] = house[i]
            info['money'] = money[i]
            info['status'] = status[i]
            yield info

        # pagination: the last <a> in the pager block is the "next page" link;
        # getall() returns a list, so test for emptiness rather than None
        next_links = response.xpath('/html/body/div[3]/div/div[6]/div[3]/a/@href').getall()
        if next_links:
            next_page = response.urljoin(next_links[-1])
            yield scrapy.Request(next_page, callback=self.parse)


Analysis: first grab the list of href values from every <a> tag under the pager div, then take the last one, which is the "next page" link:
response.xpath('/html/body/div[3]/div/div[6]/div[3]/a/@href').getall()[-1]
That href is relative, so response.urljoin() resolves it against the URL of the page currently being parsed before the follow-up request is yielded.
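response.urljoin(next_page) is essentially a shortcut for urllib.parse.urljoin(response.url, next_page) (for HTML responses Scrapy also honors a <base> tag if the page declares one). The snippet below shows what it computes; the relative paths are made-up examples, not ones captured from the site:

from urllib.parse import urljoin

# root-relative href: keeps the scheme and host, replaces the path
print(urljoin('https://www.danke.com/room/sh', '/room/sh/d2'))
# -> https://www.danke.com/room/sh/d2

# bare relative href: resolved against the base URL's directory
print(urljoin('https://www.danke.com/room/sh', 'd2'))
# -> https://www.danke.com/room/d2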

2. items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class ZufangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    house = scrapy.Field()
    money = scrapy.Field()
    status = scrapy.Field()
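A ZufangItem behaves like a dict whose keys are locked to the three declared fields, which catches typos early. A quick illustration with invented sample values:

item = ZufangItem(house='ExampleRoom', money='2000', status='NearLine2')
print(dict(item))     # {'house': 'ExampleRoom', 'money': '2000', 'status': 'NearLine2'}
print(item['money'])  # fields read back dict-style: 2000
# item['area'] = '10'  # would raise KeyError: undeclared field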

3. pipelines
The scraped items are stored in a MongoDB database.

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

conn = pymongo.MongoClient(host='localhost', port=27017)
mydb = conn.test        # database: test
myset = mydb.zufang     # collection: zufang


class ZufangPipeline(object):
    def process_item(self, item, spider):
        # build the document and write it to MongoDB
        informations = {
            'house': item['house'],
            'money': item['money'],
            'status': item['status'],
        }
        # insert() was removed from modern pymongo; insert_one() is the
        # supported equivalent
        myset.insert_one(informations)
        return item
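One design note: creating the MongoClient at import time works, but the more common Scrapy pattern is to open and close the connection in the pipeline's lifecycle hooks, so nothing connects until the spider actually starts. A minimal sketch of that alternative, using the same test database and zufang collection (the class name here is hypothetical):

import pymongo


class ZufangMongoPipeline(object):
    def open_spider(self, spider):
        # called once when the spider starts
        self.conn = pymongo.MongoClient(host='localhost', port=27017)
        self.myset = self.conn.test.zufang

    def close_spider(self, spider):
        # called once when the spider finishes
        self.conn.close()

    def process_item(self, item, spider):
        self.myset.insert_one(dict(item))
        return item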

4. settings

# -*- coding: utf-8 -*-

# Scrapy settings for zufang project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'zufang'

SPIDER_MODULES = ['zufang.spiders']
NEWSPIDER_MODULE = 'zufang.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'zufang (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'zufang.middlewares.ZufangSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'zufang.middlewares.ZufangDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'zufang.pipelines.ZufangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
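With ROBOTSTXT_OBEY = False and no download delay, the crawler requests pages as fast as it can. If the site starts rejecting requests, the commented-out throttling knobs above are the first things to enable; the values below are illustrative, not from the original project:

DOWNLOAD_DELAY = 1                   # wait 1 second between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # halve the default per-domain concurrency
AUTOTHROTTLE_ENABLED = True          # let Scrapy adapt the delay to server latency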

5. main

from scrapy import cmdline

# equivalent to running "scrapy crawl danke" from the project root
cmdline.execute(['scrapy', 'crawl', 'danke'])
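Run this file from the directory containing scrapy.cfg so the project settings are picked up. An equivalent programmatic launch, if you would rather avoid cmdline.execute (which exits the interpreter once the crawl finishes), is Scrapy's CrawlerProcess:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# loads settings.py through scrapy.cfg, then runs the spider named 'danke'
process = CrawlerProcess(get_project_settings())
process.crawl('danke')
process.start()  # blocks until the crawl finishes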
