1. Target page: https://dg.lianjia.com/ershoufang/
2. Information to scrape: ① title ② total price ③ community name ④ district name ⑤ detail string ⑥ floor area taken from the detail string
3. Storage: MongoDB
The link above lists second-hand houses in Dongguan. To scrape another city, just change the URL, since the page structure is the same (see the sketch below):
https://bj.lianjia.com/ershoufang/  second-hand houses in Beijing
https://gz.lianjia.com/ershoufang/  second-hand houses in Guangzhou
https://gz.lianjia.com/ershoufang/tianhe  second-hand houses in Tianhe district, Guangzhou
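For illustration, a small helper like the one below could build those listing URLs for any city/district/page combination. This is only a sketch; the build_listing_url name is made up here and is not part of the original project, and it simply follows the "https://<city>.lianjia.com/ershoufang/[district/]pg<N>" pattern shown above.

# Hypothetical helper (not part of the project): build a Lianjia listing URL
# from a city subdomain, an optional district slug, and a page number.
def build_listing_url(city, page=1, district=""):
    url = "https://{}.lianjia.com/ershoufang/".format(city)
    if district:
        url += district.strip("/") + "/"
    return url + "pg{}".format(page)

print(build_listing_url("gz", 1, "tianhe"))  # https://gz.lianjia.com/ershoufang/tianhe/pg1
print(build_listing_url("bj", 3))            # https://bj.lianjia.com/ershoufang/pg3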

The full code follows:

ershoufang_spider.py:

import scrapy
from lianjia_dongguan.items import LianjiaDongguanItem  # the Item class defined in items.py (the spider yields plain dicts, so it is optional)


class lianjiadongguanSpider(scrapy.Spider):
    name = "ershoufang"  # spider name, used when launching the crawl
    global start_page
    start_page = 1
    start_urls = ["https://gz.lianjia.com/ershoufang/haizhu/pg" + str(start_page)]

    def parse(self, response):
        for item in response.xpath('//div[@class="info clear"]'):
            yield {
                "title": item.xpath('.//div[@class="title"]/a/text()').extract_first().strip(),
                "Community": item.xpath('.//div[@class="positionInfo"]/a[1]/text()').extract_first(),
                "district": item.xpath('.//div[@class="positionInfo"]/a[2]/text()').extract_first(),
                "price": item.xpath('.//div[@class="totalPrice"]/span/text()').extract_first().strip(),
                "area": item.xpath('.//div[@class="houseInfo"]/text()').re(r"\d室\d厅 \| (.+)平米")[0],
                "info": item.xpath('.//div[@class="houseInfo"]/text()').extract_first().replace("平米", "㎡").strip(),
            }
        # Schedule the next 15 result pages; dont_filter=True keeps the duplicate
        # filter from dropping pages that get requested more than once.
        i = 1
        while i <= 15:
            j = i + start_page
            i = i + 1
            next_url = "https://gz.lianjia.com/ershoufang/haizhu/pg" + str(j)
            yield scrapy.Request(next_url, dont_filter=True, callback=self.parse)
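As a quick sanity check on the "area" extraction, the same regular expression can be run against a sample houseInfo string. The sample text below is made up for illustration only; the real listing text may differ slightly.

import re

# A made-up houseInfo string in the shape the XPath above returns (illustrative only).
sample = "3室2厅 | 89.5平米 | 南 | 精装 | 中楼层(共32层)"
match = re.search(r"\d室\d厅 \| (.+)平米", sample)
print(match.group(1) if match else None)  # prints: 89.5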

items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class LianjiaDongguanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
    location = scrapy.Field()
    price = scrapy.Field()
    Community = scrapy.Field()

middlewares.py:

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


class LianjiaDongguanSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class LianjiaDongguanDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


from fake_useragent import UserAgent


class UserAgentMiddleware(object):
    def __init__(self, crawler):
        super().__init__()
        self.ua = UserAgent()

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        # Writing into request.headers like this is one way to set headers.
        request.headers['User-Agent'] = self.ua.random
        # request.headers['Cookie'] = {"ws": "wer"}  # a Cookie header can be written the same way
        print('User-Agent:' + str(request.headers.getlist('User-Agent')))  # returns a list; this way of reading is robust and always works
        print('User-Agent:' + str(request.headers['User-Agent']))          # returns a str; only works if the header was written as above, otherwise it raises even though a UA exists

    def process_response(self, request, response, spider):
        print("Request Cookie: " + str(request.headers.getlist('Cookie')))
        print("Response Set-Cookie: " + str(response.headers.getlist('Set-Cookie')))
        print("Response headers: " + str(response.headers))
        print("Request headers: " + str(request.headers))
        return response

pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class LianjiaPipeline(object):
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        db_name = settings["MONGODB_DBNAME"]
        client = pymongo.MongoClient(host=host, port=port)  # connect to MongoDB
        db = client[db_name]                                # select the database
        self.post = db[settings["MONGODB_DOCNAME"]]         # self.post is the collection

    def process_item(self, item, spider):
        zufang = dict(item)           # convert the item to a plain dict
        self.post.insert_one(zufang)  # insert into the collection (insert() is gone in pymongo 4, so use insert_one)
        return item

settings.py:

# -*- coding: utf-8 -*-
# Scrapy settings for lianjia_dongguan project
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'lianjia_dongguan'

SPIDER_MODULES = ['lianjia_dongguan.spiders']
NEWSPIDER_MODULE = 'lianjia_dongguan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'lianjia_dongguan (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
RANDOMIZE_DOWNLOAD_DELAY = True  # the actual Scrapy setting name; randomizes the delay between 0.5x and 1.5x of DOWNLOAD_DELAY
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# When COOKIES_ENABLED is left commented out, Scrapy does not pick up the cookie defined here by default
# When COOKIES_ENABLED is False, Scrapy sends the Cookie set in these settings (DEFAULT_REQUEST_HEADERS)
# When COOKIES_ENABLED is True, Scrapy ignores the Cookie in these settings and manages its own/custom cookies
COOKIES_ENABLED = False
COOKIES_DEBUG = False     # whether to log Set-Cookie headers

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Cookie': {'TY_SESSION_ID': '84ad2c08-9ea5-4193-927a-221fd9bae52b', 'lianjia_uuid': '4fc8b1ad-d17f-45c5-85a3-6e0072d35e1e', 'UM_distinctid': '170a9b9075c4d0-0efd9a9b0cf32d-34564a7c-e1000-170a9b9075d52f', '_jzqc': '1', '_jzqy': '1.1583395441.1583395441.1.jzqsr', '_jzqckmp': '1', '_smt_uid': '5e60b270.14483188', 'sajssdk_2015_cross_new_user': '1', '_ga': 'GA1.2.887972367.1583395443', '_gid': 'GA1.2.365215785.1583395443', 'select_city': '441900', '_qzjc': '1', 'Hm_lvt_9152f8221cb6243a53c83b956842be8a': '1583395559', 'sensorsdata2015jssdkcross': '%7B%22distinct_id%22%3A%22170a9b909b4715-0769015f66a016-34564a7c-921600-170a9b909b553%22%2C%22%24device_id%22%3A%22170a9b909b4715-0769015f66a016-34564a7c-921600-170a9b909b553%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D', 'CNZZDATA1254525948': '426683749-1583390573-https%253A%252F%252Fwww.baidu.com%252F%7C1583401373', 'CNZZDATA1255604082': '143512665-1583390710-https%253A%252F%252Fwww.baidu.com%252F%7C1583401510', 'lianjia_ssid': 'd4d26773-ce0d-8cfe-ab64-d39d54960c3c', 'CNZZDATA1255633284': '201504093-1583390793-https%253A%252F%252Fwww.baidu.com%252F%7C1583401593', '_jzqa': '1.637230317292238100.1583395441.1583401552.1583405268.4', 'Hm_lpvt_9152f8221cb6243a53c83b956842be8a': '1583405272', '_qzja': '1.884599034.1583395479079.1583401552386.1583405267952.1583405267952.1583405271972.0.0.0.13.4', '_qzjb': '1.1583405267952.2.0.0.0', '_qzjto': '13.4.0', '_jzqb': '1.2.10.1583405268.1'},
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'lianjia_dongguan.middlewares.LianjiaDongguanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'lianjia_dongguan.middlewares.LianjiaDongguanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'lianjia_dongguan.pipelines.LianjiaDongguanPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MongoDB connection settings used by the pipeline
MONGODB_HOST = "127.0.0.1"
MONGODB_PORT = 27017
MONGODB_DBNAME = "lianjia"
MONGODB_DOCNAME = "ershoufang"

ITEM_PIPELINES = {"lianjia_dongguan.pipelines.LianjiaPipeline": 300}

# Whether to retry failed requests
RETRY_ENABLED = True
# Retry many times since proxies often fail (total number of retries, not per IP)
RETRY_TIMES = 1000
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 404, 403, 408, 301, 302]

DOWNLOADER_MIDDLEWARES = {
    'lianjia_dongguan.middlewares.UserAgentMiddleware': 200,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# Path to the file holding the proxy IP list
PROXY_LIST = 'C:/Users/wzq1643/Desktop/HTTPSip.txt'

# Proxy mode
# 0 = Every request has a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0
import random
# If you use mode 2, uncomment the line below:
CUSTOM_PROXY = "http://49.81.190.209"

HTTPERROR_ALLOWED_CODES = [301, 302]
MEDIA_ALLOW_REDIRECTS = True

That is all of the Scrapy code. I wrote it last year and have personally verified that it still works:

(base) PS C:\Users\wzq1643\scrapy\lianjia_dongguan\lianjia_dongguan\spiders> scrapy runspider ershoufang_spider.py

A few notes on the anti-scraping tricks used here:
Random User-Agent: the Python library fake_useragent generates random UA strings automatically; see the UserAgentMiddleware in middlewares.py.
Reference: https://blog.csdn.net/cnmnui/article/details/99852347
IP pool: with limited resources I just grabbed free proxies from the web (quality may be poor), saved them in a txt file, and wired them in through settings.py.
Reference: https://www.jianshu.com/p/c656ad21c42f
Cookie pool: there is no real cookie pool here; I just collected a few cookies and put them into settings.py (a small rotation sketch follows this list).
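If you ever want something closer to a real cookie pool, a tiny downloader middleware along the following lines could rotate a few saved cookie strings per request. This is only a sketch under the assumption that COOKIES_ENABLED stays False (so the raw Cookie header passes through untouched); the COOKIE_POOL values and the CookiePoolMiddleware name are made up for illustration and are not part of the original project.

import random

# Made-up cookie strings captured from different browser sessions (placeholders).
COOKIE_POOL = [
    "lianjia_uuid=aaaa-1111; select_city=441900",
    "lianjia_uuid=bbbb-2222; select_city=441900",
]


class CookiePoolMiddleware(object):
    """Attach a randomly chosen raw Cookie header to every outgoing request."""

    def process_request(self, request, spider):
        # Only meaningful while COOKIES_ENABLED = False; otherwise Scrapy's own
        # cookie middleware manages the Cookie header instead.
        request.headers['Cookie'] = random.choice(COOKIE_POOL)

It would be enabled the same way as UserAgentMiddleware, for example by adding 'lianjia_dongguan.middlewares.CookiePoolMiddleware': 210 to DOWNLOADER_MIDDLEWARES.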

Finally, here is a small script that uses pymongo to export the scraped data from MongoDB to an Excel file:

import pandas as pd
import pymongo
# 1. Connect to MongoDB
client = pymongo.MongoClient(host='127.0.0.1', port=27017)
# 2. Select the database and collection (created automatically if they do not exist)
db = client['lianjia']          # or db = client.lianjia
collection = db.ershoufang      # or collection = db["ershoufang"]
records = []
for doc in collection.find():
    records.append(dict(doc))
print(records)
df = pd.DataFrame(records)
print(df)
# Note: writing .xls relies on the old xlwt engine; with recent pandas versions save as .xlsx instead.
df.to_excel("C:/Users/wzq1643/Desktop/gz_ershoufang.xls")
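One optional tweak, if the MongoDB _id column is unwanted in the spreadsheet: exclude it with a projection when querying. The snippet below is a sketch of that variation, not part of the original script (it also writes .xlsx, which recent pandas versions support out of the box).

# Same export, but project away MongoDB's _id field so it does not become
# an extra ObjectId column in the Excel sheet.
df = pd.DataFrame(list(collection.find({}, {"_id": 0})))
df.to_excel("C:/Users/wzq1643/Desktop/gz_ershoufang.xlsx", index=False)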

Opened in Excel, the export looks like this:
