Scrapy Crawler: Beike Real Estate Listings

Today I'm sharing code that uses the Scrapy framework to crawl Beike (贝壳) real estate listings and save them to MySQL, then analyzes the data with pyecharts to generate HTML charts, which can finally be combined into a complete data-analysis page. If you spot any mistakes, please point them out so we can all learn together. Without further ado, here is the code!

I. Scrapy framework code:

1. items.py

import scrapy


class BeikeItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    biaoti = scrapy.Field()    # listing title
    didian = scrapy.Field()    # location / community name
    huxing = scrapy.Field()    # layout (the houseInfo string)
    jiage = scrapy.Field()     # total price
    danjia = scrapy.Field()    # unit price
    biaoqian = scrapy.Field()  # tags
    mianji = scrapy.Field()    # area
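The field names are pinyin for title, location, layout, total price, unit price, tags, and area. A Scrapy Item behaves like a dict restricted to the declared fields; a quick standalone check (run from the project root, with hypothetical values) would look like this:

from beike.items import BeikeItem

item = BeikeItem()
item['biaoti'] = "南北通透 精装两居"  # hypothetical title
item['jiage'] = "650"                 # hypothetical total price
print(dict(item))  # -> {'biaoti': '南北通透 精装两居', 'jiage': '650'}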

2. middlewares.py (the default template generated by scrapy startproject, left unchanged in this project)

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class BeikeSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class BeikeDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

3. pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class BeikePipeline:
    def __init__(self):
        self.connect = pymysql.connect(host="localhost", user="root", passwd="1234", db="beike")
        self.cursor = self.connect.cursor()
        print("数据库连接成功")  # database connection established

    def process_item(self, item, spider):
        print("开始保存数据")  # saving item
        insql = "insert into beike_beijing(biaoti,didian,huxing,jiage,danjia,biaoqian,mianji) values (%s,%s,%s,%s,%s,%s,%s)"
        self.cursor.execute(insql, (item['biaoti'], item['didian'], item['huxing'], item['jiage'],
                                    item['danjia'], item['biaoqian'], item['mianji']))
        self.connect.commit()
        print("保存数据成功")  # item saved
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider() when the spider finishes; the original
        # code defined parse_close(), which Scrapy never invokes.
        self.cursor.close()
        self.connect.close()
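The pipeline assumes a beike database with a beike_beijing table already exists, but the post does not show the schema. Below is a minimal one-off setup sketch; the column types (plain text columns) and the id primary key are my own assumptions, so adjust them and the credentials to your environment:

import pymysql

# One-off setup script (assumed schema; not part of the original post).
conn = pymysql.connect(host="localhost", user="root", passwd="1234")
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS beike DEFAULT CHARACTER SET utf8mb4")
cur.execute("""
    CREATE TABLE IF NOT EXISTS beike.beike_beijing (
        id INT AUTO_INCREMENT PRIMARY KEY,
        biaoti   VARCHAR(255),
        didian   VARCHAR(255),
        huxing   VARCHAR(255),
        jiage    VARCHAR(64),
        danjia   VARCHAR(64),
        biaoqian VARCHAR(255),
        mianji   VARCHAR(255)
    ) DEFAULT CHARSET=utf8mb4
""")
conn.commit()
cur.close()
conn.close()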

4. settings.py

# Scrapy settings for beike project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'beike'

SPIDER_MODULES = ['beike.spiders']
NEWSPIDER_MODULE = 'beike.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'beike (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
    'Cookie': 'digv_extends=%7B%22utmTrackId%22%3A%2280418643%22%7D; lianjia_uuid=3d726c57-6d3f-4f6c-95a2-8b7abc9faeac; select_city=110000; lianjia_ssid=4473e3e6-43e7-4181-bb6a-ebb23ef4ec07; sajssdk_2015_cross_new_user=1; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%221769ce5aab93e7-0c28ca07c23265-59442e11-1327104-1769ce5aaba695%22%2C%22%24device_id%22%3A%221769ce5aab93e7-0c28ca07c23265-59442e11-1327104-1769ce5aaba695%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E4%BB%98%E8%B4%B9%E5%B9%BF%E5%91%8A%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fwww.baidu.com%2Fother.php%22%2C%22%24latest_referrer_host%22%3A%22www.baidu.com%22%2C%22%24latest_search_keyword%22%3A%22%E8%B4%9D%E5%A3%B3%22%2C%22%24latest_utm_source%22%3A%22baidu%22%2C%22%24latest_utm_medium%22%3A%22pinzhuan%22%2C%22%24latest_utm_campaign%22%3A%22wymoren%22%2C%22%24latest_utm_content%22%3A%22biaotimiaoshu%22%2C%22%24latest_utm_term%22%3A%22biaoti%22%7D%7D; crosSdkDT2019DeviceId=sgaxf7--ohhq7q-mq2s3hm3qk16atd-otg4hxhsr; _ga=GA1.2.999642397.1608950071; _gid=GA1.2.1417890696.1608950071; __xsptplusUT_788=1; __xsptplus788=788.1.1608950072.1608950072.1%234%7C%7C%7C%7C%7C%23%23duwbmR1LtYCy9OIqePHhHWS1htLXHyiz%23; Hm_lvt_9152f8221cb6243a53c83b956842be8a=1608950065,1608950073,1608950215; Hm_lpvt_9152f8221cb6243a53c83b956842be8a=1608950261; srcid=eyJ0Ijoie1wiZGF0YVwiOlwiM2M1NmJhYjliNzhmYzhhYzYzYWUyZGVjOWZmZWJjMjQwYzJhZmFlYmRjZTk4YWU2M2E3MDU4MjY3MDFlNDc5MThlNzkwMDI1NWM4NzNkYTA5YmQyZjBkZDFjZGIxZDg1YmJkMDlmODlmYzFkZGQxOTNiNGI3ZGU5MTU5ZmZlYWVlNWJlMjIzNTFkNzk2NDJkOTI4ZDYzYWEzNjkwYTVlNGU3MDRhMDcxYzQ5NDhmN2RiMzdjMGZiZGExZGY3NzdlZjYyMWZkOGMwMTAzMGNlZmUxNWZmYzAyMjlkODA0MTczZjE1MGRmOTFiYjZjZTgzNDEyY2JlOThjNDMwYzI1YjU2NGI2M2Q4ZTUxZjA5ZmM5MTgyMGVjZWY2OTA2ZDhkN2JiYWYxMzFkZDkxZjU3YjUxZWZhNTZjM2EyNzczMGI4ODgxNGFhNGViNjA5YjlhMjMxYmI0OWZiNzEyNzBhNFwiLFwia2V5X2lkXCI6XCIxXCIsXCJzaWduXCI6XCJjNDBjMDg1ZVwifSIsInIiOiJodHRwczovL2JqLmtlLmNvbS9lcnNob3VmYW5nL3BnMi8iLCJvcyI6IndlYiIsInYiOiIwLjEifQ==; login_ucid=2000000074667028; lianjia_token=2.0015e8780f68fe41f70445513e50d1f7b5; lianjia_token_secure=2.0015e8780f68fe41f70445513e50d1f7b5; security_ticket=WyjQtDuz1ImoP8myKzaHDGUewY7FuWIViEWxA+VfVYPS1kh3NigeIWicj7EQoTgPFJUTK6nPMHlbU+pvTlI4XRKfiyiRoeyEjIqFkcidofJneE75XwFlyXW1/eb85/AktQwvEFK2zqJHTb5owtGQiVxFGh2l/UFVDVJMjHsN4Ec='
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
#    'beike.middlewares.BeikeSpiderMiddleware': 543,
# }

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# DOWNLOADER_MIDDLEWARES = {
#    'beike.middlewares.BeikeDownloaderMiddleware': 543,
# }

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
# EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
# }

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'beike.pipelines.BeikePipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
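Two things worth checking in the settings above: the Cookie value in DEFAULT_REQUEST_HEADERS is tied to the author's own Beike login session and will almost certainly have expired, so replace it with one captured from your own browser; and AUTOTHROTTLE_ENABLED is still commented out, so the AUTOTHROTTLE_* values currently have no effect. If you want Scrapy to pace requests automatically, something like the following additions would be needed (my suggestion, not part of the original post):

# settings.py additions (sketch)
AUTOTHROTTLE_ENABLED = True   # actually turn AutoThrottle on so the values above apply
DOWNLOAD_DELAY = 1            # a small fixed delay as an extra safeguard (assumption)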

5. beikebeijing.py

import scrapy
from beike.items import BeikeItem
import time


class BeikebeijingSpider(scrapy.Spider):
    n_id = 2  # next page number to request (pg1 comes from start_urls)
    name = 'beikebeijing'
    allowed_domains = ['ke.com']
    start_urls = ['https://bj.ke.com/ershoufang/pg1/']

    def parse(self, response):
        # Note: a single item instance is reused (and mutated) for every listing on the page.
        item = BeikeItem()
        for bk in response.xpath("//div[@data-component='list']/ul/li"):
            biaoti = bk.xpath(".//div[@class='info clear']/div[@class='title']/a/text()").extract()
            # time.sleep(1)
            if len(biaoti) > 0:
                item['biaoti'] = biaoti[0]
            else:
                item['biaoti'] = ""
            # print(biaoti)

            didian = bk.xpath(".//div[@class='positionInfo']/a/text()").extract()
            time.sleep(1)
            if len(didian) > 0:
                item['didian'] = didian[0]
            else:
                item['didian'] = ""
            # print(didian)

            huxing = bk.xpath(".//div[@class='houseInfo']/text()").extract()
            # time.sleep(1)
            if len(huxing) > 0:
                item['huxing'] = huxing[0]
            else:
                item['huxing'] = ""
            # print(huxing)

            jiage = bk.xpath(".//div[@class='totalPrice']/span/text()").extract()
            # time.sleep(1)
            if len(jiage) > 0:
                item['jiage'] = jiage[0]
            else:
                item['jiage'] = ""
            # print(jiage)

            danjia = bk.xpath(".//div[@class='unitPrice']/span/text()").extract()
            # time.sleep(1)
            if len(danjia) > 0:
                # strip the "单价" prefix and "元/平米" suffix, keeping just the number
                dj = danjia[0].replace("单价", "")
                item['danjia'] = dj.replace("元/平米", "")
            else:
                item['danjia'] = ""
            # print(danjia)

            biaoqian = bk.xpath(".//div[@class='tag']/span/text()").extract()
            time.sleep(1)
            if len(biaoqian) > 0:
                item['biaoqian'] = biaoqian[0]
            else:
                item['biaoqian'] = ""
            # print(biaoqian)

            # mianji reads the same houseInfo text as huxing, so both fields end up
            # holding the same string (the area is just one part of that string).
            mianji = bk.xpath(".//div[@class='houseInfo']/text()").extract()
            # time.sleep(1)
            if len(mianji) > 0:
                item['mianji'] = mianji[0]
            else:
                item['mianji'] = ""
            # print(mianji)

            yield item

        # Queue the next results page (pg2, pg3, ...), stopping before pg100.
        n_url = "https://bj.ke.com/ershoufang/pg{}/".format(self.n_id)
        if self.n_id < 100:
            time.sleep(5)
            yield scrapy.Request(url=n_url, dont_filter=True, callback=self.parse)
            # print(self.page_id)
            self.n_id += 1
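The post doesn't show how the spider is launched. Normally you would run scrapy crawl beikebeijing from the project root; as an alternative, a small run script (my own sketch, not from the original post) can start it programmatically:

# run.py -- placed next to scrapy.cfg in the project root (sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load settings.py (pipeline, headers, etc.) and start the spider by name.
process = CrawlerProcess(get_project_settings())
process.crawl('beikebeijing')
process.start()  # blocks until the crawl finishes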

II. Data analysis code:

1. Pie chart (renders 价格饼图.html)

import pymysql
from pyecharts.charts import Bar, Pie, Line
import pyecharts.options as opts


def select_huxing_jiage():
    # Note: newer pymysql versions require keyword arguments, e.g. host=..., user=...
    conn = pymysql.connect('localhost', 'root', '1234', 'beike')
    cur = conn.cursor()
    select_sql = "SELECT huxing,SUM(jiage) FROM beike_beijing GROUP BY huxing;"
    cur.execute(select_sql)
    result1 = cur.fetchall()
    # print(result1)
    huxing = []
    jiage = []
    for i in result1:
        huxing.append(i[0])
        jiage.append(int(i[1]))
    return huxing, jiage


def bingtu():  # pie chart (饼图)
    huxing, jiage = select_huxing_jiage()
    c = (
        Pie()
        .add("", [list(z) for z in zip(huxing, jiage)], center=["50%", "60%"])
        .set_global_opts(title_opts=opts.TitleOpts(title="户型价格销量比例图"))
        .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
    )
    c.render("价格饼图.html")


if __name__ == '__main__':
    bingtu()
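One subtlety: if jiage is stored as a text column (as in the setup sketch after pipelines.py), SUM(jiage) relies on MySQL's implicit string-to-number conversion. An explicit cast makes the intent clearer; a hedged variant of the query (my suggestion, not in the original) is:

# Explicit cast instead of relying on implicit conversion
select_sql = (
    "SELECT huxing, SUM(CAST(jiage AS DECIMAL(12,2))) "
    "FROM beike_beijing GROUP BY huxing;"
)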

2. Bar chart (renders 价格柱状图.html)

import pymysql
from pyecharts.charts import Bar, Pie, Line
import pyecharts.options as opts


def select_huxing_jiage():
    conn = pymysql.connect('localhost', 'root', '1234', 'beike')
    cur = conn.cursor()
    select_sql = "SELECT huxing,SUM(jiage) FROM beike_beijing GROUP BY huxing;"
    cur.execute(select_sql)
    result1 = cur.fetchall()
    # print(result1)
    huxing = []
    jiage = []
    for i in result1:
        huxing.append(i[0])
        jiage.append(int(i[1]))
    return huxing, jiage


def zhuzhuangtu():  # bar chart (柱状图)
    huxing, jiage = select_huxing_jiage()
    bar = Bar(init_opts=opts.InitOpts(width='1000px', height='600px'))
    bar.add_xaxis(huxing)
    bar.add_yaxis("销量", jiage)
    bar.set_global_opts(title_opts=opts.TitleOpts("各户型价格统计"))
    bar.set_series_opts(label_opts=opts.LabelOpts(position="top"))
    bar.render("价格柱状图.html")


if __name__ == '__main__':
    zhuzhuangtu()
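The intro mentions combining the generated HTML charts into one complete analysis page, but that step is not shown in the post. With pyecharts this is usually done with the Page container; here is a minimal sketch with placeholder data (the chart titles, values, and the output filename 数据分析.html are my own, not from the original post; in practice you would reuse select_huxing_jiage() from the scripts above):

from pyecharts.charts import Page, Pie, Bar
import pyecharts.options as opts

# Placeholder data standing in for the database query results.
huxing = ["1室1厅", "2室1厅", "3室2厅"]
jiage = [300, 1200, 2500]

pie = (
    Pie()
    .add("", [list(z) for z in zip(huxing, jiage)])
    .set_global_opts(title_opts=opts.TitleOpts(title="户型价格比例"))
)
bar = (
    Bar()
    .add_xaxis(huxing)
    .add_yaxis("价格", jiage)
    .set_global_opts(title_opts=opts.TitleOpts(title="各户型价格统计"))
)

# Render both charts into a single HTML page.
page = Page(layout=Page.SimplePageLayout)
page.add(pie, bar)
page.render("数据分析.html")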

3. Word cloud (renders 贝壳词云.png)

from pyecharts.charts import WordCloud  # imported but not used; the wordcloud package is used instead
import jieba  # imported but not used here (see the segmentation sketch below)
import pymysql
import wordcloud


def ciyun_beike():
    conn = pymysql.connect('localhost', 'root', '1234', 'beike')
    cur = conn.cursor()
    select_dangdang_sql = "SELECT biaoti FROM beike_beijing;"
    cur.execute(select_dangdang_sql)
    beike = cur.fetchall()
    beike_list = []
    for i in beike:
        beike_list.append(i[0])
    dd_str = " ".join(beike_list)
    # print(lj)
    lj = wordcloud.WordCloud(font_path="词云字体.ttf", width=1000, height=1000)
    lj.generate(dd_str)
    lj.to_file("贝壳词云.png")


if __name__ == '__main__':
    ciyun_beike()
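Because the titles are Chinese, wordcloud's default whitespace tokenization mostly yields whole phrases. Segmenting with jieba first (which the script imports but never uses) usually produces a much better cloud. A minimal sketch, assuming the same table, credentials, and font file; the output name 贝壳词云_jieba.png is my own:

import jieba
import pymysql
import wordcloud


def ciyun_beike_jieba():
    conn = pymysql.connect('localhost', 'root', '1234', 'beike')
    cur = conn.cursor()
    cur.execute("SELECT biaoti FROM beike_beijing;")
    titles = [row[0] for row in cur.fetchall()]
    cur.close()
    conn.close()
    # Cut the titles into words and re-join with spaces so WordCloud
    # counts word frequencies rather than whole titles.
    seg_text = " ".join(jieba.cut(" ".join(titles)))
    wc = wordcloud.WordCloud(font_path="词云字体.ttf", width=1000, height=1000)
    wc.generate(seg_text)
    wc.to_file("贝壳词云_jieba.png")


if __name__ == '__main__':
    ciyun_beike_jieba()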

There are several other analysis charts that I won't include here, as there are simply too many...

Results (chart screenshots omitted):






If you want the complete source code, you can download it from this link: https://download.csdn.net/download/liuxueyingwxnl/14951927

For more examples, follow the author's WeChat official account: PyDream. Everyone is welcome to exchange ideas and learn together!
